Q Learning

policy_gradient → Policy gradient methods are a type of reinforcement_learning technique that optimizes a parametrized policy with respect to the expected return (long-term cumulative reward) by gradient_descent.

Unlike policy_gradient methods, which attempt to learn functions that directly map an observation to an action, q_learning attempts to learn the value of being in a given state and taking a specific action there.

openai_gym → Experiment with a learning agent in an array of provided games. The frozen_lake env consists of a 4x4 grid of blocks. Each block is one of:

start block - reward 0
goal block - reward 1
safe frozen block
death hole

The agent chooses which way to move, but the wind can also move the agent, so perfect performance is impossible.
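
The layout above can be sketched as a small lookup. This is only an illustration: it assumes the default 4x4 map that ships with Gym's FrozenLake (S = start, F = frozen, H = hole, G = goal), row-major state numbering, and a reward of 0 everywhere except the goal. The helper names tile_at, reward_for and is_terminal are hypothetical.

```python
# Assumed default 4x4 FrozenLake map: S=start, F=frozen (safe), H=hole, G=goal.
LAKE_MAP = ["SFFF",
            "FHFH",
            "FFFH",
            "HFFG"]
N_COLS = len(LAKE_MAP[0])


def tile_at(state):
    """Return the tile letter for a state index (row-major numbering assumed)."""
    row, col = divmod(state, N_COLS)
    return LAKE_MAP[row][col]


def reward_for(state):
    """Reward is 1 on the goal block and 0 everywhere else (assumption)."""
    return 1.0 if tile_at(state) == "G" else 0.0


def is_terminal(state):
    """An episode ends when the agent reaches the goal or falls into a hole."""
    return tile_at(state) in "GH"
```

With this numbering, state 0 is the start block in the top-left corner and state 15 is the goal in the bottom-right corner.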

q_learning_table for the frozen_lake problem: there are 16 possible states (one for each block) and 4 possible actions (the four directions of movement), giving us a 16x4 table of q_values. The q_learning_table starts as all zeros and is updated as we observe the rewards obtained for various actions.

The q_learning_table is updated with the bellman_equation, which states that the expected long-term reward for a given action is equal to the immediate reward from the current action combined with the expected reward from the best future action taken at the following state. In this way, we reuse our own q_learning_table when estimating how to update our table for future actions.

bellman_equation → Q(s,a) = r + γ · max_a’ Q(s’,a’)

r → immediate reward
γ → discount factor applied to future reward
s’ → next state
a’ → action taken in the next state
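
A worked pair of updates may make the equation more concrete. This is a minimal sketch, not a definitive implementation: the function name q_update and the two transitions are hypothetical, and the learning rate (0.8) and discount (0.95) are simply the values used in the frozen_lake script in these notes. The learning rate blends the old estimate with the Bellman target instead of overwriting it.

```python
import numpy as np


def q_update(Q, s, a, r, s_next, lr=0.8, gamma=0.95):
    """Move Q[s, a] toward the Bellman target r + gamma * max_a' Q[s_next, a']."""
    target = r + gamma * np.max(Q[s_next, :])
    Q[s, a] += lr * (target - Q[s, a])
    return Q[s, a]


# 16 states x 4 actions, all zeros, as in the frozen_lake table.
Q = np.zeros((16, 4))

# Hypothetical transition: from state 14, action 2 reaches the goal (state 15) with reward 1.
first = q_update(Q, s=14, a=2, r=1.0, s_next=15)

# Hypothetical transition: from state 13, action 2 reaches state 14 with no reward.
second = q_update(Q, s=13, a=2, r=0.0, s_next=14)
```

The first update writes 0.8 into Q[14, 2]. The second update earns no reward of its own but still moves Q[13, 2] to 0.608, because the target includes the discounted best value of the next state. This is how reward information spreads backward through the table.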

q_learning_table implementation for frozen_lake

import gym
import numpy as np
 
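### Load the environment
# Note: this script targets the classic Gym API, where reset() returns the state
# and step() returns (state, reward, done, info); newer Gym/Gymnasium releases differ.
env = gym.make('FrozenLake-v0')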
 
### Implement Q-Table learning algorithm
#Initialize table with all zeros
Q = np.zeros([env.observation_space.n,env.action_space.n])
# Set learning parameters
lr = .8
y = .95
num_episodes = 2000
#create lists to contain total rewards and steps per episode
#jList = []
rList = []
for i in range(num_episodes):
    #Reset environment and get first new observation
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    #The Q-Table learning algorithm
    while j < 99:
        j+=1
        #Choose an action by greedily (with noise) picking from Q table
        a = np.argmax(Q[s,:] + np.random.randn(1,env.action_space.n)*(1./(i+1)))
        #Get new state and reward from environment
        s1,r,d,_ = env.step(a)
        #Update Q-Table with new knowledge
        Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a])
        rAll += r
        s = s1
        if d:
            break
    #jList.append(j)
    rList.append(rAll)
 
print("Score over time: " + str(sum(rList) / num_episodes))

print("Final Q-Table Values")
print(Q)

q_learning with Neural Networks

Tables can’t hold the data for problems of real-world scale, so we need Neural Networks.

By acting as a function_approximator, a neural network can take any number of possible states that can be represented as a vector and learn to map them to q_values.
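
As a rough illustration of the idea, the table can be replaced by a single linear layer: a one-hot state vector goes in and one Q-value per action comes out. The sketch below uses plain NumPy rather than any deep learning library. The names one_hot, q_values and train_step, the weight initialisation and the learning rate are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 16, 4

# Single weight matrix acting as the "network": one-hot state in, 4 Q-values out.
W = rng.uniform(0.0, 0.01, size=(N_STATES, N_ACTIONS))


def one_hot(state):
    """Encode a state index as a 1 x N_STATES row vector."""
    x = np.zeros((1, N_STATES))
    x[0, state] = 1.0
    return x


def q_values(state):
    """Forward pass: estimated Q-values for every action in this state."""
    return one_hot(state) @ W


def train_step(state, action, target, lr=0.1):
    """One gradient-descent step on the squared error (target - Q(s, a))^2."""
    x = one_hot(state)
    error = target - (x @ W)[0, action]
    # The gradient of the loss w.r.t. W is -2 * error * x in the chosen action's column.
    W[:, action] += lr * 2.0 * error * x[0]
    return error
```

With one-hot inputs this behaves like the table (each state owns one row of W). The benefit appears when states are richer vectors, because the same weights then generalise across similar states.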

Reference: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0

While Neural Networks allow for greater flexibility, they do so at the cost of stability when it comes to Q Learning. experience_replay and freezing_target_networks allow for greater performance and more robust learning.
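
The two stabilising tricks can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the ReplayBuffer class, the linear weights online_W and target_W, and the helpers td_target and sync_target are hypothetical names. Experience replay stores past transitions and trains on random samples of them. A frozen target network is a periodically refreshed copy of the weights used only to compute the Bellman targets.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop off first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)


# Hypothetical linear Q-function: 16 one-hot states x 4 actions.
online_W = np.random.default_rng(0).uniform(0.0, 0.01, size=(16, 4))
target_W = online_W.copy()  # frozen copy, used only to compute targets


def td_target(reward, next_state, done, gamma=0.95):
    """Bellman target computed from the frozen target weights."""
    if done:
        return reward
    return reward + gamma * np.max(target_W[next_state])


def sync_target():
    """Copy the online weights into the frozen target weights (every N steps)."""
    target_W[:] = online_W
```

Between calls to sync_target, changes to online_W do not affect the targets, which keeps the values being chased from shifting on every update.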

two_arm_bandit problem: The goal is to discover the machine with the best payout, and maximize the returned reward by always choosing it.
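
One simple way to approach this problem is an epsilon-greedy strategy, sketched below as an illustration only (other strategies exist, and this is not necessarily the one used in the reference). The payout probabilities, the function name run_bandit and its parameters are hypothetical. The agent keeps a running average reward per machine, usually pulls the machine that looks best, and occasionally tries a random one.

```python
import random


def run_bandit(payout_probs, pulls=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy play: mostly pull the best-looking arm, sometimes explore."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    counts = [0] * n_arms        # how often each arm was pulled
    estimates = [0.0] * n_arms   # running average reward per arm
    total_reward = 0.0
    for _ in range(pulls):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])  # exploit
        reward = 1.0 if rng.random() < payout_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean: new_avg = old_avg + (reward - old_avg) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


# Hypothetical machines: arm 0 pays 20% of the time, arm 1 pays 80% of the time.
estimates, counts, total_reward = run_bandit([0.2, 0.8])
best_arm = max(range(len(estimates)), key=lambda i: estimates[i])
```

Unlike frozen_lake, there is no state here: the only thing to learn is which action pays best.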

task Ques. So Neural Networks will get us the result we want, but will the machine know why it took every action it did?

learning_a_policy → learning which rewards we get for each of the possible actions, and ensuring we choose the optimal ones
