policy_gradient → Policy gradient methods are a type of reinforcement_learning technique that optimizes parametrized policies with respect to the expected return (long-term cumulative reward) by gradient_descent (applied to the negative return, which amounts to gradient ascent on the return).

Unlike policy_gradient methods, which attempt to learn functions that directly map an observation to an action, q_learning attempts to learn the value of being in a given state and taking a specific action there.

openai_gym → Experiment with a learning agent in an array of provided games. The frozen_lake env consists of a 4x4 grid of blocks. Each block is one of:

  • start block - reward 0
  • goal block - reward 1
  • safe frozen block
  • death hole

The agent can move, but the wind can also move the agent, so perfect performance is impossible.
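
A small sketch of how the grid can be flattened into state ids. It assumes the default 4x4 map that ships with gym's FrozenLake (S = start, F = frozen, H = hole, G = goal); the helper names `state_index` and `tile_at` are made up for illustration.

```python
# Assumed default 4x4 FrozenLake layout: S=start, F=frozen, H=hole, G=goal.
FROZEN_LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

def state_index(row, col, ncols=4):
    """Flatten a (row, col) grid position into a single state id."""
    return row * ncols + col

def tile_at(state, ncols=4):
    """Return the tile letter for a flattened state id."""
    row, col = divmod(state, ncols)
    return FROZEN_LAKE_MAP[row][col]

# Collect the state ids of all holes and locate the goal.
holes = [s for s in range(16) if tile_at(s) == "H"]
goal = state_index(3, 3)

print(holes)  # [5, 7, 11, 12]
print(goal)   # 15
```

Numbering the blocks row by row is what gives the 16 states used in the rest of these notes.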

q_learning_table for the frozen_lake problem: 16 possible states (one for each block) and 4 possible actions (the four directions of movement), giving us a 16x4 table of q_values. The q_learning_table initially holds all zeros and is then updated after observing the reward for various actions.
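
A minimal sketch of that table in NumPy. The action ordering (left, down, right, up) is an assumption about how gym numbers FrozenLake's actions, and the value written into the table is invented just to show a lookup.

```python
import numpy as np

N_STATES, N_ACTIONS = 16, 4
ACTIONS = ["left", "down", "right", "up"]  # assumed action ordering

# One row per state, one column per action, all zeros to start.
Q = np.zeros((N_STATES, N_ACTIONS))
print(Q.shape)  # (16, 4)

# Pretend we have learned that moving right from state 14 is valuable.
Q[14, 2] = 0.8

# The greedy action for a state is the column with the largest q_value.
best = int(np.argmax(Q[14]))
print(ACTIONS[best])  # right
```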

The q_learning_table is updated using the bellman_equation, which states that the expected long-term reward for a given action is equal to the immediate reward from the current action combined with the expected reward from the best future action taken at the following state. In this way, we reuse our own q_learning_table when estimating how to update our table for future actions.

bellman_equation → Q(s,a) = r + γ · max(Q(s’,a’))
  r → immediate reward
  γ → discount factor applied to the maximum expected future reward
  s’ → next state
  a’ → action available in the next state
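
A worked example of one update, as a sketch. It uses the learning-rate form found in the implementation further down (the table moves a fraction `lr` of the way toward the Bellman target). The function name `q_update` and the specific transitions are assumptions made for illustration.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, lr=0.8, gamma=0.95):
    """Move Q[s, a] toward the Bellman target r + gamma * max(Q[s_next])."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += lr * (target - Q[s, a])
    return Q[s, a]

Q = np.zeros((16, 4))

# Stepping from state 14 into the goal (state 15) yields reward 1:
# Q[14, 2] = 0 + 0.8 * (1 + 0.95 * 0 - 0) = 0.8
first = q_update(Q, s=14, a=2, r=1.0, s_next=15)

# One block earlier there is no immediate reward, but the table now knows
# that state 14 is valuable: Q[13, 2] = 0.8 * (0 + 0.95 * 0.8) = 0.608
second = q_update(Q, s=13, a=2, r=0.0, s_next=14)

print(first, second)
```

The second update is the "reuse our own table" step: reward information propagates backward from the goal one state at a time.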

q_learning_table implementation for frozen_lake:

import gym
import numpy as np
 
### Load the environment
env = gym.make('FrozenLake-v0')
 
### Implement Q-Table learning algorithm
# Initialize table with all zeros (one row per state, one column per action)
Q = np.zeros([env.observation_space.n, env.action_space.n])
# Set learning parameters
lr = .8   # learning rate
y = .95   # discount factor (γ in the Bellman equation)
num_episodes = 2000
# Create lists to contain total rewards and steps per episode
#jList = []
rList = []
for i in range(num_episodes):
    #Reset environment and get first new observation
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    # The Q-Table learning algorithm (at most 99 steps per episode)
    while j < 99:
        j += 1
        # Choose an action greedily from the Q table, with noise that decays over episodes
        a = np.argmax(Q[s,:] + np.random.randn(1,env.action_space.n)*(1./(i+1)))
        # Get new state and reward from environment
        # (classic gym API: step() returns observation, reward, done, info)
        s1,r,d,_ = env.step(a)
        # Update Q-Table with new knowledge
        Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a])
        rAll += r
        s = s1
        if d:
            break
    #jList.append(j)
    rList.append(rAll)
 
print("Score over time: " + str(sum(rList)/num_episodes))
 
print("Final Q-Table Values")
print(Q)
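
Once a table has been learned, the greedy policy can be read straight off it. The sketch below uses a random table as a stand-in for a trained `Q` so that it runs without gym; the arrow rendering and the action ordering (left, down, right, up) are assumptions for illustration.

```python
import numpy as np

def greedy_policy(Q):
    """Best action index for every state: the argmax of each table row."""
    return np.argmax(Q, axis=1)

def render_policy(Q, ncols=4):
    """Draw the greedy policy as a grid of arrows, one character per block."""
    arrows = "<v>^"  # assumed ordering: left, down, right, up
    best = greedy_policy(Q)
    rows = ["".join(arrows[a] for a in best[i:i + ncols])
            for i in range(0, len(best), ncols)]
    return "\n".join(rows)

# Stand-in for a trained table; replace with the Q learned above.
rng = np.random.default_rng(0)
demo_Q = rng.random((16, 4))
grid = render_policy(demo_Q)
print(grid)
```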

q_learning with Neural Networks

Tables can’t hold the data for problems of real-world size, so we need Neural Networks.

A Neural Network acts as a function_approximator: it can take any number of possible states that can be represented as a vector and learn to map them to q_values.
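
A minimal sketch of the idea, assuming the simplest possible "network": a single linear layer in NumPy that maps a one-hot state vector to four q_values. The weight initialization, learning rate, and the single repeated transition are illustrative choices, not taken from the reference.

```python
import numpy as np

N_STATES, N_ACTIONS = 16, 4
rng = np.random.default_rng(0)
# Weights of a one-layer linear network: state vector (16,) -> q_values (4,)
W = rng.uniform(0.0, 0.01, size=(N_STATES, N_ACTIONS))

def one_hot(state, n=N_STATES):
    """Encode a state id as a vector with a single 1."""
    v = np.zeros(n)
    v[state] = 1.0
    return v

def q_values(state_vec, W):
    """Forward pass: one q_value per action."""
    return state_vec @ W

def train_step(W, s, a, r, s_next, done, lr=0.1, gamma=0.99):
    """One gradient step on the squared error between Q(s,a) and its target."""
    x = one_hot(s)
    q = q_values(x, W)
    target = r if done else r + gamma * np.max(q_values(one_hot(s_next), W))
    td_error = target - q[a]
    # Gradient of 0.5 * td_error**2 w.r.t. W[:, a] is -td_error * x.
    W[:, a] += lr * td_error * x
    return td_error

before = q_values(one_hot(14), W)[2]
for _ in range(50):
    # Repeatedly see the same terminal transition into the goal (reward 1).
    train_step(W, s=14, a=2, r=1.0, s_next=15, done=True)
after = q_values(one_hot(14), W)[2]
print(before, after)
```

With one-hot inputs this behaves like the table, but the same code accepts any state vector, which is the point of using an approximator.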

Reference: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0

While Neural Networks allow for greater flexibility, they do so at the cost of stability when it comes to Q-learning. experience_replay and freezing_target_networks allow for greater performance and more robust learning.
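
A rough sketch of both stabilizers. To stay short, a plain array stands in for the network weights; the class name `ReplayBuffer`, the sync interval, and the two stored transitions are all assumptions for illustration.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Stores past transitions so updates can sample them out of order."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop off

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        items = list(self.buffer)
        return random.sample(items, min(batch_size, len(items)))

online_W = np.zeros((16, 4))   # updated on every learning step
target_W = online_W.copy()     # frozen copy used to compute targets
SYNC_EVERY = 100               # hypothetical sync interval

def learn(batch, online_W, target_W, lr=0.1, gamma=0.99):
    for s, a, r, s_next, done in batch:
        # Targets come from the frozen copy, not the weights being updated.
        target = r if done else r + gamma * np.max(target_W[s_next])
        online_W[s, a] += lr * (target - online_W[s, a])

buffer = ReplayBuffer(capacity=1000)
buffer.add(14, 2, 1.0, 15, True)    # step into the goal
buffer.add(13, 2, 0.0, 14, False)   # step toward state 14

for step in range(1, 301):
    learn(buffer.sample(2), online_W, target_W)
    if step % SYNC_EVERY == 0:
        target_W[:] = online_W      # refresh the frozen target

print(online_W[14, 2], online_W[13, 2])
```

Until the first sync, the value for state 13 stays at zero because its target is computed from the frozen copy; that delay is the price paid for targets that do not shift on every update.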

two_arm_bandit problem: The goal is to discover the machine with the best payout, and maximize the returned reward by always choosing it.

  • task Question: Neural Networks will get us the result we want, but will the machine know why it took each action it did?

learning_a_policy → learning which rewards we get for each of the possible actions, and ensuring we choose the optimal ones.
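
A sketch of learning a policy on the two_arm_bandit with a basic policy_gradient (REINFORCE-style) update. The payout probabilities, learning rate, and softmax parametrization are hypothetical choices for illustration; the agent never sees the payout probabilities directly.

```python
import numpy as np

rng = np.random.default_rng(0)
PAYOUT_PROBS = np.array([0.2, 0.8])  # hypothetical: arm 1 pays out more often
theta = np.zeros(2)                  # one preference per arm (policy parameters)

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

lr = 0.1
for _ in range(2000):
    probs = softmax(theta)
    arm = rng.choice(2, p=probs)                      # sample an action from the policy
    reward = float(rng.random() < PAYOUT_PROBS[arm])  # payout of 0 or 1
    # Gradient of log pi(arm) w.r.t. theta is one_hot(arm) - probs.
    grad_log = -probs
    grad_log[arm] += 1.0
    # Gradient ascent on expected reward: reinforce actions that paid out.
    theta += lr * reward * grad_log

final_probs = softmax(theta)
print(final_probs)
```

Over training the policy shifts its probability toward the better arm, without ever building a table of q_values.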