policy_gradient → Policy gradient methods are a type of reinforcement_learning technique that optimizes parametrized policies with respect to the expected return (long-term cumulative reward) by gradient_descent (in practice, gradient ascent on the return, which is the same as descent on its negative).
Unlike policy_gradient methods, which attempt to learn functions that directly map an observation to an action, q_learning attempts to learn the value of being in a given state and taking a specific action there.
openai_gym → Experiment with learning agents in an array of provided games. The frozen_lake env consists of a 4x4 grid of blocks. Each block is one of:
- start block - Reward 0
- goal block - Reward 1
- safe frozen block
- death hole
The agent can choose a direction to move, but the wind can also move the agent, so perfect performance every time is impossible.
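As a rough illustration of the grid described above, the sketch below encodes a 4x4 layout as strings and numbers the blocks row by row. The specific map, and the names `GRID`, `state_index` and `reward`, are assumptions made for illustration; the only facts taken from the notes are the four block types and the 0/1 rewards.

```python
# Illustrative 4x4 layout (assumed): S = start, F = frozen, H = hole, G = goal.
GRID = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]


def state_index(row, col, n_cols=4):
    """Number blocks row by row, so a 4x4 grid gives states 0..15."""
    return row * n_cols + col


def reward(row, col):
    """Assumed reward scheme: 1 on the goal block, 0 on every other block."""
    return 1.0 if GRID[row][col] == "G" else 0.0


n_states = sum(len(line) for line in GRID)
holes = [
    state_index(r, c)
    for r, line in enumerate(GRID)
    for c, ch in enumerate(line)
    if ch == "H"
]
print(n_states, holes)
```

Numbering the blocks this way is what lets each block serve as one row of a table later on.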
q_learning_table for the frozen_lake problem: 16 possible states (one for each block) and 4 possible actions (the four directions of movement), giving us a 16x4 table of q_values. The q_learning_table initially holds all zeros and is then updated as we observe the rewards for various actions.
The q_learning_table is updated using the bellman_equation, which states that the expected long-term reward for a given action is equal to the immediate reward from the current action combined with the expected reward from the best future action taken at the following state. In this way, we reuse our own q_learning_table when estimating how to update our table for future actions.
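bellman_equation → Q(s,a) = r + γ · max over a’ of Q(s’,a’)
r → immediate reward
γ → discount factor (how much future reward counts relative to immediate reward)
s’ → next state
a’ → action available in the next state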
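To make the update concrete, here is a small worked example of a single table update with made-up numbers. It uses the soft form of the update, which moves Q(s,a) only part of the way toward the Bellman target according to a learning rate. The helper name `q_update`, the chosen states, and the parameter values are all illustrative assumptions.

```python
import numpy as np


def q_update(Q, s, a, r, s1, lr=0.8, gamma=0.95):
    """Move Q[s, a] a fraction lr of the way toward r + gamma * max_a' Q[s1, a']."""
    target = r + gamma * np.max(Q[s1, :])
    Q[s, a] += lr * (target - Q[s, a])
    return Q[s, a]


Q = np.zeros((16, 4))              # 16 states x 4 actions, all zeros at first
Q[14, :] = [0.0, 0.5, 1.0, 0.0]    # pretend state 14 already has learned values

# Taking action 2 in state 13 gives reward 0 and lands in state 14.
# Target = 0 + 0.95 * 1.0 = 0.95; the entry moves 80% of the way there.
new_value = q_update(Q, s=13, a=2, r=0.0, s1=14)
print(new_value)
```

Even though the immediate reward was 0, the entry for state 13 rises to about 0.76, because the table already believes state 14 is valuable. This is the sense in which the table is reused to update itself.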
q_learning_table implementation for frozen_lake
import gym
import numpy as np
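### Load the environment
# Written against the classic gym API: newer releases name the environment
# 'FrozenLake-v1', and reset()/step() return extra values.
env = gym.make('FrozenLake-v0')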
### Implement Q-Table learning algorithm
#Initialize table with all zeros
Q = np.zeros([env.observation_space.n,env.action_space.n])
# Set learning parameters
lr = .8
y = .95
num_episodes = 2000
#create lists to contain total rewards and steps per episode
#jList = []
rList = []
for i in range(num_episodes):
    #Reset environment and get first new observation
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    #The Q-Table learning algorithm
    while j < 99:
        j+=1
        #Choose an action by greedily (with noise) picking from Q table
        a = np.argmax(Q[s,:] + np.random.randn(1,env.action_space.n)*(1./(i+1)))
        #Get new state and reward from environment
        s1,r,d,_ = env.step(a)
        #Update Q-Table with new knowledge
        Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a])
        rAll += r
        s = s1
        if d:
            break
    #jList.append(j)
    rList.append(rAll)
print("Score over time: " + str(sum(rList)/num_episodes))
print("Final Q-Table Values")
print(Q)
q_learning with Neural Networks
Tables don’t scale to real-world problems, where there are far too many states to list one row per state, so we need neural networks.
By acting as a function_approximator, a neural network can take any number of possible states that can be represented as a vector and learn to map them to q_values.
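A minimal sketch of the function-approximator idea, assuming the simplest possible "network": a single linear layer applied to a one-hot state vector, trained with a gradient step on the squared difference between the predicted q_value and the Bellman target. The names `W`, `one_hot`, `q_values` and `train_step`, and all parameter values, are illustrative assumptions, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 16, 4

# One linear layer: a state vector of length 16 maps to 4 q_values.
W = rng.uniform(0.0, 0.01, size=(n_states, n_actions))


def one_hot(s):
    """Represent state s as a vector, the input format the network expects."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x


def q_values(s):
    """Forward pass: state vector times weights gives one q_value per action."""
    return one_hot(s) @ W


def train_step(s, a, r, s1, done, lr=0.1, gamma=0.99):
    """One gradient step on 0.5 * (target - Q(s, a))**2 with the target held fixed."""
    x = one_hot(s)
    target = r if done else r + gamma * np.max(q_values(s1))
    td_error = target - q_values(s)[a]
    W[:, a] += lr * td_error * x   # negative gradient of the loss w.r.t. W[:, a]
    return td_error


# Repeatedly observe: action 2 in state 14 reaches the goal with reward 1.
for _ in range(200):
    train_step(s=14, a=2, r=1.0, s1=15, done=True)
print(q_values(14))
```

With one-hot inputs this behaves much like the table, but the same code works unchanged if `one_hot` is swapped for any other vector representation of the state, which is the point of using an approximator.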
While neural networks allow for greater flexibility, they do so at the cost of stability when it comes to Q-learning. experience_replay and freezing_target_networks allow for greater performance and more robust learning.
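The two stabilizers can be sketched roughly as follows. Experience replay stores past transitions and trains on random samples of them, and a frozen target network is a copy of the weights that is only refreshed every so often and is used to compute the targets. For brevity the "network" here is just a weight table; the names `ReplayBuffer`, `train_on_batch`, `sync_target` and the sync interval of 50 steps are assumptions chosen for illustration.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s1, done) transitions; oldest are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s1, done):
        self.buffer.append((s, a, r, s1, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive steps.
        items = list(self.buffer)
        return random.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self.buffer)


n_states, n_actions = 16, 4
online_W = np.zeros((n_states, n_actions))   # weights being trained
target_W = online_W.copy()                   # frozen copy used for targets


def train_on_batch(batch, lr=0.1, gamma=0.99):
    for s, a, r, s1, done in batch:
        # Targets come from the frozen copy, so they do not shift every step.
        target = r if done else r + gamma * np.max(target_W[s1])
        online_W[s, a] += lr * (target - online_W[s, a])


def sync_target():
    """Refresh the frozen copy with the current online weights."""
    target_W[:] = online_W


buffer = ReplayBuffer(capacity=1000)
buffer.add(14, 2, 1.0, 15, True)    # reaching the goal
buffer.add(13, 2, 0.0, 14, False)   # one step before it

for step in range(1, 201):
    train_on_batch(buffer.sample(batch_size=2))
    if step % 50 == 0:
        sync_target()
print(online_W[13, 2], online_W[14, 2])
```

Note that the value for state 13 only starts to grow after the first sync, since until then the frozen copy still reports zero for state 14. That delay is the price paid for more stable targets.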
two_arm_bandit problem: The goal is to discover the machine with the best payout, and maximize the returned reward by always choosing it.
- task Ques. So neural networks will get us the result we want, but will the machine know why it took each action it did?
learning_a_policy → learning which rewards we get for each of the possible actions, and ensuring we choose the optimal ones
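A rough sketch of learning a policy for the two_arm_bandit with a policy gradient: the policy is a softmax over one preference per arm, and after each pull the preferences are nudged in the direction that makes rewarded choices more likely. The payout probabilities, learning rate, and number of pulls are hypothetical values picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
payout_prob = np.array([0.2, 0.8])   # hypothetical: arm 1 pays out more often
theta = np.zeros(2)                  # policy parameters, one preference per arm


def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()


lr = 0.1
for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                       # sample an arm from the policy
    r = 1.0 if rng.random() < payout_prob[a] else 0.0
    # Gradient of log pi(a) for a softmax policy: one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += lr * r * grad_log_pi                    # ascent on expected reward

print(softmax(theta))
```

There is no table of values here at all. The agent only adjusts how likely it is to pick each arm, which is the contrast with q_learning drawn at the top of these notes.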