$\hat{Q}_{i,t}$ is the estimate of the expected reward-to-go if we take action $a_{i,t}$ in state $s_{i,t}$
But $\hat{Q}_{i,t}$ currently has very high variance
$\hat{Q}_{i,t}$ only takes into account a single sampled chain of state-action pairs
This is because we approximated the gradient by stripping away the expectation
We can get a better estimate by using the true expected reward-to-go: $Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\!\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$
What about baseline? Can we apply a baseline even if we have a true Q function?
It turns out we can do better (less variance) than $b_t = \frac{1}{N} \sum_i Q(s_{i,t}, a_{i,t})$, because with the value function $V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}\!\left[ Q^\pi(s_t, a_t) \right]$ the baseline can depend on the state.
So now our gradient becomes $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( Q^\pi(s_{i,t}, a_{i,t}) - V^\pi(s_{i,t}) \right)$
We will also name something new, the advantage function: $A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$
How much better the action $a_{i,t}$ is than the average action
The better the estimate of $A^\pi$ is, the lower the variance of $\nabla_\theta J(\theta)$.
Fitting Q/Value/Advantage Functions
But the problem is: which one should we fit, $Q^\pi$, $V^\pi$, or $A^\pi$?
Let’s do some approximation and find out
We will introduce the discount factor $\gamma$ later; just take $\gamma = 1$ for now.
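Writing out the approximation (this is the standard recursion relating $Q^\pi$, $V^\pi$, and $A^\pi$, with $\gamma = 1$ as just stated):

$$
\begin{aligned}
Q^\pi(s_t, a_t) &= r(s_t, a_t) + \sum_{t'=t+1}^{T} E_{\pi_\theta}\!\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right] \\
&= r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\!\left[ V^\pi(s_{t+1}) \right] \\
&\approx r(s_t, a_t) + V^\pi(s_{t+1}) \\
A^\pi(s_t, a_t) &= Q^\pi(s_t, a_t) - V^\pi(s_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)
\end{aligned}
$$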
So we see that it is easy to fit $V^\pi$ and then use it to approximate both $Q^\pi$ and $A^\pi$.
$V^\pi$ is relatively easier to fit because it does not involve the action; it depends only on the state.
Policy Evaluation
Policy evaluation: estimate how good the current policy is (its expected reward) without improving it; this is what policy gradient does as well
Ideally we want to compute the true expectation of rewards-to-go, $V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\!\left[ r(s_{t'}, a_{t'}) \mid s_t \right]$, e.g. by resetting to the same state and averaging over many rollouts
But in a model-free setting we cannot reset back to a state and run multiple trials
So…
Monte Carlo policy evaluation
Use empirical returns to train a value function that approximates this expectation
Instead of plugging those rewards directly into the policy gradient, we fit a model to them ⇒ this reduces variance
Because even though we cannot visit the same state twice, the function approximator will combine information across similar states
And we can of course use MSE and other standard supervised training losses, e.g. $\mathcal{L}(\phi) = \frac{1}{2} \sum_i \lVert \hat{V}^\pi_\phi(s_i) - y_i \rVert^2$ on the targets below
Ideal target: $y_{i,t} = \sum_{t'=t}^{T} E_{\pi_\theta}\!\left[ r(s_{t'}, a_{t'}) \mid s_{i,t} \right]$ (the true expected reward-to-go)
Monte Carlo target: $y_{i,t} = \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})$ (a single-sample estimate of that expectation)
Training data would be: $\left\{ \left( s_{i,t},\; y_{i,t} \right) \right\}$ (sketched in code below)
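A minimal sketch of turning sampled trajectories into this Monte Carlo training set. The trajectory dictionary layout and the function name are assumptions for illustration, not from the lecture:

```python
import numpy as np

def monte_carlo_value_targets(trajectories, gamma=1.0):
    """Build (state, reward-to-go) pairs for supervised value-function regression.

    trajectories: list of dicts with keys "states" (T x state_dim) and
                  "rewards" (length T) -- an assumed data layout.
    gamma: discount factor (introduced later in these notes); 1.0 for now.
    """
    states, targets = [], []
    for traj in trajectories:
        rewards = np.asarray(traj["rewards"], dtype=np.float64)
        # reward-to-go: y_t = sum_{t' >= t} gamma^(t'-t) * r_t'
        rtg = np.zeros_like(rewards)
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            rtg[t] = running
        states.append(np.asarray(traj["states"]))
        targets.append(rtg)
    return np.concatenate(states), np.concatenate(targets)
```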
We can do even better (bootstrapped estimate):
Hmmm… looks like we can modify the ideal target a bit: $y_{i,t} = \sum_{t'=t}^{T} E_{\pi_\theta}\!\left[ r(s_{t'}, a_{t'}) \mid s_{i,t} \right] \approx r(s_{i,t}, a_{i,t}) + V^\pi(s_{i,t+1})$
Since we don't know $V^\pi$, we can approximate it with $\hat{V}^\pi_\phi(s_{i,t+1})$, our previous value-function approximator (bootstrapped estimate)
So now the training data is: $\left\{ \left( s_{i,t},\; r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1}) \right) \right\}$
Batch actor-critic algorithm
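A rough sketch of what one batch actor-critic iteration looks like with the bootstrapped targets above. Everything here (`sample_trajectories`, the `policy` / `value_fn` objects and their methods) is an assumed interface for illustration, not code from the lecture; `gamma` is the discount factor introduced in the next section and can be taken as 1 for now.

```python
def batch_actor_critic_iteration(env, policy, value_fn, n_traj=10, gamma=1.0):
    """One iteration of batch actor-critic (sketch, assumed interfaces).

    1. sample trajectories from the current policy
    2. fit V_hat_phi(s) to bootstrapped targets r + gamma * V_hat_phi(s')
    3. evaluate advantages A_hat = r + gamma * V_hat_phi(s') - V_hat_phi(s)
    4. take a policy gradient step weighted by A_hat
    """
    # 1. collect a batch of (s, a, r, s', done) transitions (assumed helper, numpy arrays)
    s, a, r, s_next, done = sample_trajectories(env, policy, n_traj)

    # 2. bootstrapped value targets; do not bootstrap past terminal states
    targets = r + gamma * (1.0 - done) * value_fn.predict(s_next)
    value_fn.fit(s, targets)          # e.g. a few MSE regression steps

    # 3. advantage estimates under the updated critic
    adv = r + gamma * (1.0 - done) * value_fn.predict(s_next) - value_fn.predict(s)

    # 4. policy gradient: average of grad log pi_theta(a|s) * A_hat over the batch
    policy.gradient_step(s, a, adv)
```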
Discount Factor
If $T \to \infty$, $\hat{V}^\pi_\phi$ (the approximator for $V^\pi$) can become infinitely large in many cases, so we reshape the reward to prefer "sooner rather than later": $y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1})$
where $\gamma \in [0, 1]$ is the discount factor; it makes the reward you get decay at every timestep ⇒ so the obtainable reward over an infinite lifetime is actually bounded (when $\gamma < 1$).
One way to understand how $\gamma$ affects the policy is that $\gamma$ adds a "death state" that you transition to with probability $1 - \gamma$ at each step; once you're in it, you can never get out (and collect no more reward).
Online Actor-critic Algorithm
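The online version updates after every single environment step instead of after a whole batch of trajectories. A minimal sketch, with assumed `policy` / `value_fn` / `env` interfaces (not the lecture's code):

```python
def online_actor_critic_step(env, policy, value_fn, state, gamma=0.99):
    """Take one action, then update critic and actor from that single transition (sketch)."""
    action = policy.sample(state)                     # a ~ pi_theta(a | s)
    next_state, reward, done = env.step(action)       # assumed env interface

    # critic: regress V_hat_phi(s) toward the bootstrapped target r + gamma * V_hat_phi(s')
    target = reward + gamma * (0.0 if done else value_fn.predict(next_state))
    value_fn.fit(state, target)

    # actor: A_hat(s, a) = r + gamma * V_hat_phi(s') - V_hat_phi(s), then grad log pi(a|s) * A_hat
    advantage = target - value_fn.predict(state)
    policy.gradient_step(state, action, advantage)

    return env.reset() if done else next_state
```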
In practice: updating from one transition at a time is very noisy, so we use parallel workers (synchronous or asynchronous) to collect a batch of transitions for each update.
Off-policy Actor-critic algorithms
Idea: collect data, and instead of training on it directly, put it into a replay buffer. At training time, instead of using only the data just collected, fetch a batch randomly from the replay buffer.
Coming from the online algorithm, let's see what problems we need to fix:
(1) Under the current policy, our policy might not even have taken the stored action $a_i$, so it's not right to assume we will receive the reward $r(s_i, a_i, s_i')$ ⇒ we may not even arrive at state $s_i'$
(2) For the same reason, we may not have taken $a_i$ as our action under the current policy when computing the policy gradient
We can fix the problem in (1) by using $Q^\pi(s_t, a_t)$ ⇒ replace the term $\gamma \hat{V}^\pi_\phi(s_i')$
Now we fit a Q-function approximator $\hat{Q}^\pi_\phi(s, a)$ instead of $\hat{V}^\pi_\phi(s)$,
and we replace the target value with $y_i = r_i + \gamma \hat{Q}^\pi_\phi(s_i', a_i')$, where $a_i' \sim \pi_\theta(a' \mid s_i')$ is sampled from the current policy (no extra environment interaction needed, just query the policy)
Same for (2): sample an action from the current policy, $a_i^\pi \sim \pi_\theta(a \mid s_i)$, rather than using the action stored in the buffer
And instead of plugging in the advantage function, we plug $\hat{Q}^\pi_\phi(s_i, a_i^\pi)$ into the policy gradient
It's fine that this has higher variance (it's not baselined), because it's easier, and now we don't need to generate more states ⇒ we can just sample more actions from the current policy
In exchange use a larger batch size ⇒ all good!
Now our final result:
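A sketch of how the whole off-policy loop might fit together, with a replay buffer and a Q-function critic. The buffer layout and the `policy` / `q_fn` / `env` interfaces are assumptions for illustration; see SAC below for a real practical instance.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=1_000_000)   # stores (s, a, r, s', done) tuples

def off_policy_actor_critic_step(env, policy, q_fn, state, gamma=0.99, batch_size=256):
    """One step of off-policy actor-critic with a replay buffer (sketch, assumed interfaces)."""
    # 1. act with the latest policy and store the transition in the buffer
    action = policy.sample(state)
    next_state, reward, done = env.step(action)
    replay_buffer.append((state, action, reward, next_state, done))

    # 2. sample a batch of old transitions from the replay buffer
    batch = random.sample(replay_buffer, min(batch_size, len(replay_buffer)))
    s, a, r, s_next, d = zip(*batch)

    # 3. critic: targets use an action at s' sampled from the CURRENT policy (fix for problem (1))
    a_next = [policy.sample(sn) for sn in s_next]
    targets = [ri + gamma * (0.0 if di else q_fn.predict(sni, ani))
               for ri, sni, ani, di in zip(r, s_next, a_next, d)]
    q_fn.fit(s, a, targets)

    # 4. actor: resample actions at the buffered states from the current policy (fix for problem (2))
    #    and weight grad log pi(a_pi | s) by Q_hat(s, a_pi); no baseline, just use a larger batch
    a_pi = [policy.sample(si) for si in s]
    policy.gradient_step(s, a_pi, [q_fn.predict(si, ai) for si, ai in zip(s, a_pi)])

    return env.reset() if done else next_state
```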
Example Practical Algorithm:
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. 2018.
Critics as state-dependent baselines
In actor-critic, the gradient uses the bootstrapped critic estimate: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i} \sum_{t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( r(s_{i,t}, a_{i,t}) + \gamma \hat{V}^\pi_\phi(s_{i,t+1}) - \hat{V}^\pi_\phi(s_{i,t}) \right)$
This method of using a fitted model to approximate the value/Q/advantage function:
Lowers variance
Biased as long as the critic is not perfect
In policy gradient, we use the single-sample return with a constant baseline: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i} \sum_{t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \left( \sum_{t'=t}^{T} \gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) \right) - b \right)$
This method:
No bias
High variance
So can we use a state-dependent baseline to keep the estimator unbiased while still reducing the variance a bit?
Not only does the policy gradient remain unbiased when you subtract any constant $b$, it also remains unbiased when you subtract any function that depends only on the state $s_{i,t}$ (and not on the action)
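A quick way to see this (the usual argument): for any baseline $b(s)$ that does not depend on the action,

$$
E_{a \sim \pi_\theta(a \mid s)}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \right]
= b(s) \int \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, da
= b(s)\, \nabla_\theta \int \pi_\theta(a \mid s)\, da
= b(s)\, \nabla_\theta 1 = 0
$$

So subtracting something like $\hat{V}^\pi_\phi(s_{i,t})$ inside the sum changes nothing in expectation.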
Control variates: methods that use state- and action-dependent baselines
No bias
Higher variance (because single-sample estimate)
Goes to zero in expectation if critic is correct
If critic is not correct, bomb shakalaka
The expectation integrates to an error term that needs to be compensated for
To account for the error in the baseline, we modify the gradient to: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i,t} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \hat{Q}_{i,t} - Q^\pi_\phi(s_{i,t}, a_{i,t}) \right) + \frac{1}{N} \sum_{i,t} \nabla_\theta E_{a \sim \pi_\theta(a \mid s_{i,t})}\!\left[ Q^\pi_\phi(s_{i,t}, a) \right]$
Use the critic without introducing bias, provided the second term can be evaluated
Gu et al. 2016 (Q-Prop)
Eligibility traces & n-step returns
Again, the critic (bootstrapped) advantage estimator: $\hat{A}^\pi_C(s_t, a_t) = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$
Lower variance
higher bias if value estimate is wrong
The Monte Carlo advantage estimator: $\hat{A}^\pi_{MC}(s_t, a_t) = \sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t)$
unbiased
higher variance (single-sample estimate)
Can we get something in between? ⇒ n-step returns (see the estimator written out after the facts below)
Facts:
the discounted contribution of rewards gets smaller as $t' \to \infty$
So bias (from the critic) is a much smaller problem when $t'$ is big
Variance is more of a problem for rewards far in the future (single-sample estimates of the far future)
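So cut the single-sample sum after n steps and let the critic handle the rest. One common way to write the n-step advantage estimator (indexing conventions vary slightly):

$$
\hat{A}^\pi_n(s_t, a_t) = \sum_{t'=t}^{t+n-1} \gamma^{t'-t}\, r(s_{t'}, a_{t'}) \;+\; \gamma^{n}\, \hat{V}^\pi_\phi(s_{t+n}) \;-\; \hat{V}^\pi_\phi(s_t)
$$

For $n = 1$ this is the critic estimator above; as $n \to \infty$ (dropping the bootstrap term) it recovers the Monte Carlo estimator.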
Generalized advantage estimation
The n-step advantage estimator is good, but can we generalize it into a hybrid?
Instead of hard-cutting between the two estimators, why don't we combine them everywhere: $\hat{A}^\pi_{\text{GAE}}(s_t, a_t) = \sum_{n=1}^{\infty} w_n \hat{A}^\pi_n(s_t, a_t)$
where $\hat{A}^\pi_n$ stands for the n-step estimator and the $w_n$ are mixing weights
“Most prefer cutting earlier (less variance)”
So we set $w_n \propto \lambda^{n-1}$ ⇒ exponential falloff
Then $\hat{A}^\pi_{\text{GAE}}(s_t, a_t) = \sum_{t'=t}^{\infty} (\gamma \lambda)^{t'-t}\, \delta_{t'}$, where $\delta_{t'} = r(s_{t'}, a_{t'}) + \gamma \hat{V}^\pi_\phi(s_{t'+1}) - \hat{V}^\pi_\phi(s_{t'})$
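A minimal sketch of computing GAE for a single trajectory using the $\delta$-form above (the array layout and function name are assumptions):

```python
import numpy as np

def generalized_advantage_estimation(rewards, values, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory.

    rewards: length-T array of r(s_t, a_t)
    values:  length-(T+1) array of V_hat_phi(s_t), including the state after the last step
    Returns A_hat_GAE(s_t, a_t) for t = 0..T-1, via
        delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        A_t     = sum_{t' >= t} (gamma * lam)^(t' - t) * delta_t'
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    deltas = rewards + gamma * values[1:] - values[:-1]

    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running   # recursive form of the (gamma*lam)-weighted sum
        advantages[t] = running
    return advantages
```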
Examples of actor-critic algorithms
Schulman, Moritz, Levine, Jordan, Abbeel ‘16. High dimensional continuous control with generalized advantage estimation
Batch-mode actor-critic
Blends Monte Carlo and critic (function approximator) returns via GAE
Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu ‘16. Asynchronous methods for deep reinforcement learning
Online actor-critic, parallelized batch
CNN end-to-end
N-step returns with N=4
Single network for actor and critic
Suggested Readings
Classic: Sutton, McAllester, Singh, Mansour (1999). Policy gradient methods for reinforcement learning with function approximation: actor-critic algorithms with value function approximation
Covers the material discussed in this lecture
Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver, Kavukcuoglu (2016). Asynchronous methods for deep reinforcement learning: A3C -- parallel online actor-critic
Schulman, Moritz, Levine, Jordan, Abbeel (2016). High-dimensional continuous control using generalized advantage estimation: batch-mode actor-critic with blended Monte Carlo and function approximator returns
Gu, Lillicrap, Ghahramani, Turner, Levine (2017). Q-Prop: sample-efficient policy-gradient with an off-policy critic: policy gradient with Q-function control variate