Value-based (and Q-) Learning
Value Function / Q Function Fitting (Value-Based)
Can we omit policy gradient completely from actor-critic methods?
Actor-critic minimizes the critic objective $\mathcal{L}(\phi) = \tfrac{1}{2} \sum_i \| \hat{V}^\pi_\phi(s_i) - y_i \|^2$ and then improves the policy with a policy gradient step.
But the best policy can just be extracted from the advantage directly: $\pi'(a_t \mid s_t) = 1$ if $a_t = \arg\max_{a_t} A^\pi(s_t, a_t)$, and $0$ otherwise.
We know that $\arg\max_{a_t} A^\pi(s_t, a_t)$ is at least as good as any $a_t \sim \pi(a_t \mid s_t)$, and probably better.
Same as before, $A^\pi(s, a) = r(s, a) + \gamma E[V^\pi(s')] - V^\pi(s)$.
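As a tiny illustration, here is what "extracting the policy" amounts to for a tabular advantage (or Q) function; the numbers in `A` are made up for the example:

```python
import numpy as np

# Hypothetical advantage table A[s, a] for 3 states and 2 actions.
A = np.array([[ 0.1, -0.2],
              [-0.5,  0.3],
              [ 0.0,  0.0]])

# pi'(a|s) = 1 if a = argmax_a A(s, a), else 0 -- no policy gradient needed.
greedy_actions = A.argmax(axis=1)               # best action index per state
pi_prime = np.eye(A.shape[1])[greedy_actions]   # one-hot deterministic policy
print(pi_prime)
```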
Policy Iteration (Fitted Value Iteration)
So fitting the value function is easy if we know the transition dynamics: it is just a least-squares regression onto the bootstrapped targets… but what if we don't know the transition dynamics?
Policy Iteration without transition probability (Fitted Q-Iteration)
What if we change the policy evaluation part to $Q^\pi(s, a) \leftarrow r(s, a) + \gamma \, E_{s'}[Q^\pi(s', \pi(s'))]$?
Now, if we have an $(s, a, s', r)$ tuple, then even if the policy changes the action it takes in state $s$, we can still do policy evaluation on that tuple, because the only place the policy shows up is inside the expectation (at the next state).
This simple change lets us run policy-iteration-style algorithms without actually knowing the transition dynamics.
Instead of using $V^\pi(s)$ we use $Q^\pi(s, a)$, the value function is approximated by a parameterized $Q_\phi$, and instead of taking an expectation over all possible next states we use the single next state $s'$ from the sample (a minimal sketch of the resulting fitted Q-iteration loop follows the list below).
- Works for off-policy samples
- Only one network, no high-variance policy gradient
- No convergence guarantees for non-linear function approximation (more on this later)
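Below is a minimal sketch of the fitted Q-iteration loop described above. The state/action dimensions, hyperparameters, and the `dataset` of `(s, a, r, s2, done)` arrays are assumptions made just for illustration:

```python
import torch
import torch.nn as nn

gamma = 0.99
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # assumes 4-dim states, 2 discrete actions
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def fitted_q_iteration(dataset, n_iters=10, n_fit_steps=200):
    s, a, r, s2, done = (torch.as_tensor(x, dtype=torch.float32) for x in dataset)
    a = a.long()
    for _ in range(n_iters):
        # 1. set targets y_i = r_i + gamma * max_a' Q_phi(s'_i, a'), using the sampled next state
        with torch.no_grad():
            y = r + gamma * (1 - done) * q_net(s2).max(dim=1).values
        # 2. regress Q_phi(s_i, a_i) onto the (frozen) targets
        for _ in range(n_fit_steps):
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = ((q_sa - y) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
```

Note how the policy never appears explicitly: the max over actions in the target is the (implicit) greedy policy, which is why off-policy samples are fine.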
Special Cases of Fitted Q-Iteration
During the exploration phase, acting greedily may not be the best choice (we want to maximize entropy!)
“Epsilon Greedy”
“epsilon chance of exploring options other than the (currently) optimal action”: $\pi(a_t \mid s_t) = 1 - \epsilon$ if $a_t = \arg\max_{a_t} Q_\phi(s_t, a_t)$, and $\epsilon / (|\mathcal{A}| - 1)$ otherwise
Common practice is to vary epsilon during training (early on, when Q is bad, use a larger epsilon)
“Exponential” (Boltzmann / softmax): $\pi(a_t \mid s_t) \propto \exp(Q_\phi(s_t, a_t))$
The best action will be the most frequent, but some probability is left for exploration. Plus, if there is a clear second-largest Q value, that action is chosen more often than the other options (both rules are sketched below).
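A sketch of both exploration rules; `epsilon` and `temperature` are illustrative values, not prescriptions from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon take a random action, otherwise the argmax."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample actions with probability proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values) / temperature
    probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))
```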
Value Function Learning Theory
For the tabular value iteration algorithm:
The Bellman operator $\mathcal{B}$, defined by $\mathcal{B}V = \max_a (r_a + \gamma \mathcal{T}_a V)$, captures the logic of the update.
And one fact is that:
$V^\star$, the value function of the optimal policy, is a fixed point of $\mathcal{B}$: $\mathcal{B}V^\star = V^\star$
$V^\star$ always exists, is always unique, and always corresponds to the optimal policy
Does fixed-point iteration converge to $V^\star$?
We can prove that value iteration reaches $V^\star$ because $\mathcal{B}$ is a contraction
Contraction means: for any $V$ and $\bar{V}$, we have $\| \mathcal{B}V - \mathcal{B}\bar{V} \|_\infty \le \gamma \| V - \bar{V} \|_\infty$, where $\gamma < 1$
Meaning that $V$ and $\bar{V}$ will get closer and closer as we apply $\mathcal{B}$ to them
If we replace $\bar{V}$ by $V^\star$ (and since $\mathcal{B}V^\star = V^\star$), we get $\| \mathcal{B}V - V^\star \|_\infty \le \gamma \| V - V^\star \|_\infty$
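A tabular sketch of the Bellman operator and fixed-point iteration; the random MDP here is made up purely to show the convergence behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.9
T = rng.dirichlet(np.ones(S), size=(A, S))   # T[a, s, s'] = p(s' | s, a)
r = rng.random((A, S))                       # r[a, s]

def bellman(V):
    # (B V)(s) = max_a [ r(s, a) + gamma * sum_s' p(s'|s,a) V(s') ]
    return np.max(r + gamma * T @ V, axis=0)

V = np.zeros(S)
for _ in range(200):     # fixed-point iteration V <- B V
    V = bellman(V)
# Because B is an infinity-norm contraction, V has (numerically) converged to V*.
print(V, np.max(np.abs(bellman(V) - V)))
```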
For the fitted value iteration algorithm:
Step 1 basically applies the Bellman operator $\mathcal{B}$
We will define a new operator for step 2, $\Pi$
Step 2 performs the following: $\Pi V = \arg\min_{V' \in \Omega} \tfrac{1}{2} \sum \| V'(s) - V(s) \|^2$
“Find a value function in the value-function model hypothesis space $\Omega$ that optimizes the objective”
So step 2 performs this model-fitting operator $\Pi$ that projects onto the hypothesis space $\Omega$, and our iteration algorithm can now be described as $V \leftarrow \Pi \mathcal{B} V$
$\mathcal{B}$ is still a contraction with respect to the $\infty$-norm
$\Pi$ is a contraction w.r.t. the $\ell_2$-norm
However, $\Pi \mathcal{B}$ is not a contraction of any kind
So fitted value iteration does not converge in general and often does not converge in practice
Same with fitted Q-iteration and batch actor-critic
But why? Isn’t value function fitting just gradient descent?
We can actually turn this into a true gradient descent algorithm by also computing gradients through the target values ⇒ the resulting method is called the residual algorithm; it has poor numerical properties and doesn’t work well in practice
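To make the distinction concrete, here is a PyTorch sketch of the usual "semi-gradient" loss, which treats the target as a constant, versus the residual-gradient version, which also backpropagates through the target term; the `q_net` and tensor shapes are assumed as in the earlier sketch:

```python
import torch

def semi_gradient_loss(q_net, s, a, r, s2, gamma=0.99):
    # Standard fitted Q / Q-learning: the target is detached, so this is not true gradient descent.
    with torch.no_grad():
        y = r + gamma * q_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return ((q_sa - y) ** 2).mean()

def residual_gradient_loss(q_net, s, a, r, s2, gamma=0.99):
    # Residual algorithm: the gradient also flows through max_a' Q(s', a').
    y = r + gamma * q_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return ((q_sa - y) ** 2).mean()
```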
Correlation Problem in Q-Learning
Think about the sequence of transitions seen by online Q-learning: consecutive states are strongly correlated.
In the first few steps, the target values are kind of “overfit” to those few transitions.
In the supervised view of learning Q / value functions, the data points used to form the targets are only local, sequentially correlated states.
We can solve this by using larger batches of transitions at a time (via synchronized parallel Q-learning or asynchronous parallel Q-learning)
Or we use replay buffers (since fitted Q-learning is basically an off-policy method) ⇒ samples are no longer correlated, and multiple samples per batch give a low-variance gradient (a minimal buffer sketch follows this list)
But where does the data come from?
- Need to periodically feed the replay buffer with fresh transitions collected by (some exploratory version of) the latest policy
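A minimal replay-buffer sketch; the capacity and the uniform sampling are arbitrary illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are evicted automatically

    def add(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        idx = random.sample(range(len(self.buffer)), batch_size)   # uniform, decorrelated
        batch = [self.buffer[i] for i in idx]
        return list(zip(*batch))                    # (states, actions, rewards, next_states, dones)
```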
Maybe just take one gradient step on the objective instead of fully fitting the Q function in the inner loop
Usually ⇒ update the target parameters slowly, e.g. with Polyak averaging $\phi' \leftarrow \tau \phi' + (1 - \tau)\, \phi$, where $\tau = 0.999$ works well
The target network makes sure that we’re not trying to hit a moving target ⇒ and now it looks more like supervised regression
Question: about the chasing-its-tail relationship, is there a typical scenario where the Q function chases its tail and shows bad behavior?
Classic Deep Q-Learning Algorithm (DQN)
A special case (with specific parameters) of Q-learning with a replay buffer and a target network
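A sketch of the core DQN-style update with a replay buffer batch and a target network; the architectures, learning rate, and update period are illustrative assumptions, not the exact DQN hyperparameters:

```python
import copy
import torch
import torch.nn as nn

gamma = 0.99
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)                 # phi': lags behind phi
opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def dqn_update(batch, step, target_update_every=1000):
    s, a, r, s2, done = batch                     # tensors sampled from the replay buffer
    with torch.no_grad():                         # frozen target: no "moving target" problem
        y = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q_sa, y)  # Huber loss tames large Bellman-error gradients
    opt.zero_grad(); loss.backward(); opt.step()
    if step % target_update_every == 0:           # slow, periodic copy of phi into phi'
        target_net.load_state_dict(q_net.state_dict())
```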
A general view of Q-Learning
Online Q-learning (last lecture): evict samples immediately; process 1 (data collection), process 2 (target update), and process 3 (Q-function regression) all run at the same speed.
DQN: Process 1 and process 3 run at the same speed, process 2 is slow.
Fitted Q-iteration: process 3 is in the inner loop of process 2, which is in the inner loop of process 1
Overestimation in Q-Learning and Double Q-Learning
Because the target value is $y_j = r_j + \gamma \max_{a'_j} Q_{\phi'}(s'_j, a'_j)$, and the max is the source of the problem.
Imagine two random variables $X_1$ and $X_2$.
We can prove: $E[\max(X_1, X_2)] \ge \max(E[X_1], E[X_2])$
But:
$Q_{\phi'}(s', a')$ is not perfect, it is “noisy” ⇒ by training with the max we are always selecting the positive errors
Note that $\max_{a'} Q_{\phi'}(s', a') = Q_{\phi'}(s', \arg\max_{a'} Q_{\phi'}(s', a'))$: the same (noisy) network both selects the action and evaluates its value
So use “double Q-learning” ⇒ use two networks $Q_A$ and $Q_B$
Update $Q_A$ by using $Q_B$ as the value (target) network and $Q_A$ as the action selector, $Q_A(s, a) \leftarrow r + \gamma Q_B(s', \arg\max_{a'} Q_A(s', a'))$; the $Q_B$ update is the opposite.
Intuition: if the “optimal” action chosen by one network is only optimal due to noise, the other network’s estimate of that same action is unlikely to share that noise, so the “overestimation” effect is mitigated
In practice, instead of introducing two new Q functions, use the two Q functions that we already have.
We already have the current network $Q_\phi$ and the target network $Q_{\phi'}$
In standard Q-learning: $y = r + \gamma Q_{\phi'}(s', \arg\max_{a'} Q_{\phi'}(s', a'))$
Double Q-learning: $y = r + \gamma Q_{\phi'}(s', \arg\max_{a'} Q_{\phi}(s', a'))$ ⇒ the current network selects the action, the target network evaluates it
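The two target computations side by side, as a sketch (networks and tensor shapes assumed as in the earlier sketches):

```python
import torch

def standard_target(target_net, r, s2, done, gamma=0.99):
    with torch.no_grad():
        # the target network both selects and evaluates the action
        return r + gamma * (1 - done) * target_net(s2).max(dim=1).values

def double_q_target(q_net, target_net, r, s2, done, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s2).argmax(dim=1, keepdim=True)         # current network selects the action
        q_eval = target_net(s2).gather(1, a_star).squeeze(1)   # target network evaluates it
        return r + gamma * (1 - done) * q_eval
```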
Multi-step returns
Q-learning target: $y_{j,t} = r_{j,t} + \gamma \max_{a'} Q_{\phi'}(s_{j,t+1}, a')$
Early on in training, the $Q$ term only gives us random noise because the network is bad, so the reward term is the most important. But later in training, when $Q_{\phi'}$ gets better and better, the $Q$ term dominates (since it summarizes many future rewards rather than just a single-step reward) and becomes more important.
So, as in actor-critic algorithms, we can use a “Monte Carlo”-style N-step sum of rewards: $y_{j,t} = \sum_{t'=t}^{t+N-1} \gamma^{t'-t} r_{j,t'} + \gamma^N \max_{a'} Q_{\phi'}(s_{j,t+N}, a')$ (a sketch of this target follows the list below)
- Higher variance
- but lower bias
- Faster learning especially early on
- but only correct when learning on-policy
- From the second step onward it actually matters which action was taken, because the exploration policy that collected the data might differ from the current policy
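A sketch of the N-step target computed from a stored trajectory segment; `q_max_at_tN` stands in for $\max_{a'} Q_{\phi'}(s_{t+N}, a')$ and is passed in as an assumption:

```python
def n_step_target(rewards, q_max_at_tN, t, N, gamma=0.99):
    """y_t = sum_{t'=t}^{t+N-1} gamma^(t'-t) * r_t'  +  gamma^N * max_a' Q_phi'(s_{t+N}, a')."""
    g = 0.0
    for k in range(N):
        g += (gamma ** k) * rewards[t + k]
    return g + (gamma ** N) * q_max_at_tN
```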
How to fix the “on-policy only” limitation?
- Ignore the problem (😂 Yeah this is SOOOO CS)
- Often works very well
- Cut the trace
- dynamically choose N to get only on-policy data
- Works well when data mostly on-policy, and action space is small
- Importance sampling
- “Safe and efficient off-policy reinforcement learning,” Munos et al., 2016
Extending Q-Learning to continuous action space
Problem with continuous actions:
- Our (greedy) policy has to select $\arg\max_a Q_\phi(s, a)$
- When evaluating the target $y_j = r_j + \gamma \max_{a'} Q_{\phi'}(s'_j, a')$ for training we also need a max over a continuous action space
- Particularly problematic for the target (since this runs in the inner loop!)
Options:
- Continuous Optimization Procedure
- Gradient based optimization (like SGD) is a bit slow in the inner loop
- Action space typically low-dimensional ⇒ So no need to do gradient optimization
- Stochastic optimization
- Uses the fact that action space is low-dimensional (easy to solve for optima)
- Use a function class that is easy to optimize
- Quadratic in the action: $Q_\phi(s, a) = -\tfrac{1}{2} (a - \mu_\phi(s))^T P_\phi(s) (a - \mu_\phi(s)) + V_\phi(s)$ (sketch below)
- Normalized Advantage Functions (NAF) architecture (Gu, Lillicrap, Sutskever, L., ICML 2016)
- No change to algorithm
- Just as efficient as Q-Learning
- But loses representational power
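A small sketch of the quadratic-in-action form; the network that would produce `mu`, `P`, and `V` is omitted, and the example values are made up:

```python
import numpy as np

def quadratic_q(a, mu, P, V):
    """Q(s, a) = -1/2 (a - mu(s))^T P(s) (a - mu(s)) + V(s); argmax_a Q(s, a) = mu(s)."""
    d = a - mu
    return -0.5 * d @ P @ d + V

mu = np.array([0.3, -0.1])            # hypothetical mu(s)
P = np.eye(2)                         # hypothetical positive-definite P(s)
# The maximizing action is available in closed form (it is just mu), so the max is free:
print(quadratic_q(mu, mu, P, V=1.0))  # maximum value equals V(s) = 1.0
```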
- Learn an approximate maximizer
- Learn a function maximizer using neural nets
- DDPG (Lillicrap et al. ICLR 2016)
- “Deterministic” actor-critic (really approximate Q-Learning)
- $\max_a Q_\phi(s, a) = Q_\phi(s, \arg\max_a Q_\phi(s, a))$
- Idea is to train another network $\mu_\theta(s)$ such that $\mu_\theta(s) \approx \arg\max_a Q_\phi(s, a)$, i.e. solve $\theta \leftarrow \arg\max_\theta Q_\phi(s, \mu_\theta(s))$ by backpropagating through $Q_\phi$
- New target: $y_j = r_j + \gamma Q_{\phi'}(s'_j, \mu_\theta(s'_j))$ (a sketch follows this list)
- classic: NFQCA
- recent: TD3 and SAC
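A sketch of the “learn an approximate maximizer” idea in the DDPG style; dimensions, architectures, and the learning rate are illustrative assumptions:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
mu_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
mu_opt = torch.optim.Adam(mu_net.parameters(), lr=1e-4)

def maximizer_update(s):
    # Train mu_theta(s) to approximate argmax_a Q_phi(s, a) by ascending Q through the action input.
    a = mu_net(s)
    q = q_net(torch.cat([s, a], dim=1)).mean()
    loss = -q                                   # maximize Q  <=>  minimize -Q (only mu_net is updated)
    mu_opt.zero_grad(); loss.backward(); mu_opt.step()
```

The Q-function target then becomes $y_j = r_j + \gamma Q_{\phi'}(s'_j, \mu_\theta(s'_j))$, with no inner-loop optimization.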
Stochastic Optimization
Simple solution: $\max_a Q(s, a) \approx \max\{ Q(s, a_1), \ldots, Q(s, a_N) \}$
where $(a_1, \ldots, a_N)$ are sampled from some distribution (e.g. uniform); see the sketch after the list below
- Dead simple
- Efficiently parallelizable
- But - not very accurate
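A sketch of the “dead simple” sampled max; the `q_fn(s, a)` signature and the uniform bounds are assumptions:

```python
import numpy as np

def approx_max_q(q_fn, s, n_samples=128, low=-1.0, high=1.0, act_dim=2):
    """Approximate max_a Q(s, a) by evaluating Q on uniformly sampled actions."""
    actions = np.random.uniform(low, high, size=(n_samples, act_dim))
    values = np.array([q_fn(s, a) for a in actions])   # trivially parallelizable in practice
    best = int(np.argmax(values))
    return values[best], actions[best]
```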
More accurate (works OK for up to ~40 dims):
- Cross-entropy method (CEM)
- simple iterative stochastic optimization
- samples actions from a distribution as in the simple method, but then refits the sampling distribution to the best candidates and keeps sampling from a smaller and smaller region (a minimal sketch follows this list)
- CMA-ES
- substantially less simple iterative stochastic optimization
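A minimal CEM sketch for the inner maximization; the iteration count, sample count, elite fraction, and Gaussian proposal are standard but arbitrary choices:

```python
import numpy as np

def cem_max_q(q_fn, s, act_dim=2, n_samples=64, n_elite=8, n_iters=4):
    """Iteratively refit a Gaussian over actions to the top-scoring samples."""
    mean, std = np.zeros(act_dim), np.ones(act_dim)
    for _ in range(n_iters):
        actions = np.random.normal(mean, std, size=(n_samples, act_dim))
        values = np.array([q_fn(s, a) for a in actions])
        elites = actions[np.argsort(values)[-n_elite:]]        # keep the best samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean                                                # approximate argmax_a Q(s, a)
```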
Implementation Tips
- Q-learning takes some care to stabilize
- Test on easy, reliable tasks first, make sure the implementation is correct
- Large replay buffers help improve stability
- Looks more like fitted Q-iteration
- Takes time, be patient - might be no better than random for a while
- Start with high exploration (epsilon) and gradually reduce
- Bellman error gradients can be big; clip gradients or use Huber loss
- Double Q-learning helps a lot in practice
- Simple & no downsides!
- N-step returns also help a lot
- but have some downsides (will systematically bias the objective)
- Schedule exploration (high to low) and learning rates (high to low), Adam optimizer can help too
- Run multiple random seeds, it’s very inconsistent between runs
Suggested Readings
- Classic papers
- Watkins. (1989). Learning from delayed rewards: introduces Q-learning
- Riedmiller. (2005). Neural fitted Q-iteration: batch-mode Q-learning with neural networks
- Deep reinforcement learning Q-learning papers
- Lange, Riedmiller. (2010). Deep auto-encoder neural networks in reinforcement learning: early image-based Q-learning method using autoencoders to construct embeddings
- Mnih et al. (2015). Human-level control through deep reinforcement learning: Q-learning with convolutional networks for playing Atari.
- Van Hasselt, Guez, Silver. (2015). Deep reinforcement learning with double Q-learning: a very effective trick to improve performance of deep Q-learning.
- Lillicrap et al. (2016). Continuous control with deep reinforcement learning: continuous Q-learning with actor network for approximate maximization.
- Gu, Lillicrap, Sutskever, L. (2016). Continuous deep Q-learning with model-based acceleration: continuous Q-learning with action-quadratic value functions.
- Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas (2016). Dueling network architectures for deep reinforcement learning: separates value and advantage estimation in Q-function.