We want good exploration to gather more data and address the distribution shift, but with a poor early model we cannot yet deploy good exploration techniques
Essentially, the model overfits as a function approximator early on, and the exploration policy gets stuck in a local minimum/maximum
Uncertainty Estimation
Uncertainty estimation helps us reduce the performance gap.
With this trick, the expected reward under a high-variance prediction is low even when the mean prediction is the same, so the planner avoids actions whose outcomes the model is unsure about (see the numeric sketch below)
However:
We still need exploration to get a better model
Expected value is not the same as pessimistic value
But how do we build it?
Use the NN's own output uncertainty?
It cannot express uncertainty about unseen data accurately ⇒ when queried on out-of-sample data, the uncertainty output is arbitrary
“The model is certain about data, but we are not certain about the model”
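To make the pessimism effect above concrete, here is a tiny numpy sketch with a hypothetical peaked reward: two next-state predictions share the same mean but differ in variance, and the expected reward under the high-variance one is much lower.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(s):
    # Hypothetical reward: peaked at s = 0, falls off quickly away from it.
    return np.exp(-s ** 2)

mean = 0.0
for std in (0.1, 2.0):
    # Monte Carlo estimate of E[r(s')] under a Gaussian next-state prediction.
    samples = rng.normal(mean, std, size=100_000)
    print(f"std={std:3.1f}  r(mean)={reward(mean):.3f}  E[r(s')]={reward(samples).mean():.3f}")

# Both predictions have the same mean, but the high-variance one has a much
# lower expected reward, so an uncertainty-aware planner avoids relying on it.
```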
Estimate Model Uncertainty
The usual approach estimates argmaxθ log p(θ∣D) = argmaxθ log p(D∣θ) (equal assuming a uniform prior), which is a single point estimate
Can we instead estimate the full distribution p(θ∣D)?
We can use this to get a posterior distribution over the next state
∫p(st+1∣st,at,θ)p(θ∣D)dθ
Bayesian Neural Nets
Every weight, instead of being a number, is a distribution
But it’s difficult to represent the joint distribution over all the parameters
So we approximate:
p(θ∣D)=∏ip(θi∣D)
Very Crude approximation
p(θi∣D)=N(μi,σi2)
Instead of learning a single numerical value for each weight, learn its mean and standard deviation/variance (see the sketch below)
Blundell et al. Weight Uncertainty in Neural Networks
Gal et al. Concrete Dropout
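A minimal PyTorch sketch of the mean-field idea above: each weight gets a mean and a log-standard-deviation, sampled via the reparameterization trick. This is only a simplified illustration (the KL term to the prior from Blundell et al. is omitted); the layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class MeanFieldLinear(nn.Module):
    """Linear layer whose weights are independent Gaussians N(mu_i, sigma_i^2)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_dim, in_dim))
        self.w_logstd = nn.Parameter(torch.full((out_dim, in_dim), -3.0))
        self.b = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):
        # Reparameterization trick: sample w = mu + sigma * eps, eps ~ N(0, I),
        # so gradients flow into both the means and the (log) standard deviations.
        eps = torch.randn_like(self.w_mu)
        w = self.w_mu + self.w_logstd.exp() * eps
        return x @ w.T + self.b

# At prediction time, sample the weights several times; the spread of the
# outputs reflects (approximate) model uncertainty.
layer = MeanFieldLinear(4, 1)
x = torch.randn(8, 4)
preds = torch.stack([layer(x) for _ in range(10)])
print(preds.mean(0).shape, preds.std(0).shape)
```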
Bootstrap Ensembles
Train several different neural nets, each on a bootstrapped sample Di (sampled with replacement from D), as sketched below
So each network learns slightly differently
In theory each NN fits its own training data well but makes different mistakes outside of the training data
The approximation is crude, because the number of models is usually small (training many models is expensive)
But it works well in practice
Resampling with replacement is usually unnecessary, because SGD and random initialization usually make the models sufficiently independent
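A toy sketch of a bootstrap ensemble of dynamics models on hypothetical 1-D data (here using scikit-learn MLPs as stand-ins): each member is trained on a resample of the dataset, and the members' disagreement serves as the uncertainty estimate.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical 1-D dynamics dataset: s' = s + a + noise, observed only on a narrow region.
S = rng.uniform(-1, 1, size=(500, 2))             # columns: state, action
Y = S[:, 0] + S[:, 1] + 0.05 * rng.normal(size=500)

# Train each ensemble member on a bootstrap resample of the dataset.
models = []
for k in range(5):
    idx = rng.integers(0, len(S), size=len(S))    # sample with replacement
    m = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=k)
    m.fit(S[idx], Y[idx])
    models.append(m)

# Ensemble disagreement (std of the members' predictions) acts as uncertainty:
# small near the training data, larger far away from it.
for query in ([0.0, 0.0], [5.0, 5.0]):
    preds = np.array([m.predict([query])[0] for m in models])
    print(query, "mean:", preds.mean().round(2), "std:", preds.std().round(3))
```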
With Complex Observations (images)
The observation prediction model has to deal with very high-dimensional outputs
Redundancy
Partial Observability
Can we separately learn p(ot∣st), which is high-dimensional but not dynamic, and p(st+1∣st,at), which is low-dimensional but dynamic?
Now we need to maximize the likelihood of both models: maxϕ (1/N) ∑i ∑t E[ log pϕ(st+1,i ∣ st,i, at,i) + log pϕ(ot,i ∣ st,i) ]
Note that the expectation is conditioned on the observed sequence: it is taken under (st, st+1) ∼ p(st, st+1 ∣ o1:T, a1:T), and we never observe st directly
So can we learn an additional approximate posterior (encoder) qψ(st ∣ o1:t, a1:t) to stand in for it? (see the sketch after the list below)
We also have other choices for this encoder:
qψ(st,st+1∣o1:T,a1:T) - full smoothing posterior
most accurate
most complicated to train
Preferred when the state is less fully-observed
qψ(st∣ot)
single-step encoder
simplest but less accurate
stochastic case requires variational inference
Deterministic: qψ(st∣ot)=δ(st=gψ(ot))
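A minimal PyTorch sketch of the deterministic single-step encoder case above: encode observations with gψ, and jointly minimize the negative log-likelihoods of the low-dimensional latent dynamics model and the high-dimensional observation model (unit-variance Gaussians, so they reduce to squared errors). All network sizes and the data tensors are hypothetical placeholders; real data would come from rollouts.

```python
import torch
import torch.nn as nn

obs_dim, state_dim, act_dim = 64, 8, 2

encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))               # g_psi(o_t)
dynamics = nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))  # mean of p(s_{t+1} | s_t, a_t)
decoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, obs_dim))               # mean of p(o_t | s_t)

opt = torch.optim.Adam(
    [*encoder.parameters(), *dynamics.parameters(), *decoder.parameters()], lr=1e-3
)

# Hypothetical batch of transitions (o_t, a_t, o_{t+1}).
o_t, a_t, o_tp1 = torch.randn(32, obs_dim), torch.randn(32, act_dim), torch.randn(32, obs_dim)

for step in range(100):
    s_t = encoder(o_t)        # s_t = g_psi(o_t), deterministic encoder
    s_tp1 = encoder(o_tp1)
    dyn_loss = ((dynamics(torch.cat([s_t, a_t], -1)) - s_tp1) ** 2).mean()  # -log p(s_{t+1}|s_t,a_t)
    obs_loss = ((decoder(s_t) - o_t) ** 2).mean()                           # -log p(o_t|s_t)
    loss = dyn_loss + obs_loss
    opt.zero_grad(); loss.backward(); opt.step()
```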
To use uncertainty
Before uncertainty is introduced, we plan by maximizing J(a1,…,aH) = ∑t r(st, at), assuming st+1 = f(st, at)
Now (with the bootstrapped ensemble) we average over the models: J(a1,…,aH) = (1/N) ∑i ∑t r(st,i, at), where st+1,i = fi(st,i, at) (see the sketch after this list)
Other options:
Moment matching
more complex posterior estimation with BNNs
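A minimal numpy sketch of the ensemble-averaging step above: each candidate action sequence is rolled out under every model in the ensemble and scored by the average return, here inside a simple random-shooting planner. The dynamics models and reward are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(s, a):
    # Hypothetical reward: stay near the origin with small actions.
    return -(s ** 2).sum() - 0.1 * (a ** 2).sum()

# Stand-in ensemble: each "model" f_i is a slightly different linear dynamics function.
ensemble = [lambda s, a, A=rng.normal(1.0, 0.05): A * s + a for _ in range(5)]

def expected_return(s0, actions):
    """Average the return of an open-loop action sequence over all ensemble members."""
    total = 0.0
    for f in ensemble:
        s, ret = s0.copy(), 0.0
        for a in actions:
            ret += reward(s, a)
            s = f(s, a)                 # s_{t+1,i} = f_i(s_{t,i}, a_t)
        total += ret
    return total / len(ensemble)

# Random-shooting planner: pick the candidate sequence with the best average return.
s0 = np.array([1.0])
candidates = [rng.uniform(-1, 1, size=(10, 1)) for _ in range(64)]
best = max(candidates, key=lambda acts: expected_return(s0, acts))
print("first action of best sequence:", best[0])
```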
Further Readings
Classic
Deisenroth et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search.
Recent papers:
Nagabandi et al. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning.
Chua et al. Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models.
Feinberg et al. Model-Based Value Expansion for Efficient Model-Free Reinforcement Learning.
Buckman et al. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion.
Can we just backprop through policy / dynamics model?
Similar parameter sensitivity problems as shooting methods
But we no longer have a convenient second-order LQR-like method, because the policy parameters couple all the time steps
Similar problems to training long RNNs with BPTT
Vanishing and exploding gradients
Let’s compare the policy gradient: ∇θJ ≈ (1/N) ∑i ∑t ∇θ log πθ(at,i ∣ st,i) Q̂i,t
with the backprop (pathwise) gradient: ∇θJ = ∑t (drt/dst) ∏t′=2..t (dst′/dat′−1)(dat′−1/dst′−1), a long product of Jacobians (see the numeric sketch below)
Policy gradient might be more stable (if enough samples are used) because it does not require multiplying many Jacobians
Parmas et al. ’18: PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos
If we instead use the policy gradient on rollouts generated by our learned dynamics model, there is still a problem (even though generating those samples no longer requires interacting with the physical world or a simulator)
It does not fully solve the problem: the point of model-based RL is sample efficiency, long model rollouts drift away from the real data, so we must branch short rollouts from sampled real-world states, and that restricts how much we can change the policy before collecting new data
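A tiny numpy illustration of the Jacobian-product issue: multiplying many stand-in Jacobians along a long rollout makes the pathwise gradient's magnitude vanish or explode depending on their scale, exactly the BPTT pathology; the likelihood-ratio (policy) gradient never forms this product.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, horizon = 8, 200

for scale, label in [(0.9, "contracting dynamics"), (1.1, "expanding dynamics")]:
    # Stand-in Jacobians ds_{t+1}/ds_t of a long deterministic rollout:
    # random orthogonal matrices scaled up or down.
    jacobians = [scale * np.linalg.qr(rng.normal(size=(dim, dim)))[0] for _ in range(horizon)]
    prod = np.eye(dim)
    for J in jacobians:
        prod = J @ prod
    print(f"{label}: ||product of {horizon} Jacobians|| = {np.linalg.norm(prod):.2e}")

# 0.9^200 ~ 7e-10 and 1.1^200 ~ 2e8: the backprop-through-time gradient either
# vanishes or explodes, while the policy gradient avoids multiplying Jacobians.
```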
Model-based RL with Policies
Model-Based Acceleration (MBA)
Gu et al. Continuous deep Q-learning with model-based acceleration. ‘16
Model-Based Value Expansion (MVE)
Feinberg et al. Model-based value expansion. ’18
Model-Based Policy Optimization (MBPO)
Janner et al. When to trust your model: model-based policy optimization. ‘19
Good Idea:
Sample Efficient
Bad Idea:
Additional source of bias if the model is not correct
Reduce this by using ensemble methods
For states the model has never seen, its predictions are still unreliable, so we still cannot select good actions there
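A minimal sketch of the short branched-rollout idea these methods share (closest in spirit to MBPO): start model rollouts from states in the real replay buffer, keep them short to limit model bias, and hand the synthetic transitions to a model-free learner. The `model_step`, `model_reward`, `policy`, and buffers here are hypothetical stand-ins, not the published implementations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for pieces that would normally be learned:
def model_step(s, a):          # learned dynamics model (here a fake linear model)
    return s + a + 0.01 * rng.normal(size=s.shape)

def model_reward(s, a):        # learned (or known) reward function
    return float(-(s ** 2).sum())

def policy(s):                 # current policy being trained by the model-free learner
    return np.clip(-0.5 * s, -1, 1) + 0.1 * rng.normal(size=s.shape)

real_buffer = [rng.uniform(-1, 1, size=2) for _ in range(100)]   # states visited in the real world
model_buffer = []                                                # synthetic transitions for the learner

def generate_branched_rollouts(num_rollouts=50, k=3):
    """Branch k-step model rollouts from real states; a short k keeps model bias small."""
    for _ in range(num_rollouts):
        s = real_buffer[rng.integers(len(real_buffer))].copy()
        for _ in range(k):
            a = policy(s)
            s_next = model_step(s, a)
            model_buffer.append((s, a, model_reward(s, a), s_next))
            s = s_next

generate_branched_rollouts()
print(len(model_buffer), "synthetic transitions ready for an off-policy learner (e.g. SAC)")
```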
Multi-step Models and Successor Representations
Therefore Vπ(st) = (1/(1−γ)) ∑i μiπ(st) r(i)
where μiπ(st) = (1−γ) ∑t′≥t γ^(t′−t) p(st′ = i ∣ st) is the “successor representation”
Dayan Improving Generalization for Temporal Difference Learning: The Successor Representation, 1993.
Hmm…looks like a Bellman backup with “reward” r(st) = (1−γ)δ(st = i) (see the worked example below)
But issues:
Not clear if learning the successor representation is easier than model-free RL
How to scale to large state spaces?
How to extend to continuous state spaces?
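A small numpy example of the tabular case: compute the successor representation of a fixed policy on a hypothetical 5-state random-walk chain by iterating the Bellman backup with “reward” (1−γ)δ(st = i), then check that the value recovered from μ matches ordinary policy evaluation.

```python
import numpy as np

n, gamma = 5, 0.9

# Hypothetical fixed policy on a 5-state chain: move left/right with prob 0.5 (reflecting ends).
P = np.zeros((n, n))
for s in range(n):
    P[s, max(s - 1, 0)] += 0.5
    P[s, min(s + 1, n - 1)] += 0.5

r = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # reward only in the last state

# Successor representation via the Bellman backup mu <- (1 - gamma) I + gamma P mu.
mu = np.zeros((n, n))
for _ in range(1000):
    mu = (1 - gamma) * np.eye(n) + gamma * P @ mu

V_from_sr = mu @ r / (1 - gamma)                        # V^pi(s) = sum_i mu_i(s) r(i) / (1 - gamma)
V_exact = np.linalg.solve(np.eye(n) - gamma * P, r)     # ordinary policy evaluation
print(np.allclose(V_from_sr, V_exact))                  # True: the SR summarizes the policy's dynamics
```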
To scale to large state spaces, use successor features of st
If we can express the reward function as a linear combination of state features, r(s) = ∑j ϕj(s) wj = ϕ(s)⊤w
Turns out we can express the value function as Vπ(st) = ψπ(st)⊤w, where ψjπ(st) = ∑t′≥t γ^(t′−t) E[ϕj(st′) ∣ st] is the “successor feature”
The dimensionality of ϕ and ψ would be a lot lower than the number of possible states
Can also construct a Q-function-like version: Qπ(st,at) = ψπ(st,at)⊤w
(when r(s)=ϕ(s)⊤w)
Using Successor Features
Recover a Q-function very quickly
Train ψπ(st,at) via Bellman backups
Get some reward samples {si,ri}
Get w←argminw∑i∣∣ϕ(si)⊤w−ri∣∣2
Recover Qπ(st,at)≈ψπ(st,at)⊤w
π’(s)=argmaxaψπ(s,a)⊤w
This is only equivalent to one step of policy iteration, because Qπ is the Q-function of the old policy π that ψπ was trained for, not of π’
Recover many Q-functions
Train ψπk(st,at) for many policies πk via Bellman backups
Get some reward samples {si,ri}
Get w←argminw∑i∣∣ϕ(si)⊤w−ri∣∣2
Recover Qπk(st,at)≈ψπk(st,at)⊤w for every πk
π’(s)=argmaxamaxkψπk(s,a)⊤w
This is still only equivalent to one step of policy improvement, but now taken with respect to the best of the k policies
i.e., we find the highest-reward policy in each state (see the sketch below)
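A minimal numpy sketch of the procedure above, with hypothetical tabular ψ-features: fit w by least squares on reward samples, recover Qπk ≈ ψπk⊤w for each stored policy, and act by the max over policies and actions (generalized policy improvement). In practice ψπk would be trained via Bellman backups rather than generated randomly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, feat_dim, n_policies = 20, 4, 6, 3

# Hypothetical quantities: state features phi(s) and pre-trained successor features
# psi^{pi_k}(s, a) for several policies.
phi = rng.normal(size=(n_states, feat_dim))
psi = rng.normal(size=(n_policies, n_states, n_actions, feat_dim))

# New task: reward samples {(s_i, r_i)}; fit r(s) ~ phi(s)^T w by least squares.
s_samples = rng.integers(0, n_states, size=200)
true_w = rng.normal(size=feat_dim)
r_samples = phi[s_samples] @ true_w + 0.01 * rng.normal(size=200)
w, *_ = np.linalg.lstsq(phi[s_samples], r_samples, rcond=None)

# Recover Q^{pi_k}(s, a) ~ psi^{pi_k}(s, a)^T w for every stored policy, then
# pick actions by the best policy in each state (generalized policy improvement).
Q = psi @ w                               # shape: (n_policies, n_states, n_actions)
pi_new = Q.max(axis=0).argmax(axis=1)     # pi'(s) = argmax_a max_k Q^{pi_k}(s, a)
print(pi_new[:5])
```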
Continuous Successor Representations (C-Learning)
Positive set during training: future states drawn from the policy’s discounted future-state distribution pπ(sfuture ∣ st, at), i.e., states that occur later on the same trajectory
Negative set: states just pulled randomly (from the marginal over states)
To train (this is the on-policy algorithm; an off-policy algorithm can also be derived and is also very efficient):
Sample s∼pπ(s) ⇒ run policy, sample from trajectories
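A PyTorch sketch of how the positive/negative sets above could be built and used, assuming an on-policy batch of trajectories is already collected; the geometric offset for positives and the classifier architecture are illustrative choices, not the exact recipe from the C-learning paper.

```python
import torch
import torch.nn as nn

gamma, state_dim, act_dim = 0.9, 4, 2

# Hypothetical on-policy data: a batch of trajectories of states and actions.
T, B = 50, 16
states = torch.randn(B, T, state_dim)
actions = torch.randn(B, T, act_dim)

# Classifier p(F = 1 | s_t, a_t, s_future): does s_future come from this (s_t, a_t)'s future?
clf = nn.Sequential(nn.Linear(2 * state_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    b = torch.randint(B, (128,))
    t = torch.randint(T - 1, (128,))
    # Positives: a later state from the same trajectory, offset ~ Geometric(1 - gamma).
    offset = torch.distributions.Geometric(1 - gamma).sample((128,)).long() + 1
    t_future = torch.clamp(t + offset, max=T - 1)
    s_pos = states[b, t_future]
    # Negatives: random states pulled from anywhere in the batch.
    s_neg = states[torch.randint(B, (128,)), torch.randint(T, (128,))]

    s_t, a_t = states[b, t], actions[b, t]
    logits_pos = clf(torch.cat([s_t, a_t, s_pos], -1)).squeeze(-1)
    logits_neg = clf(torch.cat([s_t, a_t, s_neg], -1)).squeeze(-1)
    loss = bce(logits_pos, torch.ones(128)) + bce(logits_neg, torch.zeros(128))
    opt.zero_grad(); loss.backward(); opt.step()

# The trained classifier's odds estimate the ratio p_pi(s_future | s_t, a_t) / p(s_future),
# a continuous analogue of the successor representation.
```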