What is the probability of being optimal from now until the end of the trajectory, given the current state and action?
Compute the policy p(a_t ∣ s_t, O_{1:T})
Compute the forward messages α_t(s_t) = p(s_t ∣ O_{1:t−1})
What is the probability of being in state s_t, given that all previous timesteps were optimal?
Backward Messages
p(a_{t+1} ∣ s_{t+1}) ⇒ Which actions are likely a priori: if we don’t know whether we are optimal or not, how likely are we to choose a particular action? We assume a uniform prior for now
Reasonable because:
Don’t know anything about the policy, reasonable to assume uniform
Can modify reward function later to impose non-uniformity
This algorithm is called the backward pass: we compute β recursively from t = T−1 down to 1
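The backward recursion can be sketched for a small tabular MDP. This is only a minimal sketch under stated assumptions: states and actions are discrete, the action prior is uniform, p(O_t = 1 ∣ s_t, a_t) = exp(r(s_t, a_t)) with non-positive rewards, and time is 0-indexed. The names `backward_messages`, `P`, and `r` are hypothetical and do not come from the notes.

```python
import numpy as np

def backward_messages(P, r, T):
    """Backward pass for a tabular MDP (illustrative sketch).

    P: (S, A, S) array, P[s, a, s'] = p(s' | s, a)
    r: (S, A) array of rewards, assumed <= 0 so exp(r) is a valid probability
    Returns beta_sa with shape (T, S, A) and beta_s with shape (T, S).
    """
    S, A, _ = P.shape
    beta_sa = np.zeros((T, S, A))
    beta_s = np.zeros((T, S))
    next_beta_s = np.ones(S)  # past the horizon there is nothing left to be optimal about
    for t in reversed(range(T)):
        # beta_t(s, a) = p(O_t | s, a) * E_{s' ~ p(.|s, a)}[beta_{t+1}(s')]
        beta_sa[t] = np.exp(r) * (P @ next_beta_s)
        # beta_t(s) = E_{a ~ uniform}[beta_t(s, a)]
        beta_s[t] = beta_sa[t].mean(axis=1)
        next_beta_s = beta_s[t]
    return beta_sa, beta_s
```

At the last step the message reduces to exp(r), which matches the base case of the recursion.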
We will take a closer look at the backward pass
Let V_t(s_t) = log β_t(s_t), Q_t(s_t, a_t) = log β_t(s_t, a_t)
We see that V_t(s_t) → max_{a_t} Q_t(s_t, a_t) as Q_t(s_t, a_t) gets bigger ⇒ we call this a softmax (not the softmax in neural nets, but a soft relaxation of the max operator)
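A quick numerical check of this limit, assuming a discrete action set so that the integral becomes a sum. The helper name `soft_value` and the example values are illustrative only.

```python
import numpy as np

def soft_value(q):
    """V = log sum_a exp(Q(a)), computed stably by subtracting the max first."""
    q = np.asarray(q, dtype=float)
    m = q.max()
    return m + np.log(np.exp(q - m).sum())

q = np.array([1.0, 2.0, 3.0])
# Gap between the soft max and the hard max as the Q-values are scaled up.
gaps = [soft_value(scale * q) - (scale * q).max() for scale in (1.0, 10.0, 100.0)]
```

The gap shrinks toward zero as the scale grows, which is the sense in which the soft max approaches the hard max.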
Let’s also evaluate Qt
If the transitions are deterministic, this update of Q_t is exactly the Bellman backup: Q_t(s_t, a_t) = r(s_t, a_t) + V_{t+1}(s_{t+1})
But if the transitions are stochastic, the update becomes Q_t(s_t, a_t) = r(s_t, a_t) + log E[exp(V_{t+1}(s_{t+1}))], so optimistic (lucky) transitions dominate the update, which is not a good idea
The reason is that the optimality variable only asks “How optimal are we at this time step?”
It does not distinguish between being optimal because we got a lucky transition and being optimal because we took the optimal action
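The optimism problem can be seen on a toy example. The sketch below compares the backup that falls out of the inference derivation with the standard expected backup; the three next-state values and the "risky" and "safe" actions are made-up numbers for illustration.

```python
import numpy as np

def optimistic_backup(r, p_next, v_next):
    # Q = r + log E_{s'}[exp(V(s'))]: the backup from the inference derivation
    return r + np.log(np.sum(p_next * np.exp(v_next)))

def bellman_backup(r, p_next, v_next):
    # Q = r + E_{s'}[V(s')]: the standard expected backup
    return r + np.sum(p_next * v_next)

v_next = np.array([0.0, -100.0, -5.0])   # hypothetical next-state values
p_risky = np.array([0.01, 0.99, 0.0])    # 1% chance of the great state, 99% of the terrible one
p_safe = np.array([0.0, 0.0, 1.0])       # always lands in the mediocre state

q_risky_soft = optimistic_backup(0.0, p_risky, v_next)
q_safe_soft = optimistic_backup(0.0, p_safe, v_next)
q_risky_hard = bellman_backup(0.0, p_risky, v_next)
q_safe_hard = bellman_backup(0.0, p_safe, v_next)
```

Under the optimistic backup the risky action looks better than the safe one, because the rare lucky transition dominates the log-expectation; under the expected backup the ranking is reversed.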
Policy Computation
Policy ⇒ p(a_t ∣ s_t, O_{1:T}): what is the probability of an action given the current state and given that all timesteps are optimal?
Natural interpretation:
Better actions are more probable
Random tie-breaking
Analogous to Boltzmann exploration
Approaches greedy policy as temperature decreases
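These properties can be checked on a discrete action set. The sketch assumes the policy takes the form π(a ∣ s) = exp((Q(s, a) − V(s)) / temperature) with V the corresponding soft max; the `temperature` parameter and the function name `soft_policy` are illustrative additions.

```python
import numpy as np

def soft_policy(q, temperature=1.0):
    """pi(a|s) = exp((Q(s,a) - V(s)) / temperature),
    where V(s) = temperature * log sum_a exp(Q(s,a) / temperature)."""
    z = np.asarray(q, dtype=float) / temperature
    z = z - z.max()          # stabilise before exponentiating
    p = np.exp(z)
    return p / p.sum()

tied = soft_policy([1.0, 2.0, 2.0])               # equal Q-values get equal probability
warm = soft_policy([1.0, 2.0, 3.0], temperature=1.0)
cold = soft_policy([1.0, 2.0, 3.0], temperature=0.01)  # nearly greedy
```

Tied actions split probability evenly, and lowering the temperature concentrates the mass on the best action.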
Forward Messages
What if we want to know p(s_t ∣ O_{1:T})?
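One way to answer this, using the definitions α_t(s_t) = p(s_t ∣ O_{1:t−1}) and β_t(s_t) = p(O_{t:T} ∣ s_t) from above, is to combine the two messages. The derivation below is a reconstruction from those definitions and should be read as a sketch:

```latex
\begin{aligned}
p(s_t \mid O_{1:T})
  &= \frac{p(s_t, O_{1:T})}{p(O_{1:T})}
   = \frac{p(O_{t:T} \mid s_t)\, p(s_t, O_{1:t-1})}{p(O_{1:T})} \\
  &\propto \beta_t(s_t)\, p(s_t \mid O_{1:t-1})\, p(O_{1:t-1})
   \propto \beta_t(s_t)\, \alpha_t(s_t)
\end{aligned}
```

The second equality uses the fact that, given s_t, future optimality is independent of past optimality.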
Control as Variational Inference
In continuous high-dimensional spaces we have to approximate
Inference problem:
Marginalizing and conditioning, we get the policy
Instead of asking
“Given that you obtained high reward, what was your action probability and your transition probability?”
We want to ask
“Given that you obtained high reward, what was your action probability given that your transition probability did not change?”
Let’s try variational inference!
Let q(s_{1:T}, a_{1:T}) = p(s_1) ∏_t p(s_{t+1} ∣ s_t, a_t) q(a_t ∣ s_t)
Let x = O_{1:T}, z = (s_{1:T}, a_{1:T})
The variational lower bound
Substituting in our definitions,
⇒ maximize reward and maximize action entropy!
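The bound and the substitution referred to above can be written out as follows. This is a reconstruction from the definitions of q, x, and z given earlier, assuming p(O_t ∣ s_t, a_t) = exp(r(s_t, a_t)) and a uniform action prior (dropped as a constant):

```latex
\begin{aligned}
\log p(x) &\ge \mathbb{E}_{z \sim q(z)}\big[\log p(x, z) - \log q(z)\big] \\
\log p(O_{1:T}) &\ge \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim q}\Big[
    \log p(s_1) + \sum_{t} \log p(s_{t+1} \mid s_t, a_t) + \sum_{t} \log p(O_t \mid s_t, a_t) \\
  &\qquad\qquad - \log p(s_1) - \sum_{t} \log p(s_{t+1} \mid s_t, a_t) - \sum_{t} \log q(a_t \mid s_t) \Big] \\
  &= \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim q}\Big[ \sum_{t} r(s_t, a_t) - \log q(a_t \mid s_t) \Big] \\
  &= \sum_{t} \mathbb{E}_{(s_t, a_t) \sim q}\big[ r(s_t, a_t) + \mathcal{H}(q(a_t \mid s_t)) \big]
\end{aligned}
```

The initial-state and transition terms cancel because q reuses the true dynamics, which leaves only the reward and the action entropy.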
Optimize Variational Lower Bound
Base case: Solve for q(a_T ∣ s_T)
Optimized when q(a_T ∣ s_T) ∝ exp(r(s_T, a_T))
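The base case can be sanity-checked numerically for a single state with a discrete action set: the objective E_q[r] + H(q) should be largest at q ∝ exp(r). The reward vector and the random candidates below are arbitrary illustrative choices, not values from the notes.

```python
import numpy as np

def objective(q, r):
    """E_{a~q}[r(a)] + H(q) for one fixed state s_T."""
    q = np.asarray(q, dtype=float)
    nz = q > 0                      # treat 0 * log 0 as 0
    return float(np.sum(q[nz] * (r[nz] - np.log(q[nz]))))

r = np.array([1.0, 0.0, -2.0])      # hypothetical rewards r(s_T, a) for three actions
q_star = np.exp(r - r.max())
q_star /= q_star.sum()              # q*(a) proportional to exp(r(s_T, a))

rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(len(r)), size=1000)
best_random = max(objective(q, r) for q in candidates)
```

At the optimum the objective equals log ∑_a exp(r(s_T, a)), the soft max of the rewards, and no random candidate exceeds it.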
Levine (2018). Reinforcement Learning and Control as Probabilistic Inference
Q-Learning with Soft Optimality
Standard Q-Learning:
Standard Q-Learning Target
Soft Q-Learning
Soft Q-Learning Target
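The difference between the two variants comes down to how the next-state value in the target is computed: a hard max for standard Q-learning, a soft max (log-sum-exp) for soft Q-learning. The sketch below uses a table `Q` as a stand-in for the parameters ϕ; the learning rate `lr`, the discount `gamma`, and all function names are assumptions made for illustration.

```python
import numpy as np

def hard_value(q_next):
    # Standard target: V(s') = max_a' Q(s', a')
    return np.max(q_next)

def soft_value(q_next):
    # Soft target: V(s') = log sum_a' exp Q(s', a'), computed stably
    m = np.max(q_next)
    return m + np.log(np.sum(np.exp(q_next - m)))

def q_update(Q, s, a, r, s_next, lr=0.1, gamma=0.99, soft=True):
    """One tabular update Q[s, a] += lr * (target - Q[s, a]); Q is modified in place.
    Returns the target that was used."""
    v_next = soft_value(Q[s_next]) if soft else hard_value(Q[s_next])
    target = r + gamma * v_next
    Q[s, a] += lr * (target - Q[s, a])
    return target

Q_soft = np.zeros((2, 2))
Q_hard = np.zeros((2, 2))
t_soft = q_update(Q_soft, s=0, a=0, r=1.0, s_next=1, soft=True)
t_hard = q_update(Q_hard, s=0, a=0, r=1.0, s_next=1, soft=False)
```

The soft target is never smaller than the hard one, and the two agree when one action's Q-value dominates the rest.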
Policy Gradient with Soft Optimality (“Entropy-regularized” policy gradient)
This policy optimizes ∑_t E_{π(s_t,a_t)}[r(s_t, a_t)] + E_{π(s_t)}[H(π(a_t ∣ s_t))]
π(a∣s) ∝ exp(Q_ϕ(s,a)) when π minimizes D_KL(π(a∣s) ∥ (1/Z) exp(Q(s,a)))
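The link between the KL divergence and the entropy-regularized objective can be made explicit by expanding the KL term for a fixed state s, with Z = ∫ exp(Q(s, a)) da. This expansion is a standard identity, written here as a sketch:

```latex
\begin{aligned}
D_{\mathrm{KL}}\!\Big(\pi(a \mid s) \,\Big\|\, \tfrac{1}{Z}\exp(Q(s,a))\Big)
  &= \mathbb{E}_{\pi(a \mid s)}\big[\log \pi(a \mid s) - Q(s,a)\big] + \log Z \\
  &= -\Big(\mathbb{E}_{\pi(a \mid s)}\big[Q(s,a)\big] + \mathcal{H}(\pi(a \mid s))\Big) + \log Z
\end{aligned}
```

Since log Z does not depend on π, minimizing the KL divergence is the same as maximizing expected Q plus entropy, and the minimum (zero) is reached at π(a ∣ s) ∝ exp(Q(s, a)).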
Easier to specialize (finetune) policies for more specific tasks
Principled approach to break ties
Better robustness (due to wider coverage of states)
Can reduce to hard optimality as reward magnitude increases
Good model of human behavior
