What’s the probability of being optimal from now until the end of the trajectory, given the current state and action?
Compute policy p(at∣st,O1:T)
Compute forward messages αt(st)=p(st∣O1:t−1)
What’s the probability of being in state st, given that all previous timesteps were optimal?
Backward Messages
Note:
p(at+1∣st+1) ⇒ Which actions are likely a priori: If we don’t know whether we are optimal or not, how likely are we to choose a particular action? We will assume uniform for now
Reasonable because:
We don’t know anything about the policy, so a uniform prior is a reasonable assumption
We can modify the reward function later to impose a non-uniform action prior (a non-uniform p(at∣st) can be absorbed into the reward)
Therefore,
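(A sketch of the recursion, following Levine (2018); here βt(st,at) = p(Ot:T∣st,at) and βt(st) = p(Ot:T∣st) denote the backward messages.)

βt(st,at) = p(Ot∣st,at) E_{st+1∼p(st+1∣st,at)}[βt+1(st+1)]
βt(st) = E_{at∼p(at∣st)}[βt(st,at)]

with base case βT(sT,aT) = p(OT∣sT,aT).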
This algorithm is called the backward pass: we calculate β recursively from t = T−1 to 1
We will take a closer look at the backward pass
Let Vt(st)=logβt(st), Qt(st,at)=logβt(st,at)
We see that Vt(st) → max_at Qt(st,at) as the Qt(st,at) values get larger ⇒ we call this a softmax (not the softmax used in neural networks, but a soft relaxation of the max operator)
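Concretely (reconstructing the step from the definitions above, with the uniform action prior contributing only a constant):

Vt(st) = log βt(st) = log ∫ exp(Qt(st,at)) dat

which is a log-sum-exp over actions, hence the soft-max behavior.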
Let’s also evaluate Qt
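Taking the log of the recursion for βt(st,at) and using p(Ot∣st,at) = exp(r(st,at)) gives (following Levine, 2018):

Qt(st,at) = r(st,at) + log E_{st+1∼p(st+1∣st,at)}[exp(Vt+1(st+1))]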
If the transitions are deterministic, then this update for Qt is exactly the Bellman backup
But if the transitions are stochastic, then this update lets optimistic (lucky) transitions dominate the backup, which is not a good idea
The reason is that when we query the optimality variables (“how optimal are we at this timestep?”),
we do not distinguish between being optimal because we got a lucky transition and being optimal because we took a good action
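As a concrete illustration, here is a minimal tabular sketch of this backward pass in NumPy. The array names R, P and the horizon T are assumptions for the example, not from the lecture:

```python
import numpy as np
from scipy.special import logsumexp

def soft_backward_pass(R, P, T):
    """Tabular backward pass over T timesteps.

    R: [S, A] rewards, P: [S, A, S'] transition probabilities
    (hypothetical array names). Returns soft values V[t, s] and
    Q[t, s, a], assuming a uniform action prior.
    """
    S, A = R.shape
    V = np.zeros((T, S))
    Q = np.zeros((T, S, A))
    for t in reversed(range(T)):
        if t == T - 1:
            Q[t] = R  # base case: log beta_T(s, a) = r(s, a)
        else:
            # message-passing backup: Q = r + log E_{s'}[exp(V_{t+1}(s'))]
            # (the variational backup derived later uses E_{s'}[V_{t+1}(s')]
            #  instead, which avoids the optimism problem discussed above)
            Q[t] = R + logsumexp(np.log(P + 1e-300) + V[t + 1], axis=-1)
        V[t] = logsumexp(Q[t], axis=-1)  # soft max over actions
    return V, Q
```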
Policy Computation
Policy ⇒ p(at∣st,O1:T): what’s the probability of a certain action, given the current state and that all timesteps are optimal?
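From the messages, with the uniform action prior (the standard result, see Levine (2018)):

p(at∣st,O1:T) = βt(st,at) p(at∣st) / βt(st) ∝ exp(Qt(st,at) − Vt(st))

i.e., a softmax over the soft advantage Qt(st,at) − Vt(st).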
Natural interpretation:
better actions are more probable
Random tie-breaking
Analogous to Boltzmann exploration
Approaches greedy policy as temperature decreases
Forward Messages
Note:
What if we want to know p(st∣O1:T)?
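Combining the two messages gives the state marginal under optimality (a standard result, e.g. Levine (2018)):

p(st∣O1:T) ∝ p(Ot:T∣st) p(st∣O1:t−1) ∝ βt(st) αt(st)

so the marginal is the normalized product of the backward and forward messages.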
Control as Variational Inference
In continuous, high-dimensional spaces we have to approximate this inference
Inference problem: compute the trajectory posterior p(s1:T,a1:T∣O1:T)
Marginalizing and conditioning this posterior gives the policy p(at∣st,O1:T)
However,
Instead of asking
“Given that you obtained high reward, what were your action probabilities and your transition probabilities?”
We want to ask
“Given that you obtained high reward, what was your action probability given that your transition probability did not change?”
Let’s try variational inference!
Let q(s1:T,a1:T)=p(s1)∏tp(st+1∣st,at)q(at∣st)
Let x=O1:T,z=(s1:T,a1:T)
The variational lower bound
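For observed variables x and latent variables z with approximate posterior q(z), the standard evidence lower bound (ELBO) is:

log p(x) = log ∫ p(x,z) dz ≥ E_{z∼q(z)}[log p(x,z) − log q(z)]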
Substituting in our definitions,
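Using x = O1:T, z = (s1:T,a1:T), p(Ot∣st,at) = exp(r(st,at)), and the q defined above, the initial-state and transition terms cancel, leaving (following Levine, 2018):

log p(O1:T) ≥ E_{(s1:T,a1:T)∼q}[∑t r(st,at) − log q(at∣st)] = ∑t ( E_{(st,at)∼q}[r(st,at)] + E_{st∼q}[H(q(at∣st))] )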
⇒ maximize reward and maximize action entropy!
Optimize Variational Lower Bound
Base case: Solve for q(aT∣sT)
optimized when q(aT∣sT)∝exp(r(sT,aT))
Therefore
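(Reconstructing the solution from Levine (2018):)

q(aT∣sT) = exp(r(sT,aT) − V(sT)), where V(sT) = log ∫ exp(r(sT,aT)) daT

Recursing backward in time gives, for every t,

q(at∣st) = exp(Qt(st,at) − Vt(st)), with Qt(st,at) = r(st,at) + E_{st+1∼p(st+1∣st,at)}[Vt+1(st+1)] and Vt(st) = log ∫ exp(Qt(st,at)) dat

Note that the expectation over next states replaces the log-sum-exp of the exact backward pass, which removes the optimism problem noted earlier.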
Levine (2018). Reinforcement Learning and Control as Probabilistic Inference
Q-Learning with Soft Optimality
Standard Q-Learning:
Standard Q-Learning Target
Soft Q-Learning
Soft Q-Learning Target
Policy
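A minimal sketch of the difference in NumPy, assuming discrete actions; r and q_next are hypothetical batch arrays, and target-network details are omitted:

```python
import numpy as np
from scipy.special import logsumexp

def q_target(r, q_next, gamma=0.99, soft=True):
    """Target values for a batch of transitions.

    r: [B] rewards, q_next: [B, A] Q-values at the next states
    (hypothetical names for this sketch).
    """
    if soft:
        v_next = logsumexp(q_next, axis=-1)  # soft max: log sum_a' exp(Q(s', a'))
    else:
        v_next = q_next.max(axis=-1)         # hard max: max_a' Q(s', a')
    return r + gamma * v_next

def soft_policy(q):
    """pi(a|s) = exp(Q(s, a) - V(s)): a softmax over the Q-values."""
    return np.exp(q - logsumexp(q, axis=-1, keepdims=True))
```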
Policy Gradient with Soft Optimality (“entropy-regularized” policy gradient)
This policy optimizes ∑t ( Eπ(st,at)[r(st,at)] + Eπ(st)[H(π(at∣st))] ), i.e., expected reward plus action entropy at every timestep
Intuition:
π(a∣s) ∝ exp(Qϕ(s,a)) when π minimizes DKL(π(a∣s) ∣∣ (1/Z) exp(Q(s,a)))
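Expanding the KL makes the connection to the objective above explicit (Z is the normalizer of exp(Q(s,·))):

DKL(π(a∣s) ∣∣ (1/Z) exp(Q(s,a))) = E_{a∼π}[log π(a∣s) − Q(s,a)] + log Z = −Eπ[Q(s,a)] − H(π(a∣s)) + log Z

so minimizing this KL is the same as maximizing expected Q-value plus policy entropy.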
Benefits of soft optimality:
Easier to specialize (fine-tune) policies for more specific tasks
Principled approach to break ties
Better robustness (due to wider coverage of states)
Can reduce to hard optimality as reward magnitude increases
A good model of human behavior
Suggested Readings (Soft Optimality)
Todorov. (2006). Linearly solvable Markov decision problems: one framework for reasoning about soft optimality.
Todorov. (2008). General duality between optimal control and estimation: primer on the equivalence between inference and control.
Kappen. (2009). Optimal control as a graphical model inference problem: frames control as an inference problem in a graphical model.
Ziebart. (2010). Modeling interaction via the principle of maximal causal entropy: connection between soft optimality and maximum entropy modeling.
Rawlik, Toussaint, Vijaykumar. (2013). On stochastic optimal control and reinforcement learning by approximate inference: temporal difference style algorithm with soft optimality.
Haarnoja*, Tang*, Abbeel, L. (2017). Reinforcement learning with deep energy-based models: soft Q-learning algorithm, deep RL with continuous actions and soft optimality.
Nachum, Norouzi, Xu, Schuurmans. (2017). Bridging the gap between value and policy based reinforcement learning.
Schulman, Abbeel, Chen. (2017). Equivalence between policy gradients and soft Q-learning.
Haarnoja, Zhou, Abbeel, L. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.
Levine. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.