Does RL and optimal control provide a reasonable model of human behavior? Is there a better explanation? Can we frame this problem as probabilistic inference? How does that change our RL algorithms?
What if the data is not optimal? Humans (and animals) make mistakes, but those mistakes usually still lead to successful completion of the task. Some mistakes matter more than others, but good behavior is still the most likely. Can we build a probabilistic model of good behavior?
We will introduce binary (True/False) optimality variables $O_1, \dots, O_T$ that ask: is this agent trying to be optimal at this point in time?

We will take this formula as given (note that all of our rewards need to be negative; without loss of generality, just subtract each reward's maximum value):

$$p(O_t \mid s_t, a_t) = \exp(r(s_t, a_t))$$

Then our probabilistic model is:
$$\begin{split}
p(\underbrace{\tau}_{\mathclap{s_{1:T},\,a_{1:T}}} \mid O_{1:T})
&= \frac{p(\tau, O_{1:T})}{p(O_{1:T})} \\
&\propto p(\tau) \prod_t \exp(r(s_t,a_t)) \\
&= p(\tau) \exp\Big(\sum_t r(s_t,a_t)\Big)
\end{split}$$

This framing:

- Can model suboptimal behavior (important for inverse RL)
- Lets us apply inference algorithms to solve control and planning problems
- Provides an explanation for why stochastic behavior might be preferred (useful for exploration and transfer learning)

To do inference:
1. Compute backward messages $\beta_t(s_t,a_t) = p(O_{t:T} \mid s_t, a_t)$: the probability of being optimal from now until the end of the trajectory, given the current state and action
2. Compute the policy $p(a_t \mid s_t, O_{1:T})$
3. Compute forward messages $\alpha_t(s_t) = p(s_t \mid O_{1:t-1})$: the probability of being in state $s_t$ given that all previous timesteps were optimal
Backward Messages

$$\begin{split}
\beta_t(s_t,a_t)
&= p(O_{t:T} \mid s_t, a_t) \\
&= \int p(O_{t:T}, s_{t+1} \mid s_t, a_t)\, ds_{t+1} \\
&= \int p(O_{t+1:T} \mid s_{t+1})\, p(s_{t+1} \mid s_t, a_t)\, p(O_t \mid s_t, a_t)\, ds_{t+1}
\end{split}$$

Note:
$$p(O_{t+1:T} \mid s_{t+1}) = \int \underbrace{p(O_{t+1:T} \mid s_{t+1}, a_{t+1})}_{\beta_{t+1}(s_{t+1},\,a_{t+1})} \underbrace{p(a_{t+1} \mid s_{t+1})}_{\text{which actions are likely a priori}} da_{t+1}$$

Here $p(a_{t+1} \mid s_{t+1})$ is the action prior: if we don't know whether we are optimal or not, how likely are we to choose a particular action? We will assume it is uniform for now.
This is reasonable because:

- We don't know anything about the policy, so a uniform prior is a sensible default
- We can modify the reward function later to impose a non-uniform prior (see Action Prior in Backward Pass of Control as Inference in RL)

Therefore,
$$\begin{split}
\beta_t(s_t,a_t)
&= \int p(O_{t+1:T} \mid s_{t+1})\, p(s_{t+1} \mid s_t, a_t)\, p(O_t \mid s_t, a_t)\, ds_{t+1} \\
&= p(O_t \mid s_t, a_t)\, \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}[\beta_{t+1}(s_{t+1})]
\end{split}$$

where the state-only message is $\beta_{t+1}(s_{t+1}) = \mathbb{E}_{a_{t+1} \sim p(a_{t+1} \mid s_{t+1})}[\beta_{t+1}(s_{t+1}, a_{t+1})]$.

This algorithm is called the backward pass: we compute $\beta$ recursively from $t = T-1$ down to $t = 1$.
We will take a closer look at the backward pass
Let $V_t(s_t) = \log \beta_t(s_t)$ and $Q_t(s_t, a_t) = \log \beta_t(s_t, a_t)$. Then

$$V_t(s_t) = \log \int \exp(Q_t(s_t,a_t))\, da_t$$

We see that $V_t(s_t) \to \max_{a_t} Q_t(s_t,a_t)$ as the $Q_t(s_t,a_t)$ values get bigger, so we call this a softmax (not the softmax in neural nets, but a soft relaxation of the max operator).
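A quick numerical sketch of this soft-max behavior (assuming NumPy and SciPy are available): as the Q-values are scaled up, log-sum-exp approaches the hard max.

```python
import numpy as np
from scipy.special import logsumexp

q = np.array([1.0, 2.0, 3.0])
for scale in [1, 10, 100]:
    # logsumexp(scale * q) / scale -> max(q) = 3.0 as the values grow
    print(scale, logsumexp(scale * q) / scale)
```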
Let's also evaluate $Q_t$:

$$Q_t(s_t,a_t) = r(s_t,a_t) + \log \mathbb{E}\big[\exp(V_{t+1}(s_{t+1}))\big]$$

If the transitions are deterministic, this update of $Q_t$ is equal to the Bellman backup:

$$Q_t(s_t,a_t) = r(s_t,a_t) + V_{t+1}(s_{t+1})$$

But if the transitions are stochastic, this update of $Q_t$ lets optimistic (lucky) transitions dominate the backup, which is not a good idea. The reason is that when we ask the optimality variables "how optimal are we at this time step?", we are not differentiating between being optimal because we got a lucky transition and being optimal because we performed the optimal action.
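To make the recursion concrete, here is a minimal tabular sketch of the backward pass in log space, assuming a hypothetical reward array `R[s, a]` (with non-positive entries) and transition tensor `P[s, a, s']`, and ignoring the constant contributed by the uniform action prior:

```python
import numpy as np
from scipy.special import logsumexp

def backward_pass(R, P, T):
    """Backward messages in log space for a tabular, finite-horizon MDP.

    R: (S, A) reward array, assumed <= 0 so exp(R) is a valid probability.
    P: (S, A, S) transition tensor, P[s, a, s'] = p(s' | s, a).
    Returns Q[t, s, a] = log beta_t(s, a) and V[t, s] = log beta_t(s).
    """
    S, A = R.shape
    Q = np.zeros((T, S, A))
    V = np.zeros((T, S))
    for t in reversed(range(T)):
        if t == T - 1:
            Q[t] = R                              # beta_T(s, a) = exp(r(s, a))
        else:
            # log E_{s'}[exp(V_{t+1}(s'))]: exact for deterministic dynamics,
            # but "optimistic" (risk-seeking) for stochastic dynamics
            Q[t] = R + np.log(P @ np.exp(V[t + 1]))
        V[t] = logsumexp(Q[t], axis=1)            # soft max over actions
    return Q, V
```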
Policy Computation

The policy is $p(a_t \mid s_t, O_{1:T})$: what is the probability of an action, given the current state and that all timesteps are optimal?

$$\begin{split}
p(a_t \mid s_t, O_{1:T})
&= \pi(a_t \mid s_t) \\
&= p(a_t \mid s_t, O_{t:T}) \\
&= \frac{p(a_t, s_t \mid O_{t:T})}{p(s_t \mid O_{t:T})} \\
&= \frac{p(O_{t:T} \mid a_t, s_t)\, p(a_t, s_t) / p(O_{t:T})}{p(O_{t:T} \mid s_t)\, p(s_t) / p(O_{t:T})} \\
&= \frac{p(O_{t:T} \mid s_t, a_t)}{p(O_{t:T} \mid s_t)} \frac{p(a_t, s_t)}{p(s_t)} \\
&= \frac{\beta_t(s_t, a_t)}{\beta_t(s_t)} \underbrace{p(a_t \mid s_t)}_{\mathclap{\text{action prior, assumed uniform}}} \\
&= \exp(Q_t(s_t, a_t) - V_t(s_t)) = \exp(A_t(s_t, a_t)) \\
&\approx \exp\Big(\underbrace{\tfrac{1}{\alpha}}_{\mathclap{\text{$\alpha$ is an added temperature}}} A_t(s_t, a_t)\Big)
\end{split}$$

Natural interpretation:

- Better actions are more probable
- Analogous to Boltzmann exploration
- Approaches the greedy policy as the temperature decreases
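A minimal sketch of this policy computation on top of the `Q[t]` and `V[t]` tables from the backward-pass sketch above (the `temperature` argument plays the role of the added $\alpha$; all names are illustrative):

```python
def soft_policy(Q_t, V_t, temperature=1.0):
    """pi(a|s) = exp((Q_t(s, a) - V_t(s)) / temperature), renormalized per state.

    temperature = 1 recovers exp(A_t); smaller values approach the greedy policy.
    """
    logits = (Q_t - V_t[:, None]) / temperature
    return np.exp(logits - logsumexp(logits, axis=1, keepdims=True))
```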
Forward Messages

$$\begin{split}
\alpha_t(s_t)
&= p(s_t \mid O_{1:t-1}) \\
&= \int p(s_t, s_{t-1}, a_{t-1} \mid O_{1:t-1})\, ds_{t-1}\, da_{t-1} \\
&= \int p(s_t \mid s_{t-1}, a_{t-1}, O_{1:t-1})\, p(a_{t-1} \mid s_{t-1}, O_{1:t-1})\, p(s_{t-1} \mid O_{1:t-1})\, ds_{t-1}\, da_{t-1} \\
&= \int p(s_t \mid s_{t-1}, a_{t-1})\, p(a_{t-1} \mid s_{t-1}, O_{1:t-1})\, p(s_{t-1} \mid O_{1:t-1})\, ds_{t-1}\, da_{t-1}
\end{split}$$

Note:
$$\begin{split}
p(a_{t-1} \mid s_{t-1}, O_{t-1})\, p(s_{t-1} \mid O_{1:t-1})
&= \frac{p(O_{t-1} \mid s_{t-1}, a_{t-1}) \overbrace{p(a_{t-1} \mid s_{t-1})}^{\text{uniform}}}{p(O_{t-1} \mid s_{t-1})} \frac{p(O_{t-1} \mid s_{t-1}) \overbrace{p(s_{t-1} \mid O_{1:t-2})}^{\alpha_{t-1}(s_{t-1})}}{p(O_{t-1} \mid O_{1:t-2})} \\
&= \frac{p(O_{t-1} \mid s_{t-1}, a_{t-1})}{p(O_{t-1} \mid O_{1:t-2})}\, \alpha_{t-1}(s_{t-1})
\end{split}$$

What if we want to know $p(s_t \mid O_{1:T})$?
$$\begin{split}
p(s_t \mid O_{1:T})
&= \frac{p(s_t, O_{1:T})}{p(O_{1:T})} \\
&= \frac{\overbrace{p(O_{t:T} \mid s_t)}^{\beta_t(s_t)}\, p(s_t, O_{1:t-1})}{p(O_{1:T})} \\
&\propto \beta_t(s_t)\, p(s_t \mid O_{1:t-1})\, p(O_{1:t-1}) \\
&\propto \beta_t(s_t)\, \alpha_t(s_t)
\end{split}$$

In the lecture's figure, the yellow cone is the backward message $\beta$ and the blue cone is the forward message $\alpha$.
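Continuing the tabular sketch from above (same hypothetical `R`, `P`, and `V` arrays), the forward messages and state marginals could be computed as:

```python
def forward_pass(R, P, p_s1, T):
    """Forward messages alpha_t(s_t) = p(s_t | O_{1:t-1}).

    p_s1: initial state distribution, shape (S,).
    Each step is renormalized, which absorbs p(O_{t-1} | O_{1:t-2}) and the
    uniform-action-prior constant.
    """
    S, A = R.shape
    alpha = np.zeros((T, S))
    alpha[0] = p_s1
    for t in range(1, T):
        w = np.exp(R) * alpha[t - 1][:, None]     # p(O_{t-1} | s, a) * alpha_{t-1}(s)
        alpha[t] = np.einsum('sa,sap->p', w, P)   # push through the dynamics
        alpha[t] /= alpha[t].sum()
    return alpha

def state_marginals(alpha, V):
    """p(s_t | O_{1:T}) proportional to beta_t(s_t) * alpha_t(s_t), with beta_t(s_t) = exp(V_t(s_t))."""
    m = np.exp(V) * alpha
    return m / m.sum(axis=1, keepdims=True)
```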
Control as Variational Inference

In continuous, high-dimensional spaces we have to approximate. The inference problem is

$$p(s_{1:T}, a_{1:T} \mid O_{1:T})$$

Marginalizing and conditioning gives the policy

$$\pi(a_t \mid s_t) = p(a_t \mid s_t, O_{1:T})$$

However,

$$p(s_{t+1} \mid s_t, a_t, O_{1:T}) \ne p(s_{t+1} \mid s_t, a_t)$$

Instead of asking "given that you obtained high reward, what were your action probability and your transition probability?", we want to ask "given that you obtained high reward, what was your action probability, given that your transition probability did not change?"

Can we find another distribution $q(s_{1:T}, a_{1:T})$ that is close to $p(s_{1:T}, a_{1:T} \mid O_{1:T})$ but has the true dynamics $p(s_{t+1} \mid s_t, a_t)$?
Let’s try variational inference!
Let $q(s_{1:T}, a_{1:T}) = p(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, q(a_t \mid s_t)$.

Let $x = O_{1:T}$ and $z = (s_{1:T}, a_{1:T})$.

The variational lower bound is

$$\log p(x) \ge \mathbb{E}_{z \sim q(z)}[\log p(x, z) - \log q(z)]$$

Substituting in our definitions,
$$\begin{split}
\log p(O_{1:T})
&\ge \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim q}\Big[\log p(s_1) + \sum_{t=1}^T \log p(s_{t+1} \mid s_t, a_t) + \sum_{t=1}^T \log p(O_t \mid s_t, a_t) \\
&\qquad - \log p(s_1) - \sum_{t=1}^T \log p(s_{t+1} \mid s_t, a_t) - \sum_{t=1}^T \log q(a_t \mid s_t)\Big] \\
&= \mathbb{E}_{(s_{1:T}, a_{1:T}) \sim q}\Big[\sum_t r(s_t, a_t) - \log q(a_t \mid s_t)\Big] \\
&= \sum_t \mathbb{E}_{(s_t, a_t) \sim q}\big[r(s_t, a_t) + H(q(a_t \mid s_t))\big]
\end{split}$$

⇒ maximize reward and maximize action entropy!
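As a minimal sketch, this bound can be estimated from rollouts of $q$ in the real dynamics by logging rewards and action log-probabilities (array names are illustrative):

```python
import numpy as np

def elbo_estimate(rewards, log_q):
    """Monte-Carlo estimate of sum_t E_q[r(s_t, a_t) - log q(a_t | s_t)].

    rewards, log_q: arrays of shape (num_trajectories, T) holding r(s_t, a_t)
    and log q(a_t | s_t) along each sampled trajectory.
    """
    return float(np.mean(np.sum(rewards - log_q, axis=1)))
```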
Optimize Variational Lower Bound

Base case: solve for $q(a_T \mid s_T)$.

$$\begin{split}
q(a_T \mid s_T)
&= \arg\max_{q(a_T \mid s_T)} \mathbb{E}_{s_T \sim q(s_T)}\big[\mathbb{E}_{a_T \sim q(a_T \mid s_T)}[r(s_T, a_T)] + H(q(a_T \mid s_T))\big] \\
&= \arg\max_{q(a_T \mid s_T)} \mathbb{E}_{s_T \sim q(s_T)}\big[\mathbb{E}_{a_T \sim q(a_T \mid s_T)}[r(s_T, a_T) - \log q(a_T \mid s_T)]\big]
\end{split}$$

This is optimized when $q(a_T \mid s_T) \propto \exp(r(s_T, a_T))$:

$$q(a_T \mid s_T) = \frac{\exp(r(s_T, a_T))}{\int \exp(r(s_T, a))\, da} = \exp(Q(s_T, a_T) - V(s_T)), \qquad V(s_T) = \log \int \exp(Q(s_T, a_T))\, da_T$$

Therefore

$$\mathbb{E}_{s_T \sim q(s_T)}\big[\mathbb{E}_{a_T \sim q(a_T \mid s_T)}[r(s_T, a_T) - \log q(a_T \mid s_T)]\big] = \mathbb{E}_{s_T \sim q(s_T)}[V(s_T)]$$
Dynamic Programming Solution!
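For completeness, following the tutorial cited below (Levine 2018), the recursive case for $t < T$ has the same form, except that the next-state value enters through an ordinary expectation over the true dynamics, which removes the optimism of the exact backward pass:

$$Q_t(s_t, a_t) = r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}[V_{t+1}(s_{t+1})], \qquad V_t(s_t) = \log \int \exp(Q_t(s_t, a_t))\, da_t, \qquad q(a_t \mid s_t) = \exp(Q_t(s_t, a_t) - V_t(s_t))$$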
Levine (2018). Reinforcement Learning and Control as Probabilistic Inference
Q-Learning with Soft Optimality

Standard Q-learning update:

$$\phi \leftarrow \phi + \alpha \nabla_\phi Q_\phi(s,a)\big(r(s,a) + \gamma V(s') - Q_\phi(s,a)\big)$$

with the standard target

$$V(s') = \max_{a'} Q_\phi(s', a')$$

Soft Q-learning uses the same update

$$\phi \leftarrow \phi + \alpha \nabla_\phi Q_\phi(s,a)\big(r(s,a) + \gamma V(s') - Q_\phi(s,a)\big)$$

but with the soft target

$$V(s') = \operatorname{soft\,max}_{a'} Q_\phi(s', a') = \log \int \exp(Q_\phi(s', a'))\, da'$$

and the policy

$$\pi(a \mid s) = \exp(Q_\phi(s,a) - V(s)) = \exp(A(s,a))$$
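A minimal tabular sketch of this update, assuming a Q-table `Q` of shape `(S, A)` and a single observed transition `(s, a, r, s_next)` (all names illustrative):

```python
import numpy as np
from scipy.special import logsumexp

def soft_q_update(Q, s, a, r, s_next, lr=0.1, gamma=0.99):
    """One tabular soft Q-learning step with target V(s') = log sum_a' exp(Q(s', a'))."""
    v_next = logsumexp(Q[s_next])              # soft max over next actions
    td_error = r + gamma * v_next - Q[s, a]
    Q[s, a] += lr * td_error
    return Q

def soft_q_policy(Q, s):
    """pi(a|s) = exp(Q(s, a) - V(s)), i.e. a softmax over the Q-values at state s."""
    return np.exp(Q[s] - logsumexp(Q[s]))
```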
Policy Gradient with Soft Optimality ("Entropy-regularized" policy gradient)

The policy $\pi(a \mid s) = \exp(Q_\phi(s,a) - V(s))$ optimizes

$$\sum_t \mathbb{E}_{\pi(s_t, a_t)}[r(s_t, a_t)] + \mathbb{E}_{\pi(s_t)}[H(\pi(a_t \mid s_t))]$$

Intuition: $\pi(a \mid s) \propto \exp(Q_\phi(s,a))$ exactly when $\pi$ minimizes

$$D_{KL}\Big(\pi(a \mid s)\,\Big\|\,\tfrac{1}{Z}\exp(Q(s,a))\Big) = -\mathbb{E}_{\pi(a \mid s)}[Q(s,a)] - H(\pi) + \text{const}$$

so minimizing this KL maximizes expected Q-value plus entropy. This combats premature entropy collapse.
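A minimal PyTorch sketch of an entropy-regularized policy-gradient loss (a standard entropy-bonus formulation; names and the `alpha` coefficient are illustrative):

```python
import torch

def entropy_regularized_pg_loss(logits, actions, advantages, alpha=0.01):
    """REINFORCE-style loss with an entropy bonus: maximize reward and entropy.

    logits:     (batch, num_actions) policy logits
    actions:    (batch,) sampled actions
    advantages: (batch,) advantage estimates, treated as constants
    """
    dist = torch.distributions.Categorical(logits=logits)
    log_prob = dist.log_prob(actions)
    pg_term = -(log_prob * advantages.detach()).mean()
    entropy_bonus = dist.entropy().mean()
    return pg_term - alpha * entropy_bonus
```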
Benefits of Soft Optimality

- Improves exploration and prevents entropy collapse
- Easier to specialize (fine-tune) policies for more specific tasks
- Principled approach to breaking ties
- Better robustness (due to wider coverage of states)
- Can reduce to hard optimality as the reward magnitude increases
- Good model of human behavior
Suggested Readings (Soft Optimality)

- Todorov (2006). Linearly solvable Markov decision problems: one framework for reasoning about soft optimality.
- Todorov (2008). General duality between optimal control and estimation: primer on the equivalence between inference and control.
- Kappen (2009). Optimal control as a graphical model inference problem: frames control as an inference problem in a graphical model.
- Ziebart (2010). Modeling interaction via the principle of maximal causal entropy: connection between soft optimality and maximum entropy modeling.
- Rawlik, Toussaint, Vijayakumar (2013). On stochastic optimal control and reinforcement learning by approximate inference: temporal-difference style algorithm with soft optimality.
- Haarnoja*, Tang*, Abbeel, Levine (2017). Reinforcement learning with deep energy-based models: soft Q-learning algorithm, deep RL with continuous actions and soft optimality.
- Nachum, Norouzi, Xu, Schuurmans (2017). Bridging the gap between value and policy based reinforcement learning.
- Schulman, Abbeel, Chen (2017). Equivalence between policy gradients and soft Q-learning.
- Haarnoja, Zhou, Abbeel, Levine (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.
- Levine (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.