So far, we have hand-designed reward functions to define tasks. What if we want to learn the reward function by observing an expert, and then run reinforcement learning on the learned reward? The idea is to apply the approximate-optimality model (control as inference) from last time, but now learn the reward.
Inverse Reinforcement Learning: infer reward functions from demonstrations
Standard Imitation Learning
- Copy the actions performed by the expert
- No reasoning about the outcomes of actions

Human Imitation Learning
- Copy the intent of the expert
- Might take very different actions!
Problem: many reward functions can explain the same behavior
(Figure: one observed behavior alongside several possible reward functions that could explain it.)

Reward Parameterization
Traditional linear formulation: $r_\psi(s,a) = \sum_i \psi_i f_i(s,a) = \psi^{\top} f(s,a)$, where $f(s,a)$ is a feature function.
Neural net formulation: $r_\psi(s,a)$ is parameterized by a neural network with weights $\psi$.
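As a concrete illustration, here is a minimal numpy sketch of the two parameterizations; the network shape, function names, and random placeholder data are assumptions for the example, not anything specified in the lecture:

```python
import numpy as np

# Minimal sketch of the two reward parameterizations (illustrative names and shapes only).
def linear_reward(psi, f_sa):
    """r_psi(s,a) = psi^T f(s,a), where f_sa = f(s,a) is a hand-designed feature vector."""
    return psi @ f_sa

def neural_net_reward(params, s, a):
    """r_psi(s,a) given by a tiny two-layer MLP; params plays the role of psi."""
    W1, b1, W2, b2 = params
    x = np.concatenate([s, a])
    h = np.tanh(W1 @ x + b1)
    return float(W2 @ h + b2)

# Example usage with random placeholder weights and features.
d_s, d_a, d_f, d_h = 3, 2, 4, 16
rng = np.random.default_rng(0)
psi = rng.normal(size=d_f)
params = (rng.normal(size=(d_h, d_s + d_a)), np.zeros(d_h), rng.normal(size=d_h), 0.0)
print(linear_reward(psi, rng.normal(size=d_f)))
print(neural_net_reward(params, rng.normal(size=d_s), rng.normal(size=d_a)))
```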
Feature Matching IRL

With a linear reward function $r_\psi(s,a) = \sum_i \psi_i f_i(s,a) = \psi^{\top} f(s,a)$: if the features $f$ are important, what if we match their expectations?

Let $\pi^{r_\psi}$ be the optimal policy for $r_\psi$. We then pick $\psi$ such that
$$\mathbb{E}_{\pi^{r_\psi}}[f(s,a)] = \mathbb{E}_{\pi^*}[f(s,a)]$$
We can estimate the expert-side expectation by averaging over expert samples. But this is still ambiguous: many different $\psi$ vectors can produce the same feature expectations. One way to resolve the ambiguity is the max-margin principle (similar to SVM):
$$\max_{\psi, m} \; m \quad \text{s.t.} \quad \psi^{\top} \mathbb{E}_{\pi^*}[f(s,a)] \ge \max_{\pi \in \Pi} \psi^{\top} \mathbb{E}_\pi [f(s,a)] + m$$
This is a heuristic: it does not necessarily recover the true weights of the expert's reward function, but it is a reasonable one. We also need to weight the margin by the similarity between $\pi^*$ and $\pi$, because in a continuous policy space there will be policies very similar to the optimal one. Applying the "SVM trick", the problem becomes
$$\min_\psi \frac{1}{2} \|\psi\|^2 \quad \text{s.t.} \quad \psi^{\top} \mathbb{E}_{\pi^*}[f(s,a)] \ge \max_{\pi \in \Pi} \psi^\top \mathbb{E}_{\pi}[f(s,a)] + 1$$
Let's also add in a measure of the difference between policies:
$$\min_\psi \frac{1}{2} \|\psi\|^2 \quad \text{s.t.} \quad \psi^{\top} \mathbb{E}_{\pi^*}[f(s,a)] \ge \max_{\pi \in \Pi} \psi^\top \mathbb{E}_{\pi}[f(s,a)] + D(\pi, \pi^*)$$
Note: $D(\cdot)$ can be either a difference in feature expectations or a KL divergence.
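To make the optimization concrete, here is a hedged sketch of the max-margin QP for a small, finite candidate policy set $\Pi$ using cvxpy; the feature expectations and margins below are synthetic placeholders (constructed so the problem is feasible), not quantities from real demonstrations:

```python
import cvxpy as cp
import numpy as np

# Max-margin IRL as a QP over a small, finite candidate policy set (illustrative data only).
d, K = 4, 5                                   # feature dimension, number of candidate policies
rng = np.random.default_rng(0)
F_pi = rng.random((K, d))                     # E_pi[f] for each candidate policy (placeholder)
f_expert = F_pi.max(axis=0) + 0.1             # E_pi*[f]; constructed so the QP is feasible
D = np.ones(K)                                # margin D(pi, pi*), e.g. feature-expectation distance

psi = cp.Variable(d)
constraints = [psi @ f_expert >= psi @ F_pi[k] + D[k] for k in range(K)]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(psi)), constraints)
problem.solve()
print(psi.value)
```

In practice the candidate set would be grown by repeatedly solving the RL problem for the current $\psi$, and slack variables would be added to tolerate expert sub-optimality (as noted below).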
Issues:
- Maximizing the margin is a bit arbitrary ("find a reward function under which the expert's policy is clearly better than any other policy")
- No clear model of expert sub-optimality (slack variables can be added, just as in SVMs, to allow for it)
- Messy constrained optimization problem, which is not a good fit for deep learning
Abbeel & Ng: Apprenticeship learning via inverse reinforcement learning
Ratliff et al: Maximum margin planning
Optimal Control as a Model of Human Behavior

We want a model of $p(s_{1:T}, a_{1:T})$. What we know:
$$p(O_t \mid s_t, a_t) = \exp(r(s_t, a_t))$$
With this we can do optimality inference (what is the probability of a trajectory given optimality?):
$$\begin{split}
p(\tau \mid O_{1:T}) &= \frac{p(\tau, O_{1:T})}{p(O_{1:T})} \\
&\propto p(\tau) \prod_t \exp(r(s_t,a_t)) = p(\tau) \exp\Big(\sum_{t} r(s_t,a_t)\Big)
\end{split}$$

Learning the optimality variable: let $p(O_t \mid s_t, a_t, \psi) = \exp(r_\psi(s_t, a_t))$, so that
$$p(\tau \mid O_{1:T}, \psi) \propto \underbrace{p(\tau)}_{\text{independent of } \psi\text{, so it can be ignored}} \exp\Big(\sum_t r_\psi(s_t, a_t)\Big)$$

Maximum Likelihood Learning:
$$\max_{\psi} \frac{1}{N} \sum_{i=1}^N \log p(\tau_i \mid O_{1:T}, \psi) = \max_{\psi} \frac{1}{N} \sum_{i=1}^N r_\psi(\tau_i) - \log Z$$
Why do we subtract a $\log Z$ term here? Because if we only maximized rewards on the expert data, we could make the reward high everywhere, and the learned reward function would be essentially unusable.
$Z$ is called the partition function.

IRL Partition Function
$$Z = \int p(\tau) \exp(r_\psi(\tau)) \, d\tau$$
Wow! That looks like an integral over all possible trajectories, and we know this is probably intractable, but let's plug it into the maximum likelihood objective and see what happens anyway!
$$\begin{split}
\nabla_\psi L
&= \frac{1}{N} \sum_{i=1}^N \nabla_\psi r_\psi(\tau_i) - \frac{1}{Z} \int p(\tau) \exp(r_\psi(\tau)) \nabla_\psi r_\psi(\tau) \, d\tau \\
&= \mathbb{E}_{\tau \sim \pi^*(\tau)}[\nabla_\psi r_\psi(\tau)] - \mathbb{E}_{\tau \sim p(\tau \mid O_{1:T},\psi)}[\nabla_{\psi} r_\psi(\tau)]
\end{split}$$

OK, this looks reasonable:
- Increase rewards at data points we saw in the expert data
- Decrease rewards at data points under the soft optimal policy of our current reward estimate

Let's proceed and estimate the second expectation:
$$\begin{split}
\mathbb{E}_{\tau\sim p(\tau \mid O_{1:T},\psi)} [\nabla_\psi r_\psi (\tau)]
&= \mathbb{E}_{\tau\sim p(\tau \mid O_{1:T},\psi)} \Big[\nabla_\psi \sum_{t=1}^T r_\psi(s_t,a_t)\Big] \\
&= \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t \mid O_{1:T},\psi)}[\nabla_\psi r_\psi(s_t,a_t)]
\end{split}$$

Note:
$$\begin{split}
p(s_t,a_t \mid O_{1:T}, \psi)
&= \overbrace{p(a_t \mid s_t, O_{1:T}, \psi)}^{= \frac{\beta(s_t,a_t)}{\beta(s_t)}} \; \underbrace{p(s_t \mid O_{1:T},\psi)}_{\propto\, \alpha(s_t)\beta(s_t)} \\
&\propto \beta(s_t,a_t)\, \alpha(s_t)
\end{split}$$
Let $\mu_t(s_t,a_t) \propto \beta(s_t,a_t)\, \alpha(s_t)$.
$$\begin{split}
\mathbb{E}_{\tau\sim p(\tau \mid O_{1:T},\psi)} [\nabla_\psi r_\psi (\tau)]
&= \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t \mid O_{1:T},\psi)}[\nabla_\psi r_\psi(s_t,a_t)] \\
&= \sum_{t=1}^T \int\!\!\int \mu_t(s_t,a_t)\, \nabla_\psi r_\psi(s_t,a_t)\, ds_t\, da_t \\
&= \sum_{t=1}^T \vec{\mu}_t^{\,\top} \nabla_\psi \vec{r}_\psi
\end{split}$$
Note: this works for small, discrete state and action spaces, where we can compute $\vec{\mu}$ exactly.
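For intuition, here is a hedged sketch of this tabular procedure on a tiny synthetic MDP: backward messages $\beta$, forward messages $\alpha$, the visitations $\mu_t \propto \beta\,\alpha$, and the resulting gradient for a linear reward. All names, the toy dynamics, and the "expert" below are made up for illustration; a real implementation would also work in log space for numerical stability.

```python
import numpy as np

# Tabular MaxEnt IRL on a tiny synthetic MDP (all quantities below are illustrative placeholders).
S, A, T, D = 5, 2, 10, 3                      # states, actions, horizon, feature dimension
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s'] = p(s' | s, a)
feat = rng.normal(size=(S, A, D))             # f(s, a)

def visitations(psi):
    """mu[t, s, a] = p(s_t, a_t | O_{1:T}, psi), computed via backward/forward messages."""
    r = feat @ psi                            # r_psi[s, a]
    # Backward messages: beta_T(s,a) = exp(r); beta_t(s,a) = exp(r) * E_{s'}[beta_{t+1}(s')]
    beta_sa = np.exp(r)
    betas = [beta_sa]
    for _ in range(T - 1):
        beta_s = beta_sa.mean(axis=1)         # uniform action prior
        beta_sa = np.exp(r) * (P @ beta_s)    # (P @ beta_s)[s, a] = sum_s' P[s,a,s'] beta_s[s']
        betas.append(beta_sa)
    betas = betas[::-1]                       # betas[t] now corresponds to time step t
    # Forward messages: alpha_1(s) = p(s_1); alpha_{t+1}(s') = sum_{s,a} P[s,a,s'] exp(r) alpha(s) / |A|
    alpha = np.full(S, 1.0 / S)
    mu = np.empty((T, S, A))
    for t in range(T):
        m = betas[t] * alpha[:, None]         # mu_t(s, a) proportional to beta_t(s, a) alpha_t(s)
        mu[t] = m / m.sum()
        alpha = np.einsum('sap,sa,s->p', P, np.exp(r), alpha) / A
    return mu

def maxent_gradient(psi, expert_mu):
    """grad = E_expert[grad r] - E_soft-optimal[grad r]; for a linear reward, grad r = features."""
    mu = visitations(psi)
    return np.einsum('tsa,sad->d', expert_mu - mu, feat) / T

# Stand-in "expert": visitations induced by some hidden reward weights.
expert_mu = visitations(np.array([1.0, -0.5, 0.3]))
psi = np.zeros(D)
for _ in range(200):
    psi += 0.1 * maxent_gradient(psi, expert_mu)
```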
MaxEnt IRL (Ziebart et al. 2008: Maximum Entropy Inverse Reinforcement Learning)

This is nice because, when $r_\psi(s_t,a_t) = \psi^{\top} f(s_t,a_t)$, we can show that it optimizes $\max_{\psi} H(\pi^{r_\psi})$ subject to $\mathbb{E}_{\pi^{r_\psi}}[f] = \mathbb{E}_{\pi^*}[f]$: among all policies that match the expert's feature expectations, it picks the one with maximum entropy.
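To see where this comes from, here is a brief sketch of the standard Lagrangian argument at the trajectory level (not from the lecture; it ignores the dynamics prior $p(\tau)$, i.e. assumes deterministic dynamics, and writes $f(\tau) = \sum_t f(s_t,a_t)$):
$$\mathcal{L} = -\sum_\tau \pi(\tau) \log \pi(\tau) + \psi^{\top}\Big(\sum_\tau \pi(\tau) f(\tau) - \mathbb{E}_{\pi^*}[f(\tau)]\Big) + \lambda\Big(\sum_\tau \pi(\tau) - 1\Big)$$
Setting $\partial \mathcal{L}/\partial \pi(\tau) = -\log \pi(\tau) - 1 + \psi^{\top} f(\tau) + \lambda = 0$ gives $\pi(\tau) \propto \exp(\psi^{\top} f(\tau)) = \exp(r_\psi(\tau))$, which is exactly the soft-optimal trajectory distribution that the maximum likelihood procedure above fits.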
Extending to high-dimensional spaces

MaxEnt IRL so far requires
- Solving for the (soft) optimal policy in the inner loop
- Enumerating all state-action tuples for the visitation frequencies and the gradient

To apply this in practical problem settings, we need to handle
- Large and continuous state and action spaces
- States obtained only via sampling
- Unknown dynamics, under which the naive backward/forward message computation breaks down
Assume we can sample from the real environment
Idea:
Learn $p(a_t \mid s_t, O_{1:T}, \psi)$ using any max-ent RL algorithm, then run this policy to sample $\{\tau_j\}$:
$$\nabla_{\psi} L \approx \frac{1}{N} \sum_{i=1}^N \nabla_\psi r_\psi(\tau_i) - \frac{1}{M} \sum_{j=1}^M \nabla_\psi r_\psi(\tau_j)$$
This looks expensive. What if we use "lazy" policy optimization: instead of fully learning a policy every time, can we just optimize it a bit? Improve $p(a_t \mid s_t, O_{1:T}, \psi)$ with a few steps of any max-ent RL algorithm. But then the estimator is biased (samples come from the wrong distribution), so we use importance sampling:
$$\nabla_{\psi} L \approx \frac{1}{N} \sum_{i=1}^N \nabla_\psi r_\psi(\tau_i) - \frac{1}{\sum_j w_j} \sum_{j=1}^M w_j \nabla_\psi r_\psi(\tau_j)$$
Importance weights:
$$\begin{split}
w_j
&= \frac{p(\tau_j) \exp(r_\psi(\tau_j))}{\pi(\tau_j)} \\
&= \frac{p(s_1)\prod_t p(s_{t+1} \mid s_t,a_t) \exp(r_\psi(s_t,a_t))}{p(s_1) \prod_t p(s_{t+1} \mid s_t,a_t)\, \pi(a_t \mid s_t)} \\
&= \frac{\exp\big(\sum_t r_\psi(s_t,a_t)\big)}{\prod_t \pi(a_t \mid s_t)}
\end{split}$$
Each policy update with respect to $r_\psi$ brings better importance weights (as the policy approaches the soft optimal policy for $r_\psi$, the weights become constant and the estimate improves).
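A minimal numpy sketch of this importance-weighted gradient for a linear reward; the trajectory features and policy log-probabilities are assumed to be computed elsewhere, and all names are placeholders:

```python
import numpy as np

# Importance-sampled MaxEnt IRL gradient (guided-cost-learning style), for a linear reward.
def is_weighted_gradient(psi, expert_feats, sample_feats, sample_logpi):
    """
    expert_feats : (N, D) array, sum_t f(s_t, a_t) for each expert trajectory
    sample_feats : (M, D) array, sum_t f(s_t, a_t) for each trajectory from the current policy
    sample_logpi : (M,)   array, log prod_t pi(a_t | s_t) for each sampled trajectory
    """
    # w_j = exp(sum_t r_psi) / prod_t pi(a_t|s_t), computed in log space for stability
    log_w = sample_feats @ psi - sample_logpi
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                              # self-normalized importance weights
    # grad = E_expert[grad r_psi] - sum_j w_j grad r_psi(tau_j); for a linear reward, grad r = features
    return expert_feats.mean(axis=0) - w @ sample_feats
```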
IRL and GANs

The IRL algorithm described above looks like a game: the policy tries to look as good as the expert, while the reward tries to distinguish the expert from the policy.
Generative Adversarial Networks (GANs)

Inverse RL as a GAN (Finn*, Christiano* et al., "A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models")

In a GAN, the optimal discriminator is
$$D^*(x) = \frac{p^*(x)}{p_\theta(x) + p^*(x)},$$
i.e. the ratio of the real density to the sum of real and generated densities once the discriminator has converged. For IRL, we know the optimal policy approaches
$$\pi_\theta(\tau) \propto p(\tau) \exp(r_\psi(\tau)).$$
So we parameterize the IRL discriminator as:
$$\begin{split}
D_\psi(\tau)
&= \frac{p(\tau) \frac{1}{Z} \exp(r(\tau))}{p_\theta(\tau) + p(\tau) \frac{1}{Z}\exp(r(\tau))} \\
&= \frac{p(\tau)\, Z^{-1} \exp(r(\tau))}{p(\tau)\prod_t \pi_\theta(a_t \mid s_t) + p(\tau)\, Z^{-1} \exp(r(\tau))} \\
&= \frac{Z^{-1} \exp(r(\tau))}{\prod_t \pi_\theta(a_t \mid s_t) + Z^{-1}\exp(r(\tau))}
\end{split}$$
We optimize $D_\psi$ with respect to $\psi$.
Note that we no longer need importance weights: they are subsumed into $Z$.
$$\psi \leftarrow \arg\max_{\psi}\; \mathbb{E}_{\tau \sim p^*}[\log D_{\psi}(\tau)] + \mathbb{E}_{\tau \sim \pi_\theta}[\log (1-D_\psi(\tau))]$$
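Here is a hedged sketch of one such discriminator update in PyTorch. The function names and arguments are assumptions: `expert_r`/`policy_r` are outputs of a learnable reward model summing $r_\psi$ over each trajectory, `log_Z` is a learnable scalar included in the optimizer, and `*_logpi` is $\log \prod_t \pi_\theta(a_t \mid s_t)$ for each trajectory.

```python
import torch
import torch.nn.functional as F

# One discriminator update for GAN-style IRL (in the spirit of Finn*, Christiano* et al.); a sketch.
def discriminator_step(optimizer, log_Z, expert_r, expert_logpi, policy_r, policy_logpi):
    # D_psi(tau) = sigmoid( r_psi(tau) - log Z - log prod_t pi_theta(a_t|s_t) ),
    # which equals Z^{-1} exp(r) / (prod_t pi_theta + Z^{-1} exp(r)).
    expert_logits = expert_r - log_Z - expert_logpi
    policy_logits = policy_r - log_Z - policy_logpi
    # Maximize E_expert[log D] + E_policy[log(1 - D)] by minimizing the binary cross-entropy.
    loss = (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
            + F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```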
What if, instead of using
$$D_\psi(\tau) = \frac{Z^{-1} \exp(r(\tau))}{\prod_t \pi_\theta(a_t \mid s_t) + Z^{-1}\exp(r(\tau))},$$
we just use a standard binary neural net classifier?
- Often simpler to set up the optimization; fewer moving parts
- But the discriminator knows nothing at convergence, so we generally cannot re-optimize the "reward"

Suggested Reading

Classic Papers:
- Abbeel & Ng, ICML '04. Apprenticeship Learning via Inverse Reinforcement Learning. A good introduction to inverse reinforcement learning.
- Ziebart et al., AAAI '08. Maximum Entropy Inverse Reinforcement Learning. Introduction to the probabilistic method for inverse reinforcement learning.

Modern Papers:
- Finn et al., ICML '16. Guided Cost Learning. Sampling-based method for MaxEnt IRL that handles unknown dynamics and deep reward functions.
- Wulfmeier et al., arXiv '16. Deep Maximum Entropy Inverse Reinforcement Learning. MaxEnt inverse RL using deep reward functions.
- Ho & Ermon, NIPS '16. Generative Adversarial Imitation Learning. Inverse RL method using generative adversarial networks.
- Fu, Luo, Levine, ICLR '18. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning.