So far, we have hand-designed reward functions to define tasks. What if we want to learn the reward function by observing an expert, and then run reinforcement learning on the learned reward? The idea is to apply the approximate-optimality model (control as inference) from last time, but now learn the reward.
Inverse Reinforcement Learning: infer reward functions from demonstrations
Standard Imitation Learning
- Copy the actions performed by the expert
- No reasoning about the outcomes of actions

Human Imitation Learning
- Copy the intent of the expert
- Might take very different actions!
Problem: many reward functions can explain the same behavior
(Figure: one observed behavior alongside several possible reward functions that could explain it.)

Reward Parameterization
Traditional linear formulation: $r_\psi(s,a) = \sum_i \psi_i f_i(s,a) = \psi^{\top} f(s,a)$, where $f(s,a)$ is a feature function.
Neural net formulation: $r_\psi(s,a)$ is parameterized by a neural network with weights $\psi$.
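As a concrete illustration, here is a minimal numpy sketch of the two parameterizations; the network shape, function names, and random placeholder data are assumptions for the example, not anything specified in the lecture:

```python
import numpy as np

# Minimal sketch of the two reward parameterizations (illustrative names and shapes only).
def linear_reward(psi, f_sa):
    """r_psi(s,a) = psi^T f(s,a), where f_sa = f(s,a) is a hand-designed feature vector."""
    return psi @ f_sa

def neural_net_reward(params, s, a):
    """r_psi(s,a) given by a tiny two-layer MLP; params plays the role of psi."""
    W1, b1, W2, b2 = params
    x = np.concatenate([s, a])
    h = np.tanh(W1 @ x + b1)
    return float(W2 @ h + b2)

# Example usage with random placeholder weights and features.
d_s, d_a, d_f, d_h = 3, 2, 4, 16
rng = np.random.default_rng(0)
psi = rng.normal(size=d_f)
params = (rng.normal(size=(d_h, d_s + d_a)), np.zeros(d_h), rng.normal(size=d_h), 0.0)
print(linear_reward(psi, rng.normal(size=d_f)))
print(neural_net_reward(params, rng.normal(size=d_s), rng.normal(size=d_a)))
```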
Feature Matching IRL

With a linear reward function $r_\psi(s,a) = \sum_i \psi_i f_i(s,a) = \psi^{\top} f(s,a)$: if the features $f$ are important, what if we match their expectations?

Let $\pi^{r_\psi}$ be the optimal policy for $r_\psi$. We then pick $\psi$ such that
$$\mathbb{E}_{\pi^{r_\psi}}[f(s,a)] = \mathbb{E}_{\pi^*}[f(s,a)]$$
We can estimate the expert-side expectation by averaging over expert samples. But this is still ambiguous: many different $\psi$ vectors can produce the same feature expectations. One way to resolve the ambiguity is the max-margin principle (similar to SVM):
$$\max_{\psi, m} \; m \quad \text{s.t.} \quad \psi^{\top} \mathbb{E}_{\pi^*}[f(s,a)] \ge \max_{\pi \in \Pi} \psi^{\top} \mathbb{E}_\pi [f(s,a)] + m$$
This is a heuristic: it does not necessarily recover the true weights of the expert's reward function, but it is a reasonable one. We also need to weight the margin by the similarity between $\pi^*$ and $\pi$, because in a continuous policy space there will be policies very similar to the optimal one. Applying the "SVM trick", the problem becomes
$$\min_\psi \frac{1}{2} \|\psi\|^2 \quad \text{s.t.} \quad \psi^{\top} \mathbb{E}_{\pi^*}[f(s,a)] \ge \max_{\pi \in \Pi} \psi^\top \mathbb{E}_{\pi}[f(s,a)] + 1$$
Let's also add in a measure of the difference between policies:
$$\min_\psi \frac{1}{2} \|\psi\|^2 \quad \text{s.t.} \quad \psi^{\top} \mathbb{E}_{\pi^*}[f(s,a)] \ge \max_{\pi \in \Pi} \psi^\top \mathbb{E}_{\pi}[f(s,a)] + D(\pi, \pi^*)$$
Note: $D(\cdot)$ can be either a difference in feature expectations or a KL divergence.
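To make the optimization concrete, here is a hedged sketch of the max-margin QP for a small, finite candidate policy set $\Pi$ using cvxpy; the feature expectations and margins below are synthetic placeholders (constructed so the problem is feasible), not quantities from real demonstrations:

```python
import cvxpy as cp
import numpy as np

# Max-margin IRL as a QP over a small, finite candidate policy set (illustrative data only).
d, K = 4, 5                                   # feature dimension, number of candidate policies
rng = np.random.default_rng(0)
F_pi = rng.random((K, d))                     # E_pi[f] for each candidate policy (placeholder)
f_expert = F_pi.max(axis=0) + 0.1             # E_pi*[f]; constructed so the QP is feasible
D = np.ones(K)                                # margin D(pi, pi*), e.g. feature-expectation distance

psi = cp.Variable(d)
constraints = [psi @ f_expert >= psi @ F_pi[k] + D[k] for k in range(K)]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(psi)), constraints)
problem.solve()
print(psi.value)
```

In practice the candidate set would be grown by repeatedly solving the RL problem for the current $\psi$, and slack variables would be added to tolerate expert sub-optimality (as noted below).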
Issues:
- Maximizing the margin is a bit arbitrary ("find a reward function under which the expert's policy is clearly better than any other policy")
- No clear model of expert sub-optimality (slack variables can be added, just as in SVMs, to allow for it)
- Messy constrained optimization problem, which is not a good fit for deep learning
Abbeel & Ng: Apprenticeship learning via inverse reinforcement learning
Ratliff et al: Maximum margin planning
Optimal Control as a Model of Human Behavior

We want a model of $p(s_{1:T}, a_{1:T})$. What we know:
$$p(O_t \mid s_t, a_t) = \exp(r(s_t, a_t))$$
With this we can do optimality inference (what is the probability of a trajectory given optimality?):
$$\begin{split}
p(\tau \mid O_{1:T}) &= \frac{p(\tau, O_{1:T})}{p(O_{1:T})} \\
&\propto p(\tau) \prod_t \exp(r(s_t,a_t)) = p(\tau) \exp\Big(\sum_{t} r(s_t,a_t)\Big)
\end{split}$$

Learning the optimality variable: let $p(O_t \mid s_t, a_t, \psi) = \exp(r_\psi(s_t, a_t))$, so that
$$p(\tau \mid O_{1:T}, \psi) \propto \underbrace{p(\tau)}_{\text{independent of } \psi\text{, so it can be ignored}} \exp\Big(\sum_t r_\psi(s_t, a_t)\Big)$$

Maximum Likelihood Learning:
$$\max_{\psi} \frac{1}{N} \sum_{i=1}^N \log p(\tau_i \mid O_{1:T}, \psi) = \max_{\psi} \frac{1}{N} \sum_{i=1}^N r_\psi(\tau_i) - \log Z$$
Why do we subtract a $\log Z$ term here? Because if we only maximized rewards on the expert data, we could make the reward high everywhere, and the learned reward function would be essentially unusable.
$Z$ is called the partition function.

IRL Partition Function
$$Z = \int p(\tau) \exp(r_\psi(\tau)) \, d\tau$$
Wow! That looks like an integral over all possible trajectories, and we know this is probably intractable, but let's plug it into the maximum likelihood objective and see what happens anyway!
$$\begin{split}
\nabla_\psi L
&= \frac{1}{N} \sum_{i=1}^N \nabla_\psi r_\psi(\tau_i) - \frac{1}{Z} \int p(\tau) \exp(r_\psi(\tau)) \nabla_\psi r_\psi(\tau) \, d\tau \\
&= \mathbb{E}_{\tau \sim \pi^*(\tau)}[\nabla_\psi r_\psi(\tau)] - \mathbb{E}_{\tau \sim p(\tau \mid O_{1:T},\psi)}[\nabla_{\psi} r_\psi(\tau)]
\end{split}$$

OK, this looks reasonable:
- Increase rewards at data points we saw in the expert data
- Decrease rewards at data points under the soft optimal policy of our current reward estimate

Let's proceed and estimate the second expectation:
$$\begin{split}
\mathbb{E}_{\tau\sim p(\tau \mid O_{1:T},\psi)} [\nabla_\psi r_\psi (\tau)]
&= \mathbb{E}_{\tau\sim p(\tau \mid O_{1:T},\psi)} \Big[\nabla_\psi \sum_{t=1}^T r_\psi(s_t,a_t)\Big] \\
&= \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t \mid O_{1:T},\psi)}[\nabla_\psi r_\psi(s_t,a_t)]
\end{split}$$

Note:
$$\begin{split}
p(s_t,a_t \mid O_{1:T}, \psi)
&= \overbrace{p(a_t \mid s_t, O_{1:T}, \psi)}^{= \frac{\beta(s_t,a_t)}{\beta(s_t)}} \; \underbrace{p(s_t \mid O_{1:T},\psi)}_{\propto\, \alpha(s_t)\beta(s_t)} \\
&\propto \beta(s_t,a_t)\, \alpha(s_t)
\end{split}$$
Let $\mu_t(s_t,a_t) \propto \beta(s_t,a_t)\, \alpha(s_t)$.
$$\begin{split}
\mathbb{E}_{\tau\sim p(\tau \mid O_{1:T},\psi)} [\nabla_\psi r_\psi (\tau)]
&= \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t \mid O_{1:T},\psi)}[\nabla_\psi r_\psi(s_t,a_t)] \\
&= \sum_{t=1}^T \int\!\!\int \mu_t(s_t,a_t)\, \nabla_\psi r_\psi(s_t,a_t)\, ds_t\, da_t \\
&= \sum_{t=1}^T \vec{\mu}_t^{\,\top} \nabla_\psi \vec{r}_\psi
\end{split}$$
Note: this works for small, discrete state and action spaces, where we can compute $\vec{\mu}$ exactly.
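For intuition, here is a hedged sketch of this tabular procedure on a tiny synthetic MDP: backward messages $\beta$, forward messages $\alpha$, the visitations $\mu_t \propto \beta\,\alpha$, and the resulting gradient for a linear reward. All names, the toy dynamics, and the "expert" below are made up for illustration; a real implementation would also work in log space for numerical stability.

```python
import numpy as np

# Tabular MaxEnt IRL on a tiny synthetic MDP (all quantities below are illustrative placeholders).
S, A, T, D = 5, 2, 10, 3                      # states, actions, horizon, feature dimension
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))    # P[s, a, s'] = p(s' | s, a)
feat = rng.normal(size=(S, A, D))             # f(s, a)

def visitations(psi):
    """mu[t, s, a] = p(s_t, a_t | O_{1:T}, psi), computed via backward/forward messages."""
    r = feat @ psi                            # r_psi[s, a]
    # Backward messages: beta_T(s,a) = exp(r); beta_t(s,a) = exp(r) * E_{s'}[beta_{t+1}(s')]
    beta_sa = np.exp(r)
    betas = [beta_sa]
    for _ in range(T - 1):
        beta_s = beta_sa.mean(axis=1)         # uniform action prior
        beta_sa = np.exp(r) * (P @ beta_s)    # (P @ beta_s)[s, a] = sum_s' P[s,a,s'] beta_s[s']
        betas.append(beta_sa)
    betas = betas[::-1]                       # betas[t] now corresponds to time step t
    # Forward messages: alpha_1(s) = p(s_1); alpha_{t+1}(s') = sum_{s,a} P[s,a,s'] exp(r) alpha(s) / |A|
    alpha = np.full(S, 1.0 / S)
    mu = np.empty((T, S, A))
    for t in range(T):
        m = betas[t] * alpha[:, None]         # mu_t(s, a) proportional to beta_t(s, a) alpha_t(s)
        mu[t] = m / m.sum()
        alpha = np.einsum('sap,sa,s->p', P, np.exp(r), alpha) / A
    return mu

def maxent_gradient(psi, expert_mu):
    """grad = E_expert[grad r] - E_soft-optimal[grad r]; for a linear reward, grad r = features."""
    mu = visitations(psi)
    return np.einsum('tsa,sad->d', expert_mu - mu, feat) / T

# Stand-in "expert": visitations induced by some hidden reward weights.
expert_mu = visitations(np.array([1.0, -0.5, 0.3]))
psi = np.zeros(D)
for _ in range(200):
    psi += 0.1 * maxent_gradient(psi, expert_mu)
```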
MaxEnt IRL (Ziebart et al. 2008: Maximum Entropy Inverse Reinforcement Learning)

This is nice because, when $r_\psi(s_t,a_t) = \psi^{\top} f(s_t,a_t)$, we can show that it optimizes $\max_{\psi} H(\pi^{r_\psi})$ subject to $\mathbb{E}_{\pi^{r_\psi}}[f] = \mathbb{E}_{\pi^*}[f]$: among all policies that match the expert's feature expectations, it picks the one with maximum entropy.
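To see where this comes from, here is a brief sketch of the standard Lagrangian argument at the trajectory level (not from the lecture; it ignores the dynamics prior $p(\tau)$, i.e. assumes deterministic dynamics, and writes $f(\tau) = \sum_t f(s_t,a_t)$):
$$\mathcal{L} = -\sum_\tau \pi(\tau) \log \pi(\tau) + \psi^{\top}\Big(\sum_\tau \pi(\tau) f(\tau) - \mathbb{E}_{\pi^*}[f(\tau)]\Big) + \lambda\Big(\sum_\tau \pi(\tau) - 1\Big)$$
Setting $\partial \mathcal{L}/\partial \pi(\tau) = -\log \pi(\tau) - 1 + \psi^{\top} f(\tau) + \lambda = 0$ gives $\pi(\tau) \propto \exp(\psi^{\top} f(\tau)) = \exp(r_\psi(\tau))$, which is exactly the soft-optimal trajectory distribution that the maximum likelihood procedure above fits.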
Extending to high-dimensional spaces

MaxEnt IRL so far requires
- Solving for the (soft) optimal policy in the inner loop
- Enumerating all state-action tuples for the visitation frequencies and the gradient

To apply this in practical problem settings, we need to handle
- Large and continuous state and action spaces
- States obtained only via sampling
- Unknown dynamics, under which the naive backward/forward message computation breaks down
Assume we can sample from the real environment
Idea:
Learn $p(a_t \mid s_t, O_{1:T}, \psi)$ using any max-ent RL algorithm, then run this policy to sample $\{\tau_j\}$:
$$\nabla_{\psi} L \approx \frac{1}{N} \sum_{i=1}^N \nabla_\psi r_\psi(\tau_i) - \frac{1}{M} \sum_{j=1}^M \nabla_\psi r_\psi(\tau_j)$$
This looks expensive. What if we use "lazy" policy optimization: instead of fully learning a policy every time, can we just optimize it a bit? Improve $p(a_t \mid s_t, O_{1:T}, \psi)$ with a few steps of any max-ent RL algorithm. But then the estimator is biased (samples come from the wrong distribution), so we use importance sampling:
$$\nabla_{\psi} L \approx \frac{1}{N} \sum_{i=1}^N \nabla_\psi r_\psi(\tau_i) - \frac{1}{\sum_j w_j} \sum_{j=1}^M w_j \nabla_\psi r_\psi(\tau_j)$$
Importance weights:
$$\begin{split}
w_j
&= \frac{p(\tau_j) \exp(r_\psi(\tau_j))}{\pi(\tau_j)} \\
&= \frac{p(s_1)\prod_t p(s_{t+1} \mid s_t,a_t) \exp(r_\psi(s_t,a_t))}{p(s_1) \prod_t p(s_{t+1} \mid s_t,a_t)\, \pi(a_t \mid s_t)} \\
&= \frac{\exp\big(\sum_t r_\psi(s_t,a_t)\big)}{\prod_t \pi(a_t \mid s_t)}
\end{split}$$
Each policy update with respect to $r_\psi$ brings better importance weights (as the policy approaches the soft optimal policy for $r_\psi$, the weights become constant and the estimate improves).
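A minimal numpy sketch of this importance-weighted gradient for a linear reward; the trajectory features and policy log-probabilities are assumed to be computed elsewhere, and all names are placeholders:

```python
import numpy as np

# Importance-sampled MaxEnt IRL gradient (guided-cost-learning style), for a linear reward.
def is_weighted_gradient(psi, expert_feats, sample_feats, sample_logpi):
    """
    expert_feats : (N, D) array, sum_t f(s_t, a_t) for each expert trajectory
    sample_feats : (M, D) array, sum_t f(s_t, a_t) for each trajectory from the current policy
    sample_logpi : (M,)   array, log prod_t pi(a_t | s_t) for each sampled trajectory
    """
    # w_j = exp(sum_t r_psi) / prod_t pi(a_t|s_t), computed in log space for stability
    log_w = sample_feats @ psi - sample_logpi
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                              # self-normalized importance weights
    # grad = E_expert[grad r_psi] - sum_j w_j grad r_psi(tau_j); for a linear reward, grad r = features
    return expert_feats.mean(axis=0) - w @ sample_feats
```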
IRL and GANs

The IRL algorithm described above looks like a game: the policy tries to look as good as the expert, while the reward tries to distinguish the expert from the policy.
Generative Adversarial Networks (GANs)

Inverse RL as a GAN (Finn*, Christiano* et al., "A Connection Between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models")

In a GAN, the optimal discriminator is
$$D^*(x) = \frac{p^*(x)}{p_\theta(x) + p^*(x)},$$
i.e. the ratio of the real density to the sum of real and generated densities once the discriminator has converged. For IRL, we know the optimal policy approaches
$$\pi_\theta(\tau) \propto p(\tau) \exp(r_\psi(\tau)).$$
So we parameterize the IRL discriminator as:
$$\begin{split}
D_\psi(\tau)
&= \frac{p(\tau) \frac{1}{Z} \exp(r(\tau))}{p_\theta(\tau) + p(\tau) \frac{1}{Z}\exp(r(\tau))} \\
&= \frac{p(\tau)\, Z^{-1} \exp(r(\tau))}{p(\tau)\prod_t \pi_\theta(a_t \mid s_t) + p(\tau)\, Z^{-1} \exp(r(\tau))} \\
&= \frac{Z^{-1} \exp(r(\tau))}{\prod_t \pi_\theta(a_t \mid s_t) + Z^{-1}\exp(r(\tau))}
\end{split}$$
We optimize $D_\psi$ with respect to $\psi$.
Note that we no longer need importance weights: they are subsumed into $Z$.
$$\psi \leftarrow \arg\max_{\psi}\; \mathbb{E}_{\tau \sim p^*}[\log D_{\psi}(\tau)] + \mathbb{E}_{\tau \sim \pi_\theta}[\log (1-D_\psi(\tau))]$$
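Here is a hedged sketch of one such discriminator update in PyTorch. The function names and arguments are assumptions: `expert_r`/`policy_r` are outputs of a learnable reward model summing $r_\psi$ over each trajectory, `log_Z` is a learnable scalar included in the optimizer, and `*_logpi` is $\log \prod_t \pi_\theta(a_t \mid s_t)$ for each trajectory.

```python
import torch
import torch.nn.functional as F

# One discriminator update for GAN-style IRL (in the spirit of Finn*, Christiano* et al.); a sketch.
def discriminator_step(optimizer, log_Z, expert_r, expert_logpi, policy_r, policy_logpi):
    # D_psi(tau) = sigmoid( r_psi(tau) - log Z - log prod_t pi_theta(a_t|s_t) ),
    # which equals Z^{-1} exp(r) / (prod_t pi_theta + Z^{-1} exp(r)).
    expert_logits = expert_r - log_Z - expert_logpi
    policy_logits = policy_r - log_Z - policy_logpi
    # Maximize E_expert[log D] + E_policy[log(1 - D)] by minimizing the binary cross-entropy.
    loss = (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
            + F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```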
What if, instead of using
$$D_\psi(\tau) = \frac{Z^{-1} \exp(r(\tau))}{\prod_t \pi_\theta(a_t \mid s_t) + Z^{-1}\exp(r(\tau))},$$
we just use a standard binary neural net classifier?
- Often simpler to set up the optimization; fewer moving parts
- But the discriminator knows nothing at convergence, so we generally cannot re-optimize the "reward"

Suggested Reading

Classic Papers:
- Abbeel & Ng, ICML '04. Apprenticeship Learning via Inverse Reinforcement Learning. A good introduction to inverse reinforcement learning.
- Ziebart et al., AAAI '08. Maximum Entropy Inverse Reinforcement Learning. Introduction to the probabilistic method for inverse reinforcement learning.

Modern Papers:
- Finn et al., ICML '16. Guided Cost Learning. Sampling-based method for MaxEnt IRL that handles unknown dynamics and deep reward functions.
- Wulfmeier et al., arXiv '16. Deep Maximum Entropy Inverse Reinforcement Learning. MaxEnt inverse RL using deep reward functions.
- Ho & Ermon, NIPS '16. Generative Adversarial Imitation Learning. Inverse RL method using generative adversarial networks.
- Fu, Luo, Levine, ICLR '18. Learning Robust Rewards with Adversarial Inverse Reinforcement Learning.