Action Prior in Backward Pass of Control as Inference in RL (1)

Remember

$$p(O_{t+1:T} \mid s_{t+1}) = \int p(O_{t+1:T} \mid s_{t+1}, a_{t+1}) \, p(a_{t+1} \mid s_{t+1}) \, da_{t+1}$$

We've defined

$$V_t(s_t) = \log \beta_t(s_t)$$

$$Q_t(s_t, a_t) = \log \beta_t(s_t, a_t)$$

We assumed the action prior is uniform, but what if it is not? Then

$$V(s_t) = \log \int \exp\big(Q(s_t, a_t) + \log p(a_t \mid s_t)\big) \, da_t$$

$$Q(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}\big[\exp(V(s_{t+1}))\big]$$

Now let

$$\tilde{Q}(s_t, a_t) = r(s_t, a_t) + \log p(a_t \mid s_t) + \log \mathbb{E}\big[\exp(V(s_{t+1}))\big]$$

so that

$$V(s_t) = \log \int \exp\big(\tilde{Q}(s_t, a_t)\big) \, da_t = \log \int \exp\big(Q(s_t, a_t) + \log p(a_t \mid s_t)\big) \, da_t$$

Oh! Now we've seen that with a modification to the reward function, namely replacing $r(s_t, a_t)$ with $r(s_t, a_t) + \log p(a_t \mid s_t)$, we can recover $V$ and $Q$ in the same form as under a uniform action prior, even when the action prior is not uniform.
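To check this numerically, here is a minimal sketch in Python (NumPy/SciPy). The small finite-horizon tabular MDP and the names `P`, `R`, and `prior` are illustrative assumptions, not from the post; the backup equations are the ones above. It runs the soft backward pass twice, once with the non-uniform prior kept explicitly in the $V$-backup and once with the prior folded into the reward as $\tilde{r}(s_t, a_t) = r(s_t, a_t) + \log p(a_t \mid s_t)$, and confirms both give the same $V$.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Hypothetical finite-horizon tabular MDP (illustrative, not from the post).
S, A, T = 4, 3, 5                                   # states, actions, horizon
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s'] = p(s' | s, a)
R = rng.normal(size=(S, A))                         # r(s, a)
prior = rng.dirichlet(np.ones(A), size=S)           # non-uniform action prior p(a | s)
log_prior = np.log(prior)

def backward_pass(reward, log_pa):
    """Soft backward pass:
       Q(s, a) = reward(s, a) + log E_{s'}[exp(V(s'))]
       V(s)    = log sum_a exp(Q(s, a) + log_pa(s, a))
    """
    V = np.zeros(S)                                  # V at the final step
    for _ in range(T):
        Q = reward + np.log(P @ np.exp(V))           # log E[exp(V(s'))] under p(s' | s, a)
        V = logsumexp(Q + log_pa, axis=1)
    return V

# (1) Non-uniform prior kept explicitly inside the V-backup.
V_explicit = backward_pass(R, log_prior)

# (2) Prior folded into the reward, r~ = r + log p(a|s), then a "uniform-style" backup.
V_folded = backward_pass(R + log_prior, np.zeros((S, A)))

print(np.allclose(V_explicit, V_folded))             # True: the two recursions agree
```

Because the prior only ever enters through the sum $Q(s_t, a_t) + \log p(a_t \mid s_t)$, folding it into the reward and keeping it inside the backup are the same computation, which is exactly the point of the derivation.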