Remember
$$
p(O_{t+1:T} \mid s_{t+1}) = \int p(O_{t+1:T} \mid s_{t+1}, a_{t+1})\, p(a_{t+1} \mid s_{t+1})\, da_{t+1}
$$
We've defined
$$
V_t(s_t) = \log \beta_t(s_t), \qquad Q_t(s_t, a_t) = \log \beta_t(s_t, a_t)
$$
We assumed the action prior is uniform, but what if it is not?
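If it is not, we can just redo the marginalization over actions (at time $t$, using the definitions above):
$$
\exp\big(V_t(s_t)\big) = \beta_t(s_t) = \int \beta_t(s_t, a_t)\, p(a_t \mid s_t)\, da_t = \int \exp\big(Q_t(s_t, a_t) + \log p(a_t \mid s_t)\big)\, da_t
$$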
Taking the log of both sides (and noting that the $Q$ backup does not involve the action prior at all, so it is unchanged):
$$
V(s_t) = \log \int \exp\big(Q(s_t, a_t) + \log p(a_t \mid s_t)\big)\, da_t, \qquad Q(s_t, a_t) = r(s_t, a_t) + \log E\big[\exp\big(V(s_{t+1})\big)\big]
$$
Now let
$$
\tilde{Q}(s_t, a_t) = r(s_t, a_t) + \log p(a_t \mid s_t) + \log E\big[\exp\big(V(s_{t+1})\big)\big]
$$
so that
$$
V(s_t) = \log \int \exp\big(\tilde{Q}(s_t, a_t)\big)\, da_t = \log \int \exp\big(Q(s_t, a_t) + \log p(a_t \mid s_t)\big)\, da_t
$$
Oh! So simply modifying the reward to $\tilde{r}(s_t, a_t) = r(s_t, a_t) + \log p(a_t \mid s_t)$ recovers the uniform-prior form of the $V$ and $Q$ backups: a non-uniform action prior can always be folded into the reward.
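Here's a minimal numerical sketch of that equivalence (my own illustration, not part of the original derivation): a small tabular setting with made-up sizes, where the $V$ backup is computed once with an explicit action prior and once with $\log p(a \mid s)$ folded into the reward.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3  # arbitrary illustrative sizes

r = rng.normal(size=(n_states, n_actions))                            # r(s, a)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))      # p(s' | s, a)
log_prior = np.log(rng.dirichlet(np.ones(n_actions), size=n_states))  # log p(a | s)
V_next = rng.normal(size=n_states)                                    # V(s_{t+1}), assumed given

# Backup with an explicit (non-uniform) action prior:
#   Q(s,a) = r(s,a) + log E[exp V(s')],  V(s) = log sum_a exp(Q(s,a) + log p(a|s))
Q = r + logsumexp(np.log(P) + V_next, axis=-1)
V_with_prior = logsumexp(Q + log_prior, axis=-1)

# Same backup with the prior folded into the reward, r~(s,a) = r(s,a) + log p(a|s):
#   Q~(s,a) = r~(s,a) + log E[exp V(s')],  V(s) = log sum_a exp(Q~(s,a))
Q_tilde = (r + log_prior) + logsumexp(np.log(P) + V_next, axis=-1)
V_folded = logsumexp(Q_tilde, axis=-1)

print(np.allclose(V_with_prior, V_folded))  # True
```

The two agree exactly because $\tilde{Q}(s, a) = Q(s, a) + \log p(a \mid s)$ term by term, which is just the identity above.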