We are actually trying to learn a distribution over actions $a$ given the observation $o$, where $\theta$ denotes the parameters of the policy $\pi$. We add a subscript $t$ to each term to denote the index in the time series, since in RL actions at different time steps are usually correlated.
Sometimes instead of $\pi_\theta(a_t \mid o_t)$ we will see $\pi_\theta(a_t \mid s_t)$.
Difference: we can use the observation to infer the state, but sometimes (in this example, the car in front of the cheetah) the observation does not give enough information to infer the state.
The state is the true configuration of the system; the observation is something that results from the state and may or may not be enough to deduce the state.
The Markov property is a very important property ⇒ if you want to change the future, you only need to focus on the current state and your current action; past states are irrelevant given the present.
However, is your future state independent of your past observations? In general, no: observations do not satisfy the Markov property.
Many RL algorithms in this course require the Markov property.
Side note on notation
| RL world (Richard Bellman) | Robotics (Lev Pontryagin) |
| --- | --- |
| $s_t$ (state) | $x_t$ (state) |
| $a_t$ (action) | $u_t$ (action) |
| $r(s, a)$ (reward) | $c(x, u)$ (cost), $c(x, u) = -r(x, u)$ |
In general, we want to minimize:
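A sketch of this objective in the optimal-control notation (here $f$ denotes the dynamics, an assumption of this sketch):

$$\min_{u_1,\dots,u_T} \; \sum_{t=1}^{T} c(x_t, u_t) \quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t)$$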
We can write the function as either $c$ or $r$, which stands for either a (minimized) cost or a (maximized) reward.
Definition of Markov Chain
The transition “operator”?
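A sketch of the standard definition, assuming a discrete state space:

$$\mathcal{M} = \{\mathcal{S}, \mathcal{T}\}, \qquad \mu_{t,i} = p(s_t = i), \qquad \mathcal{T}_{ij} = p(s_{t+1} = i \mid s_t = j), \qquad \mu_{t+1} = \mathcal{T}\mu_t.$$

$\mathcal{T}$ is called an "operator" because, once the state probabilities are stacked into a vector $\mu_t$, it acts linearly on that vector to produce the next-step distribution.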
Definition of Markov Decision Process
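Similarly, a sketch of the standard MDP definition (discrete case):

$$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}, \qquad \mathcal{T}_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k), \qquad r: \mathcal{S}\times\mathcal{A} \to \mathbb{R}.$$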
Partially Observed Markov Decision Process
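A sketch of the standard POMDP definition, which adds an observation space and an emission probability:

$$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r\}, \qquad \mathcal{E}: \; p(o_t \mid s_t).$$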
Goal of RL (Finite Horizon / Episodic)
So our goal is to maximize the expectation of rewards
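Concretely, with trajectories $\tau = (s_1, a_1, \dots, s_T, a_T)$, the episodic objective can be written as:

$$p_\theta(\tau) = p(s_1)\prod_{t=1}^{T}\pi_\theta(a_t \mid s_t)\, p(s_{t+1}\mid s_t, a_t), \qquad \theta^\star = \arg\max_\theta \; E_{\tau\sim p_\theta(\tau)}\Big[\sum_{t} r(s_t, a_t)\Big].$$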
The Markov chain can be described using:
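A sketch of the standard construction: group the state and action into an augmented state $(s_t, a_t)$; the transition of this augmented chain is then

$$p\big((s_{t+1}, a_{t+1}) \mid (s_t, a_t)\big) = p(s_{t+1}\mid s_t, a_t)\,\pi_\theta(a_{t+1}\mid s_{t+1}).$$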
So now we can restate the goal in terms of this Markov chain's state–action marginals.
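Written in terms of the state–action marginal $p_\theta(s_t, a_t)$, the objective becomes:

$$\theta^\star = \arg\max_\theta \; \sum_{t=1}^{T} E_{(s_t, a_t)\sim p_\theta(s_t, a_t)}\big[r(s_t, a_t)\big].$$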
Goal of RL (Infinite Horizon)
What if $T = \infty$? (Infinite horizon)
We need some way to make the objective finite:
Use average reward: $\theta^\star = \arg\max_\theta \frac{1}{T} E_{\tau\sim p_\theta(\tau)}\big[\sum_t r(s_t, a_t)\big]$
Use discounts
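For reference, one standard form of the discounted objective (indexing from $t = 1$, with discount factor $\gamma \in (0,1)$ making the infinite sum finite):

$$\theta^\star = \arg\max_\theta \; E_{\tau\sim p_\theta(\tau)}\Big[\sum_{t=1}^{\infty} \gamma^{\,t-1} r(s_t, a_t)\Big].$$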
Q: Does $p(s_t, a_t)$ converge to a stationary distribution as $t \to \infty$?
Ergodic property: every state can be reached from every other state with non-zero probability.
"Stationary": the distribution is the same before and after a transition, i.e. $\mu = \mathcal{T}\mu$.
This means that $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue 1 (such a $\mu$ always exists under some regularity conditions, e.g. ergodicity).
As $t \to \infty$, the sum of rewards becomes dominated by terms drawn from the stationary distribution: there are infinitely many time steps where $(s_t, a_t) \sim \mu$ and only finitely many non-stationary state/action pairs, so the average-reward objective converges to $E_{(s,a)\sim\mu}[r(s,a)]$.
Expectations and Stochastic Systems
RL is about maximizing the expectation of rewards. Expectations can be continuous (and even smooth) in the parameters of the corresponding distributions even when the function we are taking the expectation of is itself highly discontinuous. Because of this property, we can use smooth optimization methods like gradient descent.
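A small numerical sketch of this point (the reward function, policy parameterization, and numbers are made up for illustration): the reward is a discontinuous step function of the action, but its expectation under a Bernoulli policy is a smooth (in fact linear) function of the policy parameter $\theta$.

```python
import numpy as np

# Discontinuous reward as a function of the action: a step function.
def reward(action: int) -> float:
    return 1.0 if action == 1 else -1.0

# Bernoulli policy: pick action 1 with probability theta, otherwise action 0.
def expected_reward(theta: float, n_samples: int = 100_000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    actions = (rng.random(n_samples) < theta).astype(int)   # sample a ~ pi_theta
    rewards = np.array([reward(a) for a in actions])        # apply the discontinuous r(a)
    return float(rewards.mean())                            # Monte Carlo estimate of E_pi[r]

# Exact expectation: E[r] = theta*1 + (1 - theta)*(-1) = 2*theta - 1,
# which is smooth in theta even though reward() itself is a step function.
for theta in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"theta={theta:.2f}  MC={expected_reward(theta):+.3f}  exact={2 * theta - 1:+.3f}")
```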
High-level anatomy of reinforcement learning algorithms
Which parts are expensive?
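A rough sketch of the generic loop most RL algorithms follow (generate samples by running the policy, fit a model / estimate the return, improve the policy); the helper names below are hypothetical placeholders, not a specific library API.

```python
# Hypothetical sketch of the generic RL training loop -- not a real API.
def collect_trajectories(env, policy):
    """Run the current policy in the environment and return sampled trajectories."""
    raise NotImplementedError  # e.g. roll out the policy for N episodes

def estimate_returns(trajectories):
    """Estimate returns (or fit Q/V, or fit a dynamics model) from the samples."""
    raise NotImplementedError  # e.g. sum rewards along each trajectory

def improve_policy(policy, trajectories, estimates):
    """Update the policy using the estimates, e.g. via a gradient step."""
    raise NotImplementedError

def reinforcement_learning_loop(policy, env, n_iterations):
    for _ in range(n_iterations):
        trajectories = collect_trajectories(env, policy)            # 1. generate samples
        estimates = estimate_returns(trajectories)                  # 2. fit a model / estimate return
        policy = improve_policy(policy, trajectories, estimates)    # 3. improve the policy
    return policy
```

Which part is expensive depends on the algorithm: collecting real-world samples can dominate, fitting a large model can dominate, or the estimation step can be a nearly free sum of rewards.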
Value Function & Q Function
Remember that RL problems are framed as the $\arg\max$ of an expectation of rewards, but how do we actually deal with these expectations?
We can write this expectation out explicitly as a nested expectation (looks like a DP algorithm!)
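Sketching this out for the objective $E_{\tau\sim p_\theta(\tau)}\big[\sum_t r(s_t, a_t)\big]$, peeling off one time step at a time:

$$E_{s_1\sim p(s_1)}\Big[E_{a_1\sim\pi(a_1\mid s_1)}\Big[r(s_1,a_1)+E_{s_2\sim p(s_2\mid s_1,a_1)}\big[E_{a_2\sim\pi(a_2\mid s_2)}\big[r(s_2,a_2)+\dots\mid s_2\big]\mid s_1,a_1\big]\,\Big|\,s_1\Big]\Big].$$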
Then our equation is simplified into
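If we let $Q(s_1, a_1)$ denote the total expected reward from taking $a_1$ in $s_1$ onward (i.e. $r(s_1,a_1)$ plus the inner nested expectation above), the objective collapses to:

$$E_{s_1\sim p(s_1)}\big[E_{a_1\sim\pi(a_1\mid s_1)}\big[Q(s_1, a_1)\mid s_1\big]\big].$$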
It's a lot easier to modify the parameter $\theta$ of $\pi_\theta(a_1\mid s_1)$ if $Q(s_1, a_1)$ is known.
Q-Function Definition
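The standard definition: the total expected reward from taking $a_t$ in $s_t$ and then following $\pi$,

$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t, a_t\big].$$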
Value Function Definition
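The standard definition: the total expected reward from $s_t$, averaging over the action under $\pi$,

$$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\big[r(s_{t'}, a_{t'}) \mid s_t\big] = E_{a_t\sim\pi(a_t\mid s_t)}\big[Q^\pi(s_t, a_t)\big].$$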
Now $E_{s_1\sim p(s_1)}[V^\pi(s_1)]$ is the RL objective.
To use those two functions:
If we have a policy $\pi$ and we know $Q^\pi(s,a)$, then we can improve $\pi$ by setting $\pi'(a\mid s) = 1$ if $a = \arg\max_a Q^\pi(s,a)$ (and 0 otherwise).
(This policy is at least as good as π and probably better)
OR
Compute a gradient to increase the probability of good actions $a$:
If $Q^\pi(s,a) > V^\pi(s)$, then $a$ is better than average. Recall that $V^\pi(s) = E[Q^\pi(s,a)]$ under $\pi(a\mid s)$.
Modify $\pi(a\mid s)$ to increase the probability of $a$ if $Q^\pi(s,a) > V^\pi(s)$.
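This difference is usually called the advantage; it quantifies how much better an action is than the policy's average behavior in that state:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s).$$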