We are actually trying to learn a distribution over actions $a$ given the observation $o$, where $\theta$ parametrizes the policy $\pi$. We add the subscript $t$ to each term to denote the index in the time series, since in RL actions at different time steps are usually correlated.
Sometimes instead of $\pi_\theta(a_t \mid o_t)$ we will see $\pi_\theta(a_t \mid s_t)$.
Difference: we can use the observation to infer the state, but sometimes (in this example, the car in front of the cheetah) the observation does not give enough information to infer the state.
The state is the true configuration of the system; the observation is something produced by the state, which may or may not be enough to deduce the state.
🔥
The state is usually assumed to be Markovian.
The Markov property is a very important property ⇒ if you want to change the future, you only need to focus on the current state and your current action.
However, is the future state independent of the past observations? In general, observations do not satisfy the Markov property.
Many RL algorithms in this course require the Markov property.
Side note on notation
| RL world (Richard Bellman) | Robotics (Lev Pontryagin) |
| --- | --- |
| $s_t$: state | $x_t$: state |
| $a_t$: action | $u_t$: action |
| $r(s, a)$: reward | $c(x, u)$: cost $= -r(x, u)$ |
In general, we want to minimize:
$$\min_\theta \; E_{s_{1:T},\, a_{1:T}}\left[\sum_t c(s_t, a_t)\right]$$
We can write the function as either $c$ or $r$, standing for (minimized) cost or (maximized) reward.
🔥
Notice that we want to maximize the reward (or minimize the cost) in expectation over the states we will encounter at test / deployment time, not over the training states.
Definition of Markov Chain
$\mathcal{M} = \{\mathcal{S}, \mathcal{T}\}$
- $\mathcal{S}$: state space; $s \in \mathcal{S}$: state (discrete or continuous)
- $\mathcal{T}$: transition operator, described by $p(s_{t+1} \mid s_t)$
The transition “operator”?
Let $\mu_{t,i} = p(s_t = i)$; then $\mu_t$ is a vector of probabilities. Let $\mathcal{T}_{i,j} = p(s_{t+1} = i \mid s_t = j)$; then $\mu_{t+1} = \mathcal{T}\mu_t$.
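To make the operator view concrete, here is a minimal numpy sketch (the 3-state chain is made up for illustration): the transition operator is just a column-stochastic matrix, and applying it advances the state marginal.

```python
import numpy as np

# Hypothetical 3-state Markov chain. Column j holds p(s_{t+1} = i | s_t = j),
# so each column sums to 1 and mu_{t+1} = T @ mu_t.
T = np.array([
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.3],
    [0.0, 0.1, 0.7],
])

mu = np.array([1.0, 0.0, 0.0])   # start deterministically in state 0

for t in range(5):
    mu = T @ mu                  # one application of the transition operator
    print(t + 1, mu, mu.sum())   # stays a valid probability vector (sums to 1)
```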
Definition of Markov Decision Process
$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{T}, r\}$
- $\mathcal{S}$: state space; $s \in \mathcal{S}$: state (discrete or continuous)
- $\mathcal{A}$: action space; $a \in \mathcal{A}$: action (discrete or continuous)
- $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: reward function
- $\mathcal{T}$: transition operator (now a tensor). Let $\mu_{t,i} = p(s_t = i)$ and $\xi_{t,k} = p(a_t = k)$, and define $\mathcal{T}_{i,j,k} = p(s_{t+1} = i \mid s_t = j, a_t = k)$; then $\mu_{t+1,i} = \sum_{j,k} \mathcal{T}_{i,j,k}\, \mu_{t,j}\, \xi_{t,k}$.
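Analogously, a small sketch of the tensor case: with actions, the transition operator has three indices and the update is a contraction over the current state and action marginals (the random dynamics and fixed action distribution $\xi$ below are purely illustrative).

```python
import numpy as np

n_states, n_actions = 3, 2

# Hypothetical tensor T[i, j, k] = p(s_{t+1} = i | s_t = j, a_t = k).
rng = np.random.default_rng(0)
T = rng.random((n_states, n_states, n_actions))
T /= T.sum(axis=0, keepdims=True)          # normalize over next state i

mu = np.full(n_states, 1.0 / n_states)     # mu_{t,j} = p(s_t = j)
xi = np.array([0.5, 0.5])                  # xi_{t,k} = p(a_t = k)

# mu_{t+1, i} = sum_{j,k} T[i, j, k] * mu[j] * xi[k]
mu_next = np.einsum('ijk,j,k->i', T, mu, xi)
print(mu_next, mu_next.sum())              # still sums to 1
```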
Partially Observed Markov Decision Process
$\mathcal{M} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{E}, r\}$
- $\mathcal{S}$: state space; $s \in \mathcal{S}$: state (discrete or continuous)
- $\mathcal{A}$: action space; $a \in \mathcal{A}$: action (discrete or continuous)
- $\mathcal{O}$: observation space; $o \in \mathcal{O}$: observation (discrete or continuous)
- $\mathcal{T}$: transition operator
- $\mathcal{E}$: emission probability $p(o_t \mid s_t)$
- $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: reward function
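A short sketch of the emission probability as a matrix (hypothetical numbers): it maps the state marginal to an observation marginal.

```python
import numpy as np

# Hypothetical emission matrix E[o, s] = p(o_t = o | s_t = s); columns sum to 1.
E = np.array([
    [0.8, 0.1, 0.2],
    [0.2, 0.9, 0.8],
])

mu = np.array([0.5, 0.3, 0.2])   # p(s_t): state marginal from the chain above
p_obs = E @ mu                   # p(o_t) = sum_s p(o_t | s_t = s) p(s_t = s)
print(p_obs, p_obs.sum())        # a valid distribution over observations
```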
Goal of RL (Finite Horizon / Episodic)
$p_\theta(\tau)$, where $\tau$ denotes a trajectory $(s_1, a_1, \dots, s_T, a_T)$:
$$p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \underbrace{\pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}_{\text{Markov chain on } (s, a)}$$
So our goal is to maximize the expectation of rewards
$$\theta^* = \arg\max_\theta \; E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$$
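As a rough illustration of what this objective means operationally, here is a sketch of a Monte Carlo estimate of $E_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t, a_t)]$ by rolling out the policy; `env` and `policy` are hypothetical placeholders (a gym-style `reset()`/`step()` interface is assumed, not anything defined in these notes).

```python
import numpy as np

def estimate_objective(env, policy, n_trajectories=100, horizon=50):
    """Monte Carlo estimate of E_{tau ~ p_theta}[ sum_t r(s_t, a_t) ].

    `env` and `policy` are hypothetical: `env` is assumed to expose a
    gym-style reset()/step() interface and `policy(s)` samples a ~ pi_theta(. | s).
    """
    returns = []
    for _ in range(n_trajectories):
        s = env.reset()
        total = 0.0
        for _ in range(horizon):
            a = policy(s)              # a_t ~ pi_theta(a_t | s_t)
            s, r, done = env.step(a)   # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
            total += r
            if done:
                break
        returns.append(total)
    return np.mean(returns)
```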
Group the action and state together to form an augmented state. The augmented states form a Markov chain: $p\big((s_{t+1}, a_{t+1}) \mid (s_t, a_t)\big) = p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_{t+1} \mid s_{t+1})$.
Use the average reward: $\theta^* = \arg\max_\theta \frac{1}{T} E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$
Use discounts
Q: Does $p(s_t, a_t)$ converge to a stationary distribution as $t \to \infty$?
Ergodic property: every state can be reached from every other state with non-zero probability.
"Stationary": the same before and after a transition. A stationary distribution satisfies $\mu = \mathcal{T}\mu$.
$(\mathcal{T} - I)\mu = 0$
This means that $\mu$ is an eigenvector of $\mathcal{T}$ with eigenvalue 1 (such an eigenvector always exists under some regularity conditions).
$\mu = p_\theta(s, a)$: the stationary distribution.
As $t \to \infty$, the sum of rewards becomes dominated by the stationary distribution: almost all of the infinitely many state-action terms are drawn from $\mu$, and only a finite prefix is non-stationary.
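A small numpy sketch of finding such a stationary distribution numerically, using the eigenvector characterization above (the column-stochastic transition matrix is again a made-up example):

```python
import numpy as np

# Hypothetical column-stochastic transition matrix on the augmented (s, a) chain.
T = np.array([
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.3],
    [0.0, 0.1, 0.7],
])

# mu is the eigenvector of T with eigenvalue 1: (T - I) mu = 0.
eigvals, eigvecs = np.linalg.eig(T)
idx = np.argmin(np.abs(eigvals - 1.0))
mu = np.real(eigvecs[:, idx])
mu /= mu.sum()                      # normalize to a probability distribution

print(mu)
print(np.allclose(T @ mu, mu))      # True: unchanged by the transition operator
```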
RL is about maximizing the expectation of rewards. Expectations can be continuous in the parameters of the corresponding distributions even when the function we are taking the expectation of is itself highly discontinuous. Because of this property we can use smooth optimization methods like gradient descent.
📌
Because the probability distribution we take the expectation under usually has a continuous (smooth) PDF in the policy parameters $\theta$.
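A toy illustration (not from the lecture): the "reward" below is a discontinuous step function of the action, yet its expectation under a Gaussian policy with mean $\theta$ varies smoothly with $\theta$, which is what makes gradient-based optimization possible.

```python
import numpy as np

# Discontinuous reward r(a) = 1 if a > 0 else 0, with actions a ~ N(theta, 1).
# r is a step function, but E_theta[r(a)] equals the Gaussian CDF at theta,
# which is a smooth function of theta.
def expected_reward(theta, n_samples=200_000, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.normal(loc=theta, scale=1.0, size=n_samples)
    return np.mean(a > 0.0)                 # Monte Carlo estimate of E[r]

for theta in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(theta, expected_reward(theta))    # varies smoothly with theta
```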
High-level anatomy of reinforcement learning algorithms
RL algorithms generally loop over three parts: generate samples (run the policy in the environment), fit a model / estimate the return, and improve the policy.
Which parts are expensive? It depends on the algorithm: collecting samples is cheap in simulation but costly in the real world, and the fitting / improvement steps range from a trivial update to training a large neural network.
Value Function & Q Function
Remember that RL problems are framed as an $\arg\max$ over an expectation of rewards, but how do we actually deal with that expectation?
📌
Write it out recursively!
We can write this expectation out explicitly as a nested expectation (it looks like a dynamic-programming recursion):
$$E_{s_1 \sim p(s_1)}\Big[ E_{a_1 \sim \pi(a_1 \mid s_1)}\big[ r(s_1, a_1) + E_{s_2 \sim p(s_2 \mid s_1, a_1)}\big[ E_{a_2 \sim \pi(a_2 \mid s_2)}[ r(s_2, a_2) + \dots \mid s_2] \mid s_1, a_1 \big] \mid s_1 \big] \Big]$$
It looks messy, but what if we had a function that told us the value of the quantity inside $E_{a_1 \sim \pi(a_1 \mid s_1)}[\cdot]$?
Specifically, it's the value of
$$r(s_1, a_1) + E_{s_2 \sim p(s_2 \mid s_1, a_1)}\left[ E_{a_2 \sim \pi(a_2 \mid s_2)}\left[ r(s_2, a_2) + \dots \mid s_2 \right] \mid s_1, a_1 \right]$$
We will define it as our Q function
It's a lot easier to modify the parameters $\theta$ of $\pi_\theta(a_1 \mid s_1)$ if $Q(s_1, a_1)$ is known.
Q-Function Definition
$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_t, a_t \right]$ → total reward from taking $a_t$ in $s_t$
Value Function Definition
$V^\pi(s_t) = \sum_{t'=t}^{T} E_{\pi_\theta}\left[ r(s_{t'}, a_{t'}) \mid s_t \right]$ → total reward from $s_t$
$V^\pi(s_t) = E_{a_t \sim \pi(a_t \mid s_t)}\left[ Q^\pi(s_t, a_t) \right]$
Now $E_{s_1 \sim p(s_1)}\left[ V^\pi(s_1) \right]$ is the RL objective.
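A sketch of how these definitions become a computation for a small tabular, finite-horizon MDP: $Q^\pi$ and $V^\pi$ follow from the backward recursion $Q^\pi(s_t, a_t) = r(s_t, a_t) + E[V^\pi(s_{t+1})]$ and $V^\pi(s_t) = E_{a_t \sim \pi}[Q^\pi(s_t, a_t)]$. The MDP and policy below are made up for illustration.

```python
import numpy as np

def q_and_v(P, R, pi, T_horizon):
    """Finite-horizon Q^pi and V^pi for a tabular MDP (illustrative sketch).

    P[s, a, s'] = p(s' | s, a), R[s, a] = r(s, a), pi[s, a] = pi(a | s).
    Returns arrays Q[t, s, a] and V[t, s].
    """
    n_s, n_a = R.shape
    Q = np.zeros((T_horizon, n_s, n_a))
    V = np.zeros((T_horizon, n_s))
    for t in reversed(range(T_horizon)):
        V_next = V[t + 1] if t + 1 < T_horizon else np.zeros(n_s)
        # Q^pi(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1}}[ V^pi(s_{t+1}) ]
        Q[t] = R + P @ V_next
        # V^pi(s_t) = E_{a_t ~ pi}[ Q^pi(s_t, a_t) ]
        V[t] = (pi * Q[t]).sum(axis=1)
    return Q, V

# Hypothetical 3-state, 2-action MDP with a uniform policy.
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3)); P /= P.sum(axis=-1, keepdims=True)
R = rng.random((3, 2))
pi = np.full((3, 2), 0.5)
Q, V = q_and_v(P, R, pi, T_horizon=10)
print(V[0])   # averaging V[0] over p(s_1) gives the RL objective
```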
To use those two functions:
If we have a policy $\pi$ and we know $Q^\pi(s, a)$, then we can improve $\pi$:
set $\pi'(a \mid s) = 1$ if $a = \arg\max_a Q^\pi(s, a)$ (and $0$ otherwise)
(This policy is at least as good as π and probably better)
OR
Compute a gradient to increase the probability of good actions $a$:
If $Q^\pi(s, a) > V^\pi(s)$, then $a$ is better than average. Recall that $V^\pi(s) = E[Q^\pi(s, a)]$ under $\pi(a \mid s)$.
Modify $\pi(a \mid s)$ to increase the probability of $a$ if $Q^\pi(s, a) > V^\pi(s)$.
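A sketch of both ideas for a tabular policy, reusing a $Q$ table like the one computed above. The advantage-based update is an illustrative softening of "increase the probability where $Q^\pi(s,a) > V^\pi(s)$", not the exact update from the lecture.

```python
import numpy as np

def greedy_improvement(Q):
    """Idea 1: pi'(a|s) = 1 for a = argmax_a Q^pi(s, a), 0 otherwise."""
    n_s, n_a = Q.shape
    pi_new = np.zeros((n_s, n_a))
    pi_new[np.arange(n_s), Q.argmax(axis=1)] = 1.0
    return pi_new

def advantage_improvement(pi, Q, step=1.0):
    """Idea 2 (illustrative): raise pi(a|s) where the advantage
    A(s, a) = Q^pi(s, a) - V^pi(s) is positive, lower it where negative."""
    V = (pi * Q).sum(axis=1, keepdims=True)   # V^pi(s) = E_{a ~ pi}[Q^pi(s, a)]
    A = Q - V
    logits = np.log(pi + 1e-8) + step * A     # push probability toward A > 0
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)
```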