Optimal Control and Planning Algorithms 🔥
In model-based learning, we need an optimal policy extraction step to extract a policy from a known (or learned) model of the environment.
Model-free methods completely discard the transition dynamics and instead rely on sampled data points.
But we often do know the dynamics:
- Games
- Easily modeled systems
- Simulated environments

Or we can learn the dynamics:
- System identification (fit unknown parameters of a known model)
- Learning (fit a general-purpose model to observed transition data)

Knowing the dynamics makes things easier. But how do we extract policies from known dynamics?
Objective of optimal control:
$$\min_{a_1, \dots, a_T} \mathbb{E}\big(c(s_1,s_2,\dots,s_T)\mid a_1,\dots,a_T\big)$$
Open-loop Planning
$$a_1,\dots,a_T = \arg\max_{a_1,\dots,a_T} \mathbb{E}\Big[\sum_t r(s_t,a_t)\,\Big|\,a_1,\dots,a_T\Big]$$
This is sub-optimal because it commits to all actions up front: after executing some of them, we will receive additional information that the plan cannot take into account.
Stochastic Optimization
Abstracts away optimal control/planning ⇒ just optimizes a function:
$$a_1,\dots,a_T = \arg\max_{a_1,\dots,a_T} J(a_1,\dots,a_T) \quad\rightarrow\quad A = \arg\max_{A} J(A)$$
Simplest method: guess and check ("random shooting" method):
1. Pick $A_1, \dots, A_N$ from some distribution (e.g. uniform)
2. Choose $A_i$ based on $\arg\max_i J(A_i)$
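A minimal sketch of random shooting following these two steps, assuming a hypothetical `evaluate_plan(A)` that rolls out the action sequence $A$ in the known (or learned) model and returns $J(A)$; all names and defaults are illustrative.

```python
import numpy as np

def random_shooting(evaluate_plan, horizon, action_dim, n_candidates=1000,
                    low=-1.0, high=1.0, rng=None):
    """Sample N candidate action sequences uniformly and return the best one."""
    rng = rng or np.random.default_rng()
    # A_1, ..., A_N ~ Uniform(low, high), each of shape (horizon, action_dim)
    candidates = rng.uniform(low, high, size=(n_candidates, horizon, action_dim))
    # Evaluate J(A_i) for each candidate plan
    returns = np.array([evaluate_plan(A) for A in candidates])
    # Choose A_i with argmax_i J(A_i)
    return candidates[np.argmax(returns)]
```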
Cross-Entropy Method (CEM)
Iterative improvement of the “random shooting method”
1. Sample $A_1, \dots, A_N$ from $p(A)$ (typically starting from a Gaussian distribution)
2. Evaluate $J(A_1), \dots, J(A_N)$
3. Pick the elites $A_{i_1}, \dots, A_{i_M}$ with the highest values ($M < N$)
4. Refit $p(A)$ to the elites $A_{i_1}, \dots, A_{i_M}$
5. Repeat this process (a minimal CEM sketch follows the pros/cons list below)

See also CMA-ES ⇒ CEM with momentum.
- Very fast if parallelized
- Extremely simple
- But: very harsh dimensionality limit
- Only open-loop planning
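A minimal CEM sketch following the steps above, again assuming the hypothetical `evaluate_plan(A)`; the sampling distribution $p(A)$ is a diagonal Gaussian refit to the elites each iteration.

```python
import numpy as np

def cem_plan(evaluate_plan, horizon, action_dim, n_samples=500, n_elites=50,
             n_iters=5, init_std=1.0, rng=None):
    """Cross-Entropy Method over open-loop action sequences A = (a_1, ..., a_T)."""
    rng = rng or np.random.default_rng()
    mean = np.zeros((horizon, action_dim))
    std = np.full((horizon, action_dim), init_std)
    for _ in range(n_iters):
        # 1. Sample A_1, ..., A_N from p(A) = N(mean, std^2)
        samples = rng.normal(mean, std, size=(n_samples, horizon, action_dim))
        # 2. Evaluate J(A_i)
        returns = np.array([evaluate_plan(A) for A in samples])
        # 3. Keep the M elites with the highest values
        elites = samples[np.argsort(returns)[-n_elites:]]
        # 4. Refit p(A) to the elites
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # the final mean is the planned action sequence
```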
Closed-loop Planning
Monte Carlo Tree Search (MCTS)
Discrete planning can be viewed as a tree search problem, but with vanilla tree search the complexity is exponential in the number of time steps. So can we restrict the search without expanding the whole tree?
The idea is to stop expanding at some depth and, from that state, approximate the outcomes that come after by rolling out a baseline policy.
Which path do we search first? Choose nodes with the best reward, but also prefer rarely visited nodes.
So the algorithm:
1. Find a leaf $s_l$ using $\text{TreePolicy}(s_1)$
2. Evaluate the leaf using $\text{DefaultPolicy}(s_l)$ (start from the root and execute every action in the tree to arrive at the leaf / a new leaf, due to randomness)
3. Update all values in the tree between $s_1$ and $s_l$
4. Repeat this iteration, then choose the best action from $s_1$
UCT TreePolicy($s_t$):
If $s_t$ is not fully expanded, choose a new $a_t$; else choose the child with the best $\text{Score}(s_{t+1})$
$$\text{Score}(s_t)=\frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2 \ln N(s_{t-1})}{N(s_t)}}$$
Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, Samothrakis, Colton. (2012). A Survey of Monte Carlo Tree Search Methods.
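A sketch of the UCT score and tree policy, assuming a hypothetical `Node` class with fields `q` (sum of returns), `n` (visit count), `children`, `untried_actions`, and methods `is_terminal()` / `expand(a)`; the exploration constant $C$ is a tunable hyperparameter.

```python
import math
import random

def uct_score(child, parent, C=1.0):
    # Exploitation term Q(s)/N(s) plus an exploration bonus for rarely visited nodes
    return child.q / child.n + 2 * C * math.sqrt(2 * math.log(parent.n) / child.n)

def tree_policy(node):
    """Descend the tree: expand an untried action if any, else follow the best UCT child."""
    while not node.is_terminal():
        if node.untried_actions:                       # not fully expanded: choose a new action
            return node.expand(random.choice(node.untried_actions))
        node = max(node.children, key=lambda c: uct_score(c, node))
    return node
```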
Shooting Methods vs. Collocation
- Shooting method: optimize over actions only (earlier actions have larger effects on the trajectory)
- Collocation method: optimize over actions and states, with constraints

Trajectory Optimization
Because this is a trajectory optimization lecture, we will use $x_t$ for the state and $u_t$ for the action. If we want to use gradients, it usually helps to use a 2nd-order method when optimizing via backpropagation!
Linear Case: Linear Quadratic Regulator (LQR)
Objective:
$$\min_{u_1,\dots,u_T} c(x_1,u_1)+c(f(x_1,u_1),u_2)+\cdots+c(f(f(\dots)\dots),u_T)$$
We will assume that
$$f(x_t,u_t)=F_t\begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t$$
$$c(x_t,u_t)=\frac{1}{2} \begin{bmatrix} x_t \\ u_t \end{bmatrix}^\top C_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + \begin{bmatrix} x_t \\ u_t \end{bmatrix}^\top c_t$$
Base case: solve for $u_T$ only.
$$Q(x_T,u_T)=\text{const}+\frac{1}{2}\begin{bmatrix} x_T \\ u_T \end{bmatrix}^{\top} C_T \begin{bmatrix} x_T \\ u_T \end{bmatrix} + \begin{bmatrix} x_T \\ u_T \end{bmatrix}^\top c_T$$
We can break $C_T$ and $c_T$ into the pieces that act on $u$ and the pieces that act on $x$.
(Note that here we assume $C_{u_T, x_T} = C_{x_T,u_T}^\top$.)
$$C_T = \begin{bmatrix} C_{x_T,x_T} & C_{x_T,u_T} \\ C_{u_T,x_T} & C_{u_T,u_T} \end{bmatrix} \qquad c_T = \begin{bmatrix} c_{x_T} \\ c_{u_T} \end{bmatrix}$$
$$\nabla_{u_T} Q(x_T,u_T) = C_{u_T,x_T} x_T + C_{u_T,u_T} u_T + c_{u_T}^{\top} = 0$$
$$u_T = -C_{u_T,u_T}^{-1}(C_{u_T,x_T}x_T+c_{u_T}) = K_T x_T + k_T$$
where $K_T = -C_{u_T,u_T}^{-1}C_{u_T,x_T}$ and $k_T = -C_{u_T,u_T}^{-1}c_{u_T}$. Since $u_T$ is fully determined by $x_T$, we can write our cost function as
$$\begin{split}
V(x_T) &= \text{const} + \frac{1}{2} \begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix}^{\top} C_T \begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix} + \begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix}^\top c_T \\
&= \text{const} + \frac{1}{2} x_T^\top V_T x_T + x_T^\top v_T
\end{split}$$
where
$$V_T = C_{x_T, x_T}+C_{x_T, u_T}K_T + K_T^\top C_{u_T,x_T} + K_T^\top C_{u_T,u_T}K_T$$
$$v_T = c_{x_T}+C_{x_T,u_T}k_T+K_T^\top c_{u_T} + K_T^\top C_{u_T, u_T}k_T$$
So what if we solve for $u_{T-1}$? ⇒ $u_{T-1}$ affects $x_T$!
$$f(x_{T-1},u_{T-1})=x_T = F_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + f_{T-1}$$
$$Q(x_{T-1},u_{T-1})=\text{const}+\frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top C_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^{\top} c_{T-1} + V(f(x_{T-1},u_{T-1}))$$
$$V(x_T)=\text{const} + \frac{1}{2} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top V_T F_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top V_T f_{T-1} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top v_T$$
Combining the two equations:
$$Q(x_{T-1},u_{T-1})=\text{const} + \frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top Q_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top q_{T-1}$$
$$Q_{T-1} = C_{T-1} + F_{T-1}^\top V_T F_{T-1}$$
$$q_{T-1} = c_{T-1} + F_{T-1}^\top V_T f_{T-1} + F_{T-1}^\top v_T$$
Now we can take the derivative:
$$\nabla_{u_{T-1}} Q(x_{T-1}, u_{T-1}) = Q_{u_{T-1},x_{T-1}}x_{T-1} + Q_{u_{T-1},u_{T-1}} u_{T-1} + q_{u_{T-1}}^\top = 0$$
$$u_{T-1} = K_{T-1}x_{T-1} + k_{T-1}$$
where
$$K_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} Q_{u_{T-1},x_{T-1}} \qquad k_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} q_{u_{T-1}}$$
So it looks like we can start from the last timestep and do a backward recursion!
After doing the backward recursion, we can compute $u_t$ and $x_t$ by a forward recursion.
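A sketch of the backward and forward recursions above in numpy, assuming the matrices $F_t, f_t, C_t, c_t$ are given as lists and that `n` is the state dimension (so block indices `[:n]` refer to $x$ and `[n:]` to $u$); this is a bare-bones illustration, not a robust implementation.

```python
import numpy as np

def lqr_backward(F, f, C, c, n):
    """Backward recursion: compute gains K_t, k_t from t = T down to 1."""
    T = len(F)
    Ks, ks = [None] * T, [None] * T
    V, v = np.zeros((n, n)), np.zeros(n)          # V_{T+1} = 0, v_{T+1} = 0
    for t in reversed(range(T)):
        Q = C[t] + F[t].T @ V @ F[t]              # Q_t = C_t + F_t^T V_{t+1} F_t
        q = c[t] + F[t].T @ (V @ f[t] + v)        # q_t = c_t + F_t^T V_{t+1} f_t + F_t^T v_{t+1}
        Qxx, Qxu, Qux, Quu = Q[:n, :n], Q[:n, n:], Q[n:, :n], Q[n:, n:]
        qx, qu = q[:n], q[n:]
        Ks[t] = -np.linalg.solve(Quu, Qux)        # K_t = -Q_{u,u}^{-1} Q_{u,x}
        ks[t] = -np.linalg.solve(Quu, qu)         # k_t = -Q_{u,u}^{-1} q_u
        # V_t, v_t for the next (earlier) timestep
        V = Qxx + Qxu @ Ks[t] + Ks[t].T @ Qux + Ks[t].T @ Quu @ Ks[t]
        v = qx + Qxu @ ks[t] + Ks[t].T @ qu + Ks[t].T @ Quu @ ks[t]
    return Ks, ks

def lqr_forward(F, f, Ks, ks, x1):
    """Forward recursion: u_t = K_t x_t + k_t, x_{t+1} = F_t [x_t; u_t] + f_t."""
    xs, us, x = [], [], x1
    for t in range(len(F)):
        u = Ks[t] @ x + ks[t]
        xs.append(x)
        us.append(u)
        x = F[t] @ np.concatenate([x, u]) + f[t]
    return xs, us
```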
LQR for Stochastic and Nonlinear Systems
Stochastic dynamics:
If
$$p(x_{t+1}\mid x_t,u_t) = \mathcal{N}\left(F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t,\ \Sigma_t\right)$$
then just choose actions according to the linear case (since the randomness is symmetric, the noise does not change the optimal controller).
Nonlinear case: Differential Dynamic Programming (DDP) / Iterative LQR (iLQR)
Can we approximate a nonlinear system as a linear-quadratic system (locally, near a reference point)?
$$f(x_t,u_t) \approx f(\hat{x}_t,\hat{u}_t)+\nabla_{x_t,u_t} f(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}$$
$$c(x_t,u_t) \approx c(\hat{x}_t, \hat{u}_t)+\nabla_{x_t,u_t} c(\hat{x}_t,\hat{u}_t)^\top \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}+\frac{1}{2} \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}^\top \nabla_{x_t, u_t}^2 c(\hat{x}_t, \hat{u}_t) \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}$$
We will denote the deviations $x_t - \hat{x}_t$ and $u_t - \hat{u}_t$ as $\delta x_t$ and $\delta u_t$.
$$\bar{f}(\delta x_t, \delta u_t) = F_t \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}$$
$$\bar{c}(\delta x_t, \delta u_t) = \frac{1}{2} \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^\top C_t \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix} + \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^\top c_t$$
where $F_t = \nabla_{x_t,u_t} f(\hat{x}_t, \hat{u}_t)$, $C_t = \nabla_{x_t,u_t}^2 c(\hat{x}_t, \hat{u}_t)$, $c_t = \nabla_{x_t,u_t} c(\hat{x}_t, \hat{u}_t)$.
Now we can run LQR with dynamics $\bar{f}$ and cost $\bar{c}$ to find $\delta x_t$ and $\delta u_t$.
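In practice, $F_t$, $C_t$, and $c_t$ can be obtained by automatic differentiation of the nonlinear `dynamics(x, u)` and scalar `cost(x, u)` around the current trajectory; here is a sketch using JAX (the function names are assumptions, not a fixed API).

```python
import jax
import jax.numpy as jnp

def linearize(dynamics, cost, x_hat, u_hat):
    """Return F_t, C_t, c_t at the expansion point (x_hat, u_hat)."""
    # F_t: Jacobian of the dynamics w.r.t. (x, u), stacked side by side
    Jx, Ju = jax.jacobian(dynamics, argnums=(0, 1))(x_hat, u_hat)
    F = jnp.concatenate([Jx, Ju], axis=1)
    # c_t: gradient of the cost w.r.t. (x, u)
    gx, gu = jax.grad(cost, argnums=(0, 1))(x_hat, u_hat)
    c = jnp.concatenate([gx, gu])
    # C_t: Hessian of the cost w.r.t. (x, u), assembled as a block matrix
    (Hxx, Hxu), (Hux, Huu) = jax.hessian(cost, argnums=(0, 1))(x_hat, u_hat)
    C = jnp.block([[Hxx, Hxu], [Hux, Huu]])
    return F, C, c
```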
iLQR is basically an approximation of Newton's method for solving the control objective.
But we used a first-order dynamics approximation here; to get the full Newton's method, we need a second-order dynamics approximation (this is DDP).
But this may be a bad idea, because we always jump to the minimum of the approximation, and the approximation is only valid within a short range of the current trajectory. We can scale the $k_t$ term by a constant $\alpha$ in the forward pass ⇒ this controls how far we move from our starting trajectory (the smaller $\alpha$ is, the closer we stay to the original trajectory), and we can do a line search over $\alpha$. A sketch of one iLQR iteration with this $\alpha$ line search follows below.
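A sketch of one iLQR iteration with the $\alpha$ line search, reusing the hypothetical `linearize` and `lqr_backward` helpers sketched above (with $f_t = 0$, since the $\delta$-formulation has no offset); `dynamics` and `cost` are the true nonlinear functions, and everything here is illustrative rather than a tuned implementation.

```python
import numpy as np

def ilqr_iteration(dynamics, cost, x_traj, u_traj, n,
                   alphas=(1.0, 0.5, 0.25, 0.1, 0.05)):
    """One iLQR iteration: linearize around the current trajectory, run the LQR
    backward pass on (delta x, delta u), then forward-pass on the true dynamics
    with a line search over alpha (which scales k_t)."""
    T = len(u_traj)
    # 1. Compute F_t, C_t, c_t around the current trajectory (see linearize above)
    F, C, c = zip(*[linearize(dynamics, cost, x_traj[t], u_traj[t]) for t in range(T)])
    # 2. Backward pass; in the delta formulation the offset f_t is zero
    Ks, ks = lqr_backward(list(F), [np.zeros(n)] * T, list(C), list(c), n)
    # 3. Forward pass with the *true* nonlinear dynamics and an alpha line search
    best_u = u_traj
    best_J = sum(cost(x_traj[t], u_traj[t]) for t in range(T))
    for alpha in alphas:
        x, new_x, new_u = x_traj[0], [x_traj[0]], []
        for t in range(T):
            u = u_traj[t] + Ks[t] @ (x - x_traj[t]) + alpha * ks[t]
            new_u.append(u)
            x = dynamics(x, u)
            new_x.append(x)
        J = sum(cost(new_x[t], new_u[t]) for t in range(T))
        if J < best_J:
            best_u, best_J = new_u, J
    return best_u  # repeat the whole iteration until the trajectory converges
```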
Additional Reading
- Mayne, Jacobson. (1970). Differential Dynamic Programming. The original differential dynamic programming algorithm.
- Tassa, Erez, Todorov. (2012). Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization. Practical guide for implementing nonlinear iterative LQR.
- Levine, Abbeel. (2014). Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics. Probabilistic formulation and trust-region alternative to deterministic line search.