Optimal Control and Planning Algorithms 🔥
In model-based learning, we need an optimal policy extraction step to extract a policy from a known (or learned) model of the environment.
Model-free methods completely discard the transition dynamics and instead rely on sampled data points.
But we often do know the dynamics:
- Games
- Easily modeled systems
- Simulated environments

Or we can learn the dynamics:
- System identification (fit unknown parameters of a known model)
- Learning (fit a general-purpose model to observed transition data)

Knowing the dynamics makes things easier. But how do we extract policies from known dynamics?
Objective of optimal control:
$$\min_{a_1, \dots, a_T} \mathbb{E}\big(c(s_1,s_2,\dots,s_T)\mid a_1,\dots,a_T\big)$$
Open-loop Planning
$$a_1,\dots,a_T = \arg\max_{a_1,\dots,a_T} \mathbb{E}\Big[\sum_t r(s_t,a_t)\,\Big|\,a_1,\dots,a_T\Big]$$
This is sub-optimal because it commits to all actions up front: after executing some of them, we will receive additional information that the plan cannot take into account.
Stochastic Optimization
Abstracts away optimal control/planning ⇒ just optimizes a function:
$$a_1,\dots,a_T = \arg\max_{a_1,\dots,a_T} J(a_1,\dots,a_T) \quad\rightarrow\quad A = \arg\max_{A} J(A)$$
Simplest method: guess and check ("random shooting" method):
1. Pick $A_1, \dots, A_N$ from some distribution (e.g. uniform)
2. Choose $A_i$ based on $\arg\max_i J(A_i)$
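A minimal sketch of random shooting following these two steps, assuming a hypothetical `evaluate_plan(A)` that rolls out the action sequence $A$ in the known (or learned) model and returns $J(A)$; all names and defaults are illustrative.

```python
import numpy as np

def random_shooting(evaluate_plan, horizon, action_dim, n_candidates=1000,
                    low=-1.0, high=1.0, rng=None):
    """Sample N candidate action sequences uniformly and return the best one."""
    rng = rng or np.random.default_rng()
    # A_1, ..., A_N ~ Uniform(low, high), each of shape (horizon, action_dim)
    candidates = rng.uniform(low, high, size=(n_candidates, horizon, action_dim))
    # Evaluate J(A_i) for each candidate plan
    returns = np.array([evaluate_plan(A) for A in candidates])
    # Choose A_i with argmax_i J(A_i)
    return candidates[np.argmax(returns)]
```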
Cross-Entropy Method (CEM)
Iterative improvement of the “random shooting method”
1. Sample $A_1, \dots, A_N$ from $p(A)$ (typically starting from a Gaussian distribution)
2. Evaluate $J(A_1), \dots, J(A_N)$
3. Pick the elites $A_{i_1}, \dots, A_{i_M}$ with the highest values ($M < N$)
4. Refit $p(A)$ to the elites $A_{i_1}, \dots, A_{i_M}$
5. Repeat this process (a minimal CEM sketch follows the pros/cons list below)

See also CMA-ES ⇒ CEM with momentum.
- Very fast if parallelized
- Extremely simple
- But: very harsh dimensionality limit
- Only open-loop planning
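A minimal CEM sketch following the steps above, again assuming the hypothetical `evaluate_plan(A)`; the sampling distribution $p(A)$ is a diagonal Gaussian refit to the elites each iteration.

```python
import numpy as np

def cem_plan(evaluate_plan, horizon, action_dim, n_samples=500, n_elites=50,
             n_iters=5, init_std=1.0, rng=None):
    """Cross-Entropy Method over open-loop action sequences A = (a_1, ..., a_T)."""
    rng = rng or np.random.default_rng()
    mean = np.zeros((horizon, action_dim))
    std = np.full((horizon, action_dim), init_std)
    for _ in range(n_iters):
        # 1. Sample A_1, ..., A_N from p(A) = N(mean, std^2)
        samples = rng.normal(mean, std, size=(n_samples, horizon, action_dim))
        # 2. Evaluate J(A_i)
        returns = np.array([evaluate_plan(A) for A in samples])
        # 3. Keep the M elites with the highest values
        elites = samples[np.argsort(returns)[-n_elites:]]
        # 4. Refit p(A) to the elites
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # the final mean is the planned action sequence
```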
Closed-loop Planning
Monte Carlo Tree Search (MCTS)
Discrete planning can be viewed as a tree search problem, but with vanilla tree search the complexity is exponential in the number of time steps. So can we restrict the search without expanding the whole tree?
The idea is to stop expanding at some depth and, from that state, approximate the outcomes that come after by rolling out a baseline policy.
Which path do we search first? Choose nodes with the best reward, but also prefer rarely visited nodes.
So the algorithm:
1. Find a leaf $s_l$ using $\text{TreePolicy}(s_1)$
2. Evaluate the leaf using $\text{DefaultPolicy}(s_l)$ (start from the root and execute every action in the tree to arrive at the leaf / a new leaf, due to randomness)
3. Update all values in the tree between $s_1$ and $s_l$
4. Repeat this iteration, then choose the best action from $s_1$
UCT TreePolicy($s_t$):
If $s_t$ is not fully expanded, choose a new $a_t$; else choose the child with the best $\text{Score}(s_{t+1})$
$$\text{Score}(s_t)=\frac{Q(s_t)}{N(s_t)}+2C\sqrt{\frac{2 \ln N(s_{t-1})}{N(s_t)}}$$
Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, Samothrakis, Colton. (2012). A Survey of Monte Carlo Tree Search Methods.
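A sketch of the UCT score and tree policy, assuming a hypothetical `Node` class with fields `q` (sum of returns), `n` (visit count), `children`, `untried_actions`, and methods `is_terminal()` / `expand(a)`; the exploration constant $C$ is a tunable hyperparameter.

```python
import math
import random

def uct_score(child, parent, C=1.0):
    # Exploitation term Q(s)/N(s) plus an exploration bonus for rarely visited nodes
    return child.q / child.n + 2 * C * math.sqrt(2 * math.log(parent.n) / child.n)

def tree_policy(node):
    """Descend the tree: expand an untried action if any, else follow the best UCT child."""
    while not node.is_terminal():
        if node.untried_actions:                       # not fully expanded: choose a new action
            return node.expand(random.choice(node.untried_actions))
        node = max(node.children, key=lambda c: uct_score(c, node))
    return node
```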
Shooting Methods vs. Collocation
- Shooting method: optimize over actions only (earlier actions have larger effects on the trajectory)
- Collocation method: optimize over actions and states, with constraints

Trajectory Optimization
Because this is a trajectory optimization lecture, we will use $x_t$ for the state and $u_t$ for the action. If we want to use gradients, it usually helps to use a 2nd-order method when optimizing via backpropagation!
Linear Case: Linear Quadratic Regulator (LQR)
Objective:
$$\min_{u_1,\dots,u_T} c(x_1,u_1)+c(f(x_1,u_1),u_2)+\cdots+c(f(f(\dots)\dots),u_T)$$
We will assume that
$$f(x_t,u_t)=F_t\begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t$$
$$c(x_t,u_t)=\frac{1}{2} \begin{bmatrix} x_t \\ u_t \end{bmatrix}^\top C_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + \begin{bmatrix} x_t \\ u_t \end{bmatrix}^\top c_t$$
Base case: solve for $u_T$ only.
$$Q(x_T,u_T)=\text{const}+\frac{1}{2}\begin{bmatrix} x_T \\ u_T \end{bmatrix}^{\top} C_T \begin{bmatrix} x_T \\ u_T \end{bmatrix} + \begin{bmatrix} x_T \\ u_T \end{bmatrix}^\top c_T$$
We can break $C_T$ and $c_T$ into the pieces that act on $u$ and the pieces that act on $x$.
(Note that here we assume $C_{u_T, x_T} = C_{x_T,u_T}^\top$.)
$$C_T = \begin{bmatrix} C_{x_T,x_T} & C_{x_T,u_T} \\ C_{u_T,x_T} & C_{u_T,u_T} \end{bmatrix} \qquad c_T = \begin{bmatrix} c_{x_T} \\ c_{u_T} \end{bmatrix}$$
$$\nabla_{u_T} Q(x_T,u_T) = C_{u_T,x_T} x_T + C_{u_T,u_T} u_T + c_{u_T}^{\top} = 0$$
$$u_T = -C_{u_T,u_T}^{-1}(C_{u_T,x_T}x_T+c_{u_T}) = K_T x_T + k_T$$
where $K_T = -C_{u_T,u_T}^{-1}C_{u_T,x_T}$ and $k_T = -C_{u_T,u_T}^{-1}c_{u_T}$. Since $u_T$ is fully determined by $x_T$, we can write our cost function as
$$\begin{split}
V(x_T) &= \text{const} + \frac{1}{2} \begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix}^{\top} C_T \begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix} + \begin{bmatrix} x_T \\ K_T x_T + k_T \end{bmatrix}^\top c_T \\
&= \text{const} + \frac{1}{2} x_T^\top V_T x_T + x_T^\top v_T
\end{split}$$
where
$$V_T = C_{x_T, x_T}+C_{x_T, u_T}K_T + K_T^\top C_{u_T,x_T} + K_T^\top C_{u_T,u_T}K_T$$
$$v_T = c_{x_T}+C_{x_T,u_T}k_T+K_T^\top c_{u_T} + K_T^\top C_{u_T, u_T}k_T$$
So what if we solve for $u_{T-1}$? ⇒ $u_{T-1}$ affects $x_T$!
$$f(x_{T-1},u_{T-1})=x_T = F_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + f_{T-1}$$
$$Q(x_{T-1},u_{T-1})=\text{const}+\frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top C_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^{\top} c_{T-1} + V(f(x_{T-1},u_{T-1}))$$
$$V(x_T)=\text{const} + \frac{1}{2} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top V_T F_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top V_T f_{T-1} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top F_{T-1}^\top v_T$$
Combining the two equations:
$$Q(x_{T-1},u_{T-1})=\text{const} + \frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top Q_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top q_{T-1}$$
$$Q_{T-1} = C_{T-1} + F_{T-1}^\top V_T F_{T-1}$$
$$q_{T-1} = c_{T-1} + F_{T-1}^\top V_T f_{T-1} + F_{T-1}^\top v_T$$
Now we can take the derivative:
$$\nabla_{u_{T-1}} Q(x_{T-1}, u_{T-1}) = Q_{u_{T-1},x_{T-1}}x_{T-1} + Q_{u_{T-1},u_{T-1}} u_{T-1} + q_{u_{T-1}}^\top = 0$$
$$u_{T-1} = K_{T-1}x_{T-1} + k_{T-1}$$
where
$$K_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} Q_{u_{T-1},x_{T-1}} \qquad k_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} q_{u_{T-1}}$$
So it looks like we can start from the last timestep and do a backward recursion!
After doing the backward recursion, we can compute $u_t$ and $x_t$ by a forward recursion.
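A sketch of the backward and forward recursions above in numpy, assuming the matrices $F_t, f_t, C_t, c_t$ are given as lists and that `n` is the state dimension (so block indices `[:n]` refer to $x$ and `[n:]` to $u$); this is a bare-bones illustration, not a robust implementation.

```python
import numpy as np

def lqr_backward(F, f, C, c, n):
    """Backward recursion: compute gains K_t, k_t from t = T down to 1."""
    T = len(F)
    Ks, ks = [None] * T, [None] * T
    V, v = np.zeros((n, n)), np.zeros(n)          # V_{T+1} = 0, v_{T+1} = 0
    for t in reversed(range(T)):
        Q = C[t] + F[t].T @ V @ F[t]              # Q_t = C_t + F_t^T V_{t+1} F_t
        q = c[t] + F[t].T @ (V @ f[t] + v)        # q_t = c_t + F_t^T V_{t+1} f_t + F_t^T v_{t+1}
        Qxx, Qxu, Qux, Quu = Q[:n, :n], Q[:n, n:], Q[n:, :n], Q[n:, n:]
        qx, qu = q[:n], q[n:]
        Ks[t] = -np.linalg.solve(Quu, Qux)        # K_t = -Q_{u,u}^{-1} Q_{u,x}
        ks[t] = -np.linalg.solve(Quu, qu)         # k_t = -Q_{u,u}^{-1} q_u
        # V_t, v_t for the next (earlier) timestep
        V = Qxx + Qxu @ Ks[t] + Ks[t].T @ Qux + Ks[t].T @ Quu @ Ks[t]
        v = qx + Qxu @ ks[t] + Ks[t].T @ qu + Ks[t].T @ Quu @ ks[t]
    return Ks, ks

def lqr_forward(F, f, Ks, ks, x1):
    """Forward recursion: u_t = K_t x_t + k_t, x_{t+1} = F_t [x_t; u_t] + f_t."""
    xs, us, x = [], [], x1
    for t in range(len(F)):
        u = Ks[t] @ x + ks[t]
        xs.append(x)
        us.append(u)
        x = F[t] @ np.concatenate([x, u]) + f[t]
    return xs, us
```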
LQR for Stochastic and Nonlinear Systems
Stochastic dynamics:
If
$$p(x_{t+1}\mid x_t,u_t) = \mathcal{N}\left(F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t,\ \Sigma_t\right)$$
then just choose actions according to the linear case (since the randomness is symmetric, the noise does not change the optimal controller).
Nonlinear case: Differential Dynamic Programming (DDP) / Iterative LQR (iLQR)
Can we approximate a nonlinear system as a linear-quadratic system (locally, near a reference point)?
$$f(x_t,u_t) \approx f(\hat{x}_t,\hat{u}_t)+\nabla_{x_t,u_t} f(\hat{x}_t,\hat{u}_t)\begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}$$
$$c(x_t,u_t) \approx c(\hat{x}_t, \hat{u}_t)+\nabla_{x_t,u_t} c(\hat{x}_t,\hat{u}_t)^\top \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}+\frac{1}{2} \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}^\top \nabla_{x_t, u_t}^2 c(\hat{x}_t, \hat{u}_t) \begin{bmatrix} x_t - \hat{x}_t \\ u_t - \hat{u}_t \end{bmatrix}$$
We will denote the deviations $x_t - \hat{x}_t$ and $u_t - \hat{u}_t$ as $\delta x_t$ and $\delta u_t$.
$$\bar{f}(\delta x_t, \delta u_t) = F_t \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}$$
$$\bar{c}(\delta x_t, \delta u_t) = \frac{1}{2} \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^\top C_t \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix} + \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^\top c_t$$
where $F_t = \nabla_{x_t,u_t} f(\hat{x}_t, \hat{u}_t)$, $C_t = \nabla_{x_t,u_t}^2 c(\hat{x}_t, \hat{u}_t)$, $c_t = \nabla_{x_t,u_t} c(\hat{x}_t, \hat{u}_t)$.
Now we can run LQR with dynamics $\bar{f}$ and cost $\bar{c}$ to find $\delta x_t$ and $\delta u_t$.
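In practice, $F_t$, $C_t$, and $c_t$ can be obtained by automatic differentiation of the nonlinear `dynamics(x, u)` and scalar `cost(x, u)` around the current trajectory; here is a sketch using JAX (the function names are assumptions, not a fixed API).

```python
import jax
import jax.numpy as jnp

def linearize(dynamics, cost, x_hat, u_hat):
    """Return F_t, C_t, c_t at the expansion point (x_hat, u_hat)."""
    # F_t: Jacobian of the dynamics w.r.t. (x, u), stacked side by side
    Jx, Ju = jax.jacobian(dynamics, argnums=(0, 1))(x_hat, u_hat)
    F = jnp.concatenate([Jx, Ju], axis=1)
    # c_t: gradient of the cost w.r.t. (x, u)
    gx, gu = jax.grad(cost, argnums=(0, 1))(x_hat, u_hat)
    c = jnp.concatenate([gx, gu])
    # C_t: Hessian of the cost w.r.t. (x, u), assembled as a block matrix
    (Hxx, Hxu), (Hux, Huu) = jax.hessian(cost, argnums=(0, 1))(x_hat, u_hat)
    C = jnp.block([[Hxx, Hxu], [Hux, Huu]])
    return F, C, c
```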
iLQR is basically an approximation of Newton's method for solving the control objective.
But we used a first-order dynamics approximation here; to get the full Newton's method, we need a second-order dynamics approximation (this is DDP).
But this may be a bad idea, because we always jump to the minimum of the approximation, and the approximation is only valid within a short range of the current trajectory. We can scale the $k_t$ term by a constant $\alpha$ in the forward pass ⇒ this controls how far we move from our starting trajectory (the smaller $\alpha$ is, the closer we stay to the original trajectory), and we can do a line search over $\alpha$. A sketch of one iLQR iteration with this $\alpha$ line search follows below.
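A sketch of one iLQR iteration with the $\alpha$ line search, reusing the hypothetical `linearize` and `lqr_backward` helpers sketched above (with $f_t = 0$, since the $\delta$-formulation has no offset); `dynamics` and `cost` are the true nonlinear functions, and everything here is illustrative rather than a tuned implementation.

```python
import numpy as np

def ilqr_iteration(dynamics, cost, x_traj, u_traj, n,
                   alphas=(1.0, 0.5, 0.25, 0.1, 0.05)):
    """One iLQR iteration: linearize around the current trajectory, run the LQR
    backward pass on (delta x, delta u), then forward-pass on the true dynamics
    with a line search over alpha (which scales k_t)."""
    T = len(u_traj)
    # 1. Compute F_t, C_t, c_t around the current trajectory (see linearize above)
    F, C, c = zip(*[linearize(dynamics, cost, x_traj[t], u_traj[t]) for t in range(T)])
    # 2. Backward pass; in the delta formulation the offset f_t is zero
    Ks, ks = lqr_backward(list(F), [np.zeros(n)] * T, list(C), list(c), n)
    # 3. Forward pass with the *true* nonlinear dynamics and an alpha line search
    best_u = u_traj
    best_J = sum(cost(x_traj[t], u_traj[t]) for t in range(T))
    for alpha in alphas:
        x, new_x, new_u = x_traj[0], [x_traj[0]], []
        for t in range(T):
            u = u_traj[t] + Ks[t] @ (x - x_traj[t]) + alpha * ks[t]
            new_u.append(u)
            x = dynamics(x, u)
            new_x.append(x)
        J = sum(cost(new_x[t], new_u[t]) for t in range(T))
        if J < best_J:
            best_u, best_J = new_u, J
    return best_u  # repeat the whole iteration until the trajectory converges
```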
Additional Reading
- Mayne, Jacobson. (1970). Differential Dynamic Programming. The original differential dynamic programming algorithm.
- Tassa, Erez, Todorov. (2012). Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization. Practical guide for implementing nonlinear iterative LQR.
- Levine, Abbeel. (2014). Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics. Probabilistic formulation and trust-region alternative to deterministic line search.