Optimal Control and Planning Algorithms
Model-free methods completely discard the transition dynamics and work purely from sampled transitions (data points)
- But we often do know the dynamics
- Games
- Easily Modeled Systems
- Simulated Environments
- Or we can learn the dynamics
- System identification (fit unknown parameters of a known model)
- Learning (fit a general-purpose model to observed transition data)
Knowing the dynamics makes things easier
But how do we extract a policy from known dynamics?
Objective of optimal control:
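In the deterministic, finite-horizon case this can be written, for example, as

$$a_1,\dots,a_T = \arg\min_{a_1,\dots,a_T} \sum_{t=1}^{T} c(s_t,a_t) \quad \text{s.t.} \quad s_{t+1} = f(s_t,a_t)$$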
Open-loop Planning
This is sub-optimal because it commits to the whole action sequence up front ⇒ once we have executed some actions we will receive additional information, but the plan cannot take it into account.
Stochastic Optimization
Abstracts away optimal control/planning ⇒ just optimizes a function
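That is, we treat the entire action sequence as a single decision variable $A$ and only assume that $J(A)$ can be evaluated (e.g. by rolling out the known model):

$$A^\star = \arg\max_{A} J(A), \qquad A = (a_1,\dots,a_T)$$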
Simplest Method: Guess and Check ("random shooting" method)
- pick $A_1,\dots,A_N$ from some distribution (e.g. uniform)
- Choose $A_i$ based on $\arg\max_i J(A_i)$ (a minimal sketch follows below)
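A minimal sketch of random shooting for a known deterministic model; the `dynamics(x, u) -> x_next` and `reward(x, u)` interfaces are assumptions of this example, not anything fixed by the notes:

```python
import numpy as np

def random_shooting(x0, dynamics, reward, horizon, action_dim,
                    n_samples=1000, low=-1.0, high=1.0):
    """Pick A_1..A_N uniformly at random, evaluate J(A_i) by rolling out the
    (assumed known, deterministic) model, and return argmax_i J(A_i)."""
    def J(A):                          # return of one candidate action sequence
        x, total = x0, 0.0
        for u in A:
            total += reward(x, u)
            x = dynamics(x, u)
        return total

    candidates = np.random.uniform(low, high, size=(n_samples, horizon, action_dim))
    returns = np.array([J(A) for A in candidates])
    return candidates[np.argmax(returns)]   # best open-loop plan found
```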
Cross-Entropy Method (CEM)
Iterative improvement of the “random shooting method”
- sample $A_1,\dots,A_N$ from $p(A)$ (typically start from a Gaussian distribution)
- Evaluate $J(A_1),\dots,J(A_N)$
- Pick the elites $A_{i_1},\dots,A_{i_M}$ with the highest value ($M < N$)
- Refit $p(A)$ to the elites $A_{i_1},\dots,A_{i_M}$
- Repeat this process (see the sketch below)
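A sketch of CEM over flattened action sequences; the evaluation function `J(A)` (e.g. a rollout of the known model) and the diagonal-Gaussian parameterization of $p(A)$ are assumptions of this example:

```python
import numpy as np

def cross_entropy_method(J, horizon, action_dim,
                         n_samples=500, n_elites=50, n_iters=10):
    """Iteratively refit a diagonal Gaussian p(A) to the elite action sequences."""
    dim = horizon * action_dim
    mean, std = np.zeros(dim), np.ones(dim)        # start from a unit Gaussian
    for _ in range(n_iters):
        # sample A_1..A_N from the current p(A)
        samples = mean + std * np.random.randn(n_samples, dim)
        values = np.array([J(A.reshape(horizon, action_dim)) for A in samples])
        # keep the M elites with the highest value and refit p(A) to them
        elites = samples[np.argsort(values)[-n_elites:]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean.reshape(horizon, action_dim)       # final plan: the mean of p(A)
```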
See also CMA-ES ⇒ CEM with momentum
- Very fast if parallelized
- Extremely Simple
- But very harsh dimensionality limit
- Only open-loop planning
Closed-loop Planning
Monte-Carlo Tree Search (MCTS)
Discrete planning can be viewed as a tree search problem, but with vanilla tree search the complexity is exponential in the number of time steps.
So can we restrict the search without expanding the whole tree?
The idea is to stop expanding at some depth and, from that state, approximate the outcomes that come after using a baseline policy.
Which path do we search first?
Choose nodes with best reward, but also prefer rarely visited nodes
So the algorithm:
- Find a leaf $s_l$ using TreePolicy($s_1$)
- Evaluate the leaf using DefaultPolicy($s_l$)
- To arrive at the leaf, start from the root and execute the actions stored in the tree; due to randomness we may end up in a new leaf
- Update all values in the tree between $s_1$ and $s_l$
- Repeat this iteration, then choose the best action from $s_1$ (a sketch of this loop follows below)
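Below is a compact sketch of this loop for a known deterministic model with a discrete action set; the `model(s, a) -> (next_state, reward)` interface, the fixed `horizon`, and the uniformly random rollout used as DefaultPolicy are all assumptions of the example:

```python
import math
import random

class Node:
    """A state node; `reward` is the reward on the edge from its parent."""
    def __init__(self, state, depth, parent=None, action=None, reward=0.0):
        self.state, self.depth = state, depth
        self.parent, self.action, self.reward = parent, action, reward
        self.children = []
        self.visits, self.value = 0, 0.0

def uct_score(node, c=1.0):
    # exploit (average return) + explore (bonus for rarely visited children)
    return (node.value / node.visits
            + c * math.sqrt(2.0 * math.log(node.parent.visits) / node.visits))

def mcts(root_state, model, actions, horizon, n_iters=500):
    """MCTS with UCT for a known deterministic model.
    `model(s, a) -> (next_state, reward)` and `actions` are assumed interfaces."""
    root = Node(root_state, depth=0)
    for _ in range(n_iters):
        node, ret = root, 0.0
        # 1) TreePolicy: descend, expanding the first not-fully-expanded node
        while node.depth < horizon:
            if len(node.children) < len(actions):        # expand an untried action
                a = actions[len(node.children)]
                next_state, r = model(node.state, a)
                child = Node(next_state, node.depth + 1, node, a, r)
                node.children.append(child)
                node, ret = child, ret + r
                break
            node = max(node.children, key=uct_score)     # UCT selection
            ret += node.reward
        # 2) DefaultPolicy: random rollout from the leaf to the horizon
        s, d = node.state, node.depth
        while d < horizon:
            s, r = model(s, random.choice(actions))
            ret, d = ret + r, d + 1
        # 3) Backup: update the statistics of every node between the leaf and the root
        while node is not None:
            node.visits += 1
            node.value += ret
            node = node.parent
    # commit to the root action with the best average return
    return max(root.children, key=lambda ch: ch.value / ch.visits).action
```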
UCT TreePolicy($s_t$)
- If $s_t$ is not fully expanded, choose a new $a_t$; else choose the child with the best Score($s_{t+1}$)
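One standard choice for Score is the UCB1-style rule used by UCT, e.g.

$$\text{Score}(s_{t+1}) = \frac{Q(s_{t+1})}{N(s_{t+1})} + 2C\sqrt{\frac{2\ln N(s_t)}{N(s_{t+1})}}$$

where $Q$ is the sum of returns backed up through a node, $N$ is its visit count, and $C$ trades off exploitation against exploration.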
Browne, Powley, Whitehouse, Lucas, Cowling, Rohlfshagen, Tavener, Perez, Samothrakis, Colton. (2012). A Survey of Monte Carlo Tree Search Methods.
Shooting Methods vs. Collocation
- Shooting method: optimize over actions only
- Earlier actions have larger effects on the trajectory
- Collocation method: optimize over actions and states, with constraints (both formulations are written out below)
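In symbols (using the $x_t$/$u_t$ notation introduced below), the two formulations are roughly:

Shooting:

$$\min_{u_1,\dots,u_T} \; c(x_1,u_1) + c(f(x_1,u_1),u_2) + \dots + c(f(f(\dots)\dots),u_T)$$

Collocation:

$$\min_{u_1,\dots,u_T,\,x_1,\dots,x_T} \; \sum_{t=1}^{T} c(x_t,u_t) \quad \text{s.t.} \quad x_t = f(x_{t-1},u_{t-1})$$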
Trajectory Optimization
Because this is a trajectory optimization lecture, we will use $x_t$ for the state and $u_t$ for the action
If we want to use gradients (backpropagating through the dynamics and cost), it usually helps to use a second-order method!
Linear Case: Linear Quadratic Regulator (LQR)
Objective:

$$\min_{u_1,\dots,u_T} \; c(x_1,u_1) + c(f(x_1,u_1),u_2) + \dots + c(f(f(\dots)\dots),u_T)$$

We will assume that the dynamics are linear and the cost is quadratic:

$$f(x_t,u_t) = F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t, \qquad c(x_t,u_t) = \frac{1}{2}\begin{bmatrix} x_t \\ u_t \end{bmatrix}^\top C_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + \begin{bmatrix} x_t \\ u_t \end{bmatrix}^\top c_t$$
Base case: solve for $u_T$ only
We can break $C_T$ and $c_T$ into the pieces that affect $u_T$ and the pieces that affect $x_T$:

$$C_T = \begin{bmatrix} C_{x_T,x_T} & C_{x_T,u_T} \\ C_{u_T,x_T} & C_{u_T,u_T} \end{bmatrix}, \qquad c_T = \begin{bmatrix} c_{x_T} \\ c_{u_T} \end{bmatrix}$$

(Note that here we assumed $C_{u_T,x_T} = C_{x_T,u_T}^\top$.)

Setting the gradient of the last-step cost $Q(x_T,u_T) = \text{const} + \frac{1}{2}\begin{bmatrix} x_T \\ u_T \end{bmatrix}^\top C_T \begin{bmatrix} x_T \\ u_T \end{bmatrix} + \begin{bmatrix} x_T \\ u_T \end{bmatrix}^\top c_T$ with respect to $u_T$ to zero gives

$$u_T = K_T x_T + k_T, \qquad K_T = -C_{u_T,u_T}^{-1} C_{u_T,x_T}, \qquad k_T = -C_{u_T,u_T}^{-1} c_{u_T}$$

Since $u_T$ is fully determined by $x_T$, we can write the resulting cost as a function of $x_T$ alone:

$$V(x_T) = \text{const} + \frac{1}{2} x_T^\top V_T x_T + x_T^\top v_T$$

where

$$V_T = C_{x_T,x_T} + C_{x_T,u_T} K_T + K_T^\top C_{u_T,x_T} + K_T^\top C_{u_T,u_T} K_T$$

$$v_T = c_{x_T} + C_{x_T,u_T} k_T + K_T^\top c_{u_T} + K_T^\top C_{u_T,u_T} k_T$$
So what if we solve for $u_{T-1}$? ⇒ $u_{T-1}$ affects $x_T$!

$$x_T = f(x_{T-1},u_{T-1}) = F_{T-1}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + f_{T-1}$$

Combining the two equations (plugging the dynamics into $V(x_T)$ and adding the cost at $T-1$):

$$Q(x_{T-1},u_{T-1}) = \text{const} + \frac{1}{2}\begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top Q_{T-1} \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix} + \begin{bmatrix} x_{T-1} \\ u_{T-1} \end{bmatrix}^\top q_{T-1}$$

$$Q_{T-1} = C_{T-1} + F_{T-1}^\top V_T F_{T-1}, \qquad q_{T-1} = c_{T-1} + F_{T-1}^\top V_T f_{T-1} + F_{T-1}^\top v_T$$

Now we can derive the derivative and set it to zero:

$$\nabla_{u_{T-1}} Q(x_{T-1},u_{T-1}) = Q_{u_{T-1},x_{T-1}} x_{T-1} + Q_{u_{T-1},u_{T-1}} u_{T-1} + q_{u_{T-1}} = 0$$

$$u_{T-1} = K_{T-1} x_{T-1} + k_{T-1}$$

where

$$K_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} Q_{u_{T-1},x_{T-1}}, \qquad k_{T-1} = -Q_{u_{T-1},u_{T-1}}^{-1} q_{u_{T-1}}$$
So it looks like we can start from the last timestep and do a backward recursion!
After doing the backward recursion, we can compute our $u_t$ and $x_t$ by forward recursion
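A minimal NumPy sketch of these two recursions under the linear-quadratic assumptions above; the list-of-arrays interface and the 0-based time indexing are assumptions of this example:

```python
import numpy as np

def lqr(F, f, C, c, x1):
    """Sketch of the LQR backward/forward recursions (assumed interface):
    dynamics x_{t+1} = F[t] @ [x_t; u_t] + f[t], cost 1/2 z^T C[t] z + z^T c[t]
    with z = [x_t; u_t]; all lists are 0-indexed over t = 0..T-1."""
    T = len(C)
    n_x = x1.shape[0]
    K, k = [None] * T, [None] * T
    V = v = None
    # backward recursion over t = T-1, ..., 0
    for t in reversed(range(T)):
        if t == T - 1:                         # base case: no future cost-to-go
            Q_mat, q_vec = C[t], c[t]
        else:                                  # fold in V_{t+1}, v_{t+1}
            Q_mat = C[t] + F[t].T @ V @ F[t]
            q_vec = c[t] + F[t].T @ (V @ f[t] + v)
        Qxx, Qxu = Q_mat[:n_x, :n_x], Q_mat[:n_x, n_x:]
        Qux, Quu = Q_mat[n_x:, :n_x], Q_mat[n_x:, n_x:]
        qx, qu = q_vec[:n_x], q_vec[n_x:]
        K[t] = -np.linalg.solve(Quu, Qux)      # optimal u_t = K_t x_t + k_t
        k[t] = -np.linalg.solve(Quu, qu)
        V = Qxx + Qxu @ K[t] + K[t].T @ Qux + K[t].T @ Quu @ K[t]
        v = qx + Qxu @ k[t] + K[t].T @ qu + K[t].T @ Quu @ k[t]
    # forward recursion: roll the linear dynamics out with the computed gains
    xs, us = [x1], []
    for t in range(T):
        us.append(K[t] @ xs[t] + k[t])
        if t < T - 1:
            xs.append(F[t] @ np.concatenate([xs[t], us[t]]) + f[t])
    return xs, us
```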
LQR for Stochastic and Nonlinear Systems
Stochastic dynamics: $x_{t+1} \sim p(x_{t+1} \mid x_t, u_t)$
If $p(x_{t+1} \mid x_t, u_t) = \mathcal{N}\!\left(F_t \begin{bmatrix} x_t \\ u_t \end{bmatrix} + f_t,\ \Sigma_t\right)$
Then just choose actions according to the linear case (since the randomness is symmetric, the solution $u_t = K_t x_t + k_t$ is unchanged)
Nonlinear case - Differential Dynamic Programming (DDP) / Iterative LQR (iLQR)
Can we approximate a nonlinear system as a linear-quadratic system near a nominal trajectory $\hat{x}_t, \hat{u}_t$? Taylor-expand the dynamics to first order and the cost to second order around $(\hat{x}_t, \hat{u}_t)$.
We will denote the change in $x_t, u_t$ as $\delta x_t = x_t - \hat{x}_t$, $\delta u_t = u_t - \hat{u}_t$, which gives the linear-quadratic system

$$\bar{f}(\delta x_t, \delta u_t) = F_t\begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}, \qquad \bar{c}(\delta x_t, \delta u_t) = \frac{1}{2}\begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^\top C_t \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix} + \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^\top c_t$$

where $F_t = \nabla_{x_t,u_t} f(\hat{x}_t,\hat{u}_t)$, $C_t = \nabla^2_{x_t,u_t} c(\hat{x}_t,\hat{u}_t)$, $c_t = \nabla_{x_t,u_t} c(\hat{x}_t,\hat{u}_t)$.
Now we can use LQR with dynamics $\bar{f}$ and cost $\bar{c}$ to find $\delta x_t$ and $\delta u_t$, and then update the nominal trajectory
iLQR is basically just an approximation of Newton's method for solving the control objective.
But we used a first-order dynamics approximation here; to get the full Newton's method, we need to use a second-order dynamics approximation (DDP)
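For reference, Newton's method for minimizing a generic function $g(x)$ computes $\nabla_x g(\hat{x})$ and $\nabla_x^2 g(\hat{x})$ at the current iterate $\hat{x}$ and repeats

$$\hat{x} \leftarrow \arg\min_{x} \; \frac{1}{2}(x-\hat{x})^\top \nabla_x^2 g(\hat{x})\,(x-\hat{x}) + \nabla_x g(\hat{x})^\top (x-\hat{x})$$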
Additional Reading
- Mayne, Jacobson. (1970). Differential dynamic programming.
- Original differential dynamic programming algorithm.
- Tassa, Erez, Todorov. (2012). Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization.
- Practical guide for implementing non-linear iterative LQR.
- Levine, Abbeel. (2014). Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics.
- Probabilistic formulation and trust region alternative to deterministic line search.