Supervised Learning of Policies (Behavior Cloning)
Imitation Learning
In practice, given enough data and training, behavior cloning can actually work.
“Tightrope walker scenario”: the policy has to follow exactly the same behavior as the demonstrations; once it drifts into states it has never seen, the model does not know what to do.
Three camera angles (left, center, right), each supervised to steer differently: the side cameras are labeled with corrective steering commands. This trick helped a lot.
In effect, the training data now contains scenarios that teach the model how to correct its own small mistakes.
Instead of being clever about $p_{\pi_\theta}(o_t)$, let’s be clever about $p_{data}(o_t)$ ⇒ do DAgger (Dataset Aggregation)
DAgger (Dataset Aggregation)
Idea: collect training data from $p_{\pi_\theta}(o_t)$ instead of $p_{data}(o_t)$
How: just run $\pi_\theta(a_t \mid o_t)$, but we need labels $a_t$
- 1. Train $\pi_\theta(a_t \mid o_t)$ on the human dataset $\mathcal{D} = \{o_1, a_1, \dots, o_N, a_N\}$
- 2. Run $\pi_\theta(a_t \mid o_t)$ to obtain $\mathcal{D}_\pi = \{o_1, \dots, o_M\}$
- 3. Ask humans to label $\mathcal{D}_\pi$ with actions $a_t$
- 4. Aggregate: $\mathcal{D} \leftarrow \mathcal{D} \cup \mathcal{D}_\pi$, then repeat from step 1 (a minimal code sketch of this loop follows the caveats below)
But:
- Most of the problem is in step 3 (the human labeling)
- Humans normally act with continuous feedback; it is unnatural for us to look at an isolated observation and output the optimal action
- What if the model does NOT drift?
- Need to mimic expert behavior very accurately
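A minimal sketch of the DAgger loop described above. The `policy.fit`, `run_policy`, and `expert_label` helpers are hypothetical stand-ins (not from the lecture) for the supervised learner, the environment rollout, and the human/expert labeling step:

```python
def dagger(policy, expert_label, run_policy, human_data, n_iters=10):
    """DAgger sketch: iteratively aggregate expert labels on states visited
    by the learned policy, so the training distribution matches the policy's
    own state distribution. All helpers are hypothetical stand-ins."""
    # D <- human demonstration dataset of (observation, action) pairs
    dataset = list(human_data)
    for _ in range(n_iters):
        # 1. train pi_theta(a_t | o_t) on the aggregated dataset
        policy.fit(dataset)
        # 2. run pi_theta to collect observations D_pi = {o_1, ..., o_M}
        observations = run_policy(policy)
        # 3. ask the expert/human to label D_pi with actions a_t
        labeled = [(o, expert_label(o)) for o in observations]
        # 4. aggregate: D <- D U D_pi
        dataset.extend(labeled)
    return policy
```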
Why might we fail to fit the expert?
- Non-Markovian Behavior
- It is unnatural for humans to act in a perfectly Markovian way; human behavior is more like $\pi_\theta(a_t \mid o_1, \dots, o_t)$ ⇒ based on previous observations
- Solve this by adding RNN / LSTM cells to the network (Fig. 1); a combined history + mixture-of-Gaussians sketch appears after this list
- May still work poorly, e.g. because of causal confusion (Figure 2)
- Multimodal Behavior (Multiple Modes/Peaks in real distribution)
- With discrete actions this is not an issue (a softmax can represent any multimodal distribution)
- But continuous actions would require exponentially many discrete bins for a softmax as the action dimension grows
- Solved by
- Output mixture of Gaussians
- $\pi(a_t \mid o_t) = \sum_i w_i \, \mathcal{N}(a_t; \mu_i, \Sigma_i)$, where the network outputs the weights $w_i$, means $\mu_i$, and covariances $\Sigma_i$
- Tradeoffs:
- Needs more output parameters
- Modeling high-dimensional actions this way is challenging (in theory the number of mixture elements needed rises exponentially with the action dimension)
- Latent variable models
- The output is still Gaussian
- In addition to feeding the image into the NN, we also feed a latent variable (e.g. drawn from a prior distribution) into the model.
- Conditional Variational Autoencoders (VAEs)
- Normalizing Flow/realNVP
- Stein Variational Gradient Descent
- Complex to train
- Autoregressive discretization
- Discretizes one dimension at a time with a neural-network trick ⇒ never incurs the exponential cost
- Add a softmax over bins for dimension 1, sample a discrete value for it, feed that sampled value into another network and softmax to sample dimension 2, and so on (see the sketch below)
- Have to pick bins
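A rough sketch of autoregressive discretization in PyTorch (an assumed framework; the notes do not prescribe one, and the architecture and names here are purely illustrative). One head per action dimension outputs a softmax over bins, and the value sampled for dimension d is fed into the head for dimension d+1, so the cost stays linear rather than exponential in the number of dimensions:

```python
import torch
import torch.nn as nn

class AutoregressiveDiscretePolicy(nn.Module):
    """Sample a continuous action one dimension at a time.  Each dimension
    gets a softmax over `n_bins` bins, conditioned on the observation
    features and the dimensions sampled so far (hypothetical architecture)."""

    def __init__(self, obs_dim, act_dim, n_bins=21, hidden=128, low=-1.0, high=1.0):
        super().__init__()
        self.act_dim, self.n_bins = act_dim, n_bins
        # bin centers used to turn a sampled bin index back into a value
        self.register_buffer("bin_values", torch.linspace(low, high, n_bins))
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # one head per action dimension; head d also sees the d values
        # sampled for the previous dimensions
        self.heads = nn.ModuleList(
            [nn.Linear(hidden + d, n_bins) for d in range(act_dim)]
        )

    def sample(self, obs):
        feats = self.encoder(obs)                          # (batch, hidden)
        sampled = []
        for d, head in enumerate(self.heads):
            prev = (torch.stack(sampled, dim=-1)
                    if sampled else feats.new_zeros(obs.shape[0], 0))
            logits = head(torch.cat([feats, prev], dim=-1))
            idx = torch.distributions.Categorical(logits=logits).sample()
            sampled.append(self.bin_values[idx])           # bin index -> value
        return torch.stack(sampled, dim=-1)                # (batch, act_dim)
```

Training would maximize the log-probability of the expert's binned action dimension by dimension in the same left-to-right order.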
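Tying the two failure modes above together, here is a rough PyTorch sketch (again an assumed framework, with illustrative layer sizes and names, not the course's code) of a policy that consumes an observation history with an LSTM and outputs a mixture of Gaussians over actions:

```python
import torch
import torch.nn as nn
import torch.distributions as D

class HistoryMixturePolicy(nn.Module):
    """pi(a_t | o_1, ..., o_t): an LSTM summarizes the observation history
    (non-Markovian behavior) and the output head is a mixture of Gaussians
    (multimodal behavior).  Illustrative sketch only."""

    def __init__(self, obs_dim, act_dim, n_components=5, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.logits = nn.Linear(hidden, n_components)               # weights w_i
        self.means = nn.Linear(hidden, n_components * act_dim)      # mu_i
        self.log_stds = nn.Linear(hidden, n_components * act_dim)   # log sigma_i
        self.n_components, self.act_dim = n_components, act_dim

    def dist(self, obs_history):
        # obs_history: (batch, time, obs_dim); use the last hidden state
        h, _ = self.lstm(obs_history)
        h = h[:, -1]
        mix = D.Categorical(logits=self.logits(h))
        means = self.means(h).view(-1, self.n_components, self.act_dim)
        stds = self.log_stds(h).view(-1, self.n_components, self.act_dim).exp()
        comp = D.Independent(D.Normal(means, stds), 1)
        return D.MixtureSameFamily(mix, comp)

    def loss(self, obs_history, expert_actions):
        # behavior cloning objective: negative log-likelihood of expert actions
        return -self.dist(obs_history).log_prob(expert_actions).mean()
```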
Questions:
- Does including history info (LSTM/RNN) mitigate causal confusion?
- My guess is no, since adding history does not by itself remove the spurious correlations that cause the confusion
- Can DAgger mitigate causal confusion?
- My guess is yes, since the states the model gets confused on will show up when running the policy, and that part of the data will then be manually labeled
What’s wrong with imitation learning:
- Humans need to provide data, which is limited
- DL works best when data is plentiful
- Humans are not good at providing some kinds of actions
- Humans can learn autonomously; can we do the same with machines?
- Unlimited data from the machine’s own experience
- Continuous self-improvement
Analysis of Imitation Learning
Ross et al., “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”
We assume: $\pi_\theta(a_t \neq \pi^\star(s_t) \mid s_t) \leq \epsilon$ for states in the training distribution, with per-step cost $c(s_t, a_t) = \mathbf{1}[a_t \neq \pi^\star(s_t)]$.
Then...
With DAgger: $p_{train}(s) \rightarrow p_{\pi_\theta}(s)$, so the training and test distributions match and $\mathbb{E}\big[\sum_t c(s_t, a_t)\big] \leq \epsilon T$ (linear in the horizon $T$).
Without DAgger:
- We cannot assume anything about $p_{\pi_\theta}(s_t)$ on states outside the training distribution.
- But we can bound the total variation divergence between the training and testing state distributions: writing $p_{\pi_\theta}(s_t) = (1-\epsilon)^t \, p_{train}(s_t) + \big(1 - (1-\epsilon)^t\big) \, p_{mistake}(s_t)$, we get $\big|p_{\pi_\theta}(s_t) - p_{train}(s_t)\big| \leq 2\big(1 - (1-\epsilon)^t\big)$.
- Since $(1-\epsilon)^t \geq 1 - \epsilon t$ for $\epsilon \in [0, 1]$, this is at most $2\epsilon t$.
So finally $\sum_t \mathbb{E}_{p_{\pi_\theta}(s_t)}[c(s_t, a_t)] \leq \sum_t (\epsilon + 2\epsilon t) = O(\epsilon T^2)$, i.e. without DAgger the expected cost grows quadratically in the horizon.
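A tiny numeric sanity check of the two bounds (the values of $\epsilon$ and $T$ are made up purely for illustration):

```python
# Compare the linear (DAgger) and quadratic (plain behavior cloning) bounds.
epsilon, T = 0.01, 1000                  # per-step mistake prob, horizon (illustrative)

dagger_bound = epsilon * T               # O(eps * T): labels come from p_pi_theta
bc_bound = sum(epsilon + 2 * epsilon * t for t in range(T))  # O(eps * T^2)

print(f"DAgger bound:           {dagger_bound:.1f}")   # 10.0
print(f"Behavior cloning bound: {bc_bound:.1f}")       # ~10000.0
```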
Another way to imitate (Goal-Conditioned Behavioral Cloning)
After clarification from class:
- Sometimes we have suboptimal “demonstrations” that lead to different end states
- Those demonstrations may be similar enough that we can train a shared goal-conditioned policy across those different end states (most of the earlier states can be handled by shared weights)
- At production / test time, specify the end state (goal) you want to achieve, and the policy predicts the actions to reach it (see the sketch below)
“Learning to Reach Goals via Iterated Supervised Learning” - Dibya Ghosh, Abhishek Gupta, et al.
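A rough sketch in the spirit of the iterated goal-conditioned supervised learning idea cited above. The `collect_trajectory`, `policy.fit`, and `policy.sample_goal` helpers are hypothetical stand-ins: states actually reached later in each trajectory are relabeled as goals, and the goal-conditioned policy is trained with ordinary supervised learning on (state, goal) → action pairs:

```python
import random

def relabel_and_clone(policy, collect_trajectory, n_iters=100, n_rollouts=10):
    """Iterated goal-conditioned behavior cloning (sketch).
    `collect_trajectory(policy, goal)` is a hypothetical helper returning a
    list of (state, action) pairs; `policy.fit` / `policy.sample_goal` are
    hypothetical too."""
    dataset = []  # (state, goal, action) tuples
    for _ in range(n_iters):
        for _ in range(n_rollouts):
            goal = policy.sample_goal()                 # commanded goal
            traj = collect_trajectory(policy, goal)     # [(s_0, a_0), ...]
            # hindsight relabeling: every state reached later in the same
            # trajectory is a goal that (s_t, a_t) made progress toward
            for t, (s_t, a_t) in enumerate(traj[:-1]):
                s_future, _ = random.choice(traj[t + 1:])
                dataset.append((s_t, s_future, a_t))
        # supervised learning of pi(a | s, g) on the relabeled data
        policy.fit(dataset)
    return policy
```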