Open problems in RL
Challenges with core algorithms:
- Stability: does your algorithm converge?
- Efficiency: how long does it take to converge? (how many samples)
- Generalization: after it converges, does it generalize?
Challenges with assumptions:
- Is this even the right problem formulation?
- What is the source of supervision?
Stability
- Devising stable RL algorithms is very hard
- Q-learning/value function estimation
- Fitted Q/fitted value methods with deep network function estimators are typically not contractions, hence no guarantee of convergence
- Lots of parameters for stability:
- target network delay, replay buffer size, clipping, sensitivity to learning rates, etc. (see the sketch after this list)
- Policy gradient/likelihood ratio/REINFORCE
- Very high variance gradient estimator
- Lots of samples, complex baselines, etc.
- Parameters: batch size, learning rate, design of baseline
- Model-based RL algorithms
- Model class and fitting method
- Optimizing policy w.r.t. model non-trivial due to backpropagation through time
- More subtle issue: the policy tends to exploit errors in the learned model
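
Below is a minimal, illustrative fitted-Q sketch (not the lecture's implementation) showing where the stability knobs listed above enter: target-network delay, replay buffer size, clipping, and the learning rate. The network sizes, hyperparameter values, and the synthetic transitions are assumptions chosen only to make the sketch self-contained and runnable.

```python
# Minimal fitted-Q sketch: synthetic data, illustrative hyperparameters.
import copy
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)            # delayed copy: keeps the regression target from moving every step
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)   # learning rate: a very sensitive knob
buffer = deque(maxlen=100_000)               # replay buffer size: decorrelation vs. staleness trade-off
gamma, target_period, grad_clip = 0.99, 100, 10.0

# Fill the buffer with random toy transitions so the sketch runs end to end.
for _ in range(1000):
    s, s2 = torch.randn(obs_dim), torch.randn(obs_dim)
    a, r, d = random.randrange(n_actions), random.random(), float(random.random() < 0.05)
    buffer.append((s, a, r, s2, d))

for step in range(500):
    batch = random.sample(buffer, 64)
    s, a, r, s2, d = (torch.stack([torch.as_tensor(x[i], dtype=torch.float32) for x in batch])
                      for i in range(5))
    with torch.no_grad():                                    # bootstrap from the *delayed* target network
        y = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q, y)                # Huber loss: one common form of clipping
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(q_net.parameters(), grad_clip)  # gradient clipping
    optimizer.step()
    if step % target_period == 0:                            # target network delay
        target_net.load_state_dict(q_net.state_dict())
```

In practice each of these knobs interacts with the others, which is part of why tuning deep value-based methods is hard.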
Sample Efficiency
- Need to wait a long time for your homework to finish running
- Real-world learning becomes difficult or impractical
- Precludes the use of expensive, high-fidelity simulators
- Limits applicability to real-world problems
Scaling up deep RL & Generalization
Supervised Learning (with ImageNet):
- Large-scale
- Emphasizes diversity
- Evaluated on generalization
RL:
- Small-scale
- Emphasizes mastery
- Evaluated on performance
Supervision
Where does supervision come from?
- For Atari games, it is natural for supervision (reward) to come from the game score
- But there are other forms of supervision besides rewards that we can potentially incorporate
- Demonstrations
- Mülling et al. (2013). Learning to Select and Generalize Striking Movements in Robot Table Tennis
- Language
- Andreas et al. (2018). Learning with latent language
- Human preferences
- Christiano et al. (2017). Deep reinforcement learning from human preferences
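
As a concrete illustration of the preference-based supervision above, here is a minimal sketch (assumed MLP reward model, synthetic segment pairs and labels, hypothetical names) of the core idea: fit a reward model so that the segment the human preferred gets the higher predicted return, using a Bradley-Terry / logistic loss, then hand the learned reward to a standard RL algorithm.

```python
# Minimal preference-based reward learning sketch: synthetic comparisons, illustrative sizes.
import random

import torch
import torch.nn as nn

obs_dim, T = 4, 20
reward_model = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each comparison: two trajectory segments (T x obs_dim) and a label
# (1.0 if the human preferred segment A, 0.0 if they preferred segment B).
comparisons = [(torch.randn(T, obs_dim), torch.randn(T, obs_dim),
                torch.tensor(float(random.random() < 0.5))) for _ in range(256)]

for epoch in range(10):
    for seg_a, seg_b, label in comparisons:
        ret_a = reward_model(seg_a).sum()    # predicted return of segment A
        ret_b = reward_model(seg_b).sum()    # predicted return of segment B
        # Bradley-Terry model: P(A preferred) = sigmoid(ret_a - ret_b)
        loss = nn.functional.binary_cross_entropy_with_logits(ret_a - ret_b, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# The learned reward_model is then used as the reward signal for an ordinary RL loop.
```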
Rethinking the problem formulation
- How should we define a control problem?
- What is the data?
- What is the goal?
- What is the supervision?
- may not be the same as the goal...
- Think about the assumptions that fit your problem setting!
- Don’t assume the basic RL problem is set in stone
Some perspectives
- RL as an Engineering Tool
- “Anything you can simulate you can control”
- Before: characterize, simulate, control
- Now: characterize, simulate, run RL
- Powerful inversion engine ⇒ but still needs to simulate!
- RL as the Real World
- How do we engineer a system that can deal with the unexpected?
- Minimal external supervision about what to do
- Unexpected situations that require adaptation
- Must discover solutions autonomously
- Humans are extremely good at this; current AI systems are extremely bad at it
- RL in theory can do this, and nothing else can
- But we rarely study this kind of setting in RL research
- “easy universe”
- Success = high reward
- closed world, rules known
- lots of simulation
- “Can RL algorithms optimize really well?”
- “hard universe”
- Success = survival (good enough control)
- open world, everything must come from data
- no simulation (rules unknown)
- “Can RL generalize and adapt?”
- Some questions that come up
- How do we tell RL algorithms what we want them to do?
- How can we learn fully autonomously in continual environments?
- How to remain robust as the world changes around us?
- What is the right way to generalize using experience & prior data?
- What’s the right way to bootstrap exploration with prior experience?
- Can we run fully autonomously?
- RL as “Universal” Learning
- "We need machine learning for one reason and one reason only – that’s to produce adaptable and complex decisions.”
- Can we learn from offline data without well-defined tasks?
- Inspired by large language models learning from internet texts
- self-supervised learning
- Chebotar, Hausman, Lu, Xiao, Kalashnikov, Varley, Irpan, Eysenbach, Julian, Finn, Levine (2021). Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills.
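
One core mechanism behind this line of work is hindsight goal relabeling: unlabeled offline trajectories are turned into goal-conditioned transitions with self-supervised rewards, without any externally defined task. The sketch below illustrates that idea only; the data format and function names are hypothetical, not the paper's interface.

```python
# Hindsight goal relabeling sketch: hypothetical data format, toy trajectory.
import random

import torch

def relabel(trajectory, num_goals=4):
    """trajectory: list of (obs, action, next_obs) tensors from unlabeled offline data."""
    relabeled = []
    for t, (obs, act, next_obs) in enumerate(trajectory):
        future = trajectory[t:]                          # goals come from states actually reached later
        for _ in range(num_goals):
            goal = random.choice(future)[2]              # a future next_obs serves as the goal
            reached = torch.allclose(next_obs, goal)     # did this step reach the relabeled goal?
            reward = 1.0 if reached else 0.0             # sparse, self-supervised reward
            relabeled.append((obs, act, next_obs, goal, reward))
    return relabeled

# Usage on a toy trajectory of random states and actions:
obs_dim, act_dim, T = 4, 2, 10
traj = [(torch.randn(obs_dim), torch.randn(act_dim), torch.randn(obs_dim)) for _ in range(T)]
dataset = relabel(traj)   # feed into any goal-conditioned offline RL / Q-learning method
```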