Open problems in RL
Challenges with core algorithms:
- Stability: does your algorithm converge?
- Efficiency: how long does it take to converge? (how many samples)
- Generalization: after it converges, does it generalize?
Challenges with assumptions:
- Is this even the right problem formulation?
- What is the source of supervision?
Stability
- Devising stable RL algorithms is very hard
- Q-learning/value function estimation
- Fitted Q/fitted value methods with deep network function estimators are typically not contractions, hence no guarantee of convergence
- Lots of parameters for stability:
- target network delay, replay buffer size, clipping, sensitivity to learning rates, etc. (see the sketch after this list)
- Policy gradient/likelihood ratio/REINFORCE
- Very high variance gradient estimator
- Lots of samples, complex baselines, etc.
- Parameters: batch size, learning rate, design of baseline
- Model-based RL algorithms
- Model class and fitting method
- Optimizing policy w.r.t. model non-trivial due to backpropagation through time
- More subtle issue: the policy tends to exploit errors in the learned model
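
Below is a minimal, illustrative fitted-Q sketch (not the lecture's implementation) showing where the stability knobs listed above enter: target-network delay, replay buffer size, clipping, and the learning rate. The network sizes, hyperparameter values, and the synthetic transitions are assumptions chosen only to make the sketch self-contained and runnable.

```python
# Minimal fitted-Q sketch: synthetic data, illustrative hyperparameters.
import copy
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)            # delayed copy: keeps the regression target from moving every step
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)   # learning rate: a very sensitive knob
buffer = deque(maxlen=100_000)               # replay buffer size: decorrelation vs. staleness trade-off
gamma, target_period, grad_clip = 0.99, 100, 10.0

# Fill the buffer with random toy transitions so the sketch runs end to end.
for _ in range(1000):
    s, s2 = torch.randn(obs_dim), torch.randn(obs_dim)
    a, r, d = random.randrange(n_actions), random.random(), float(random.random() < 0.05)
    buffer.append((s, a, r, s2, d))

for step in range(500):
    batch = random.sample(buffer, 64)
    s, a, r, s2, d = (torch.stack([torch.as_tensor(x[i], dtype=torch.float32) for x in batch])
                      for i in range(5))
    with torch.no_grad():                                    # bootstrap from the *delayed* target network
        y = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.smooth_l1_loss(q, y)                # Huber loss: one common form of clipping
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(q_net.parameters(), grad_clip)  # gradient clipping
    optimizer.step()
    if step % target_period == 0:                            # target network delay
        target_net.load_state_dict(q_net.state_dict())
```

In practice each of these knobs interacts with the others, which is part of why tuning deep value-based methods is hard.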
Sample Efficiency
- Need to wait a long time for your homework to finish running
- Real-world learning becomes difficult or impractical
- Precludes the use of expensive, high-fidelity simulators
- Limits applicability to real-world problems
Scaling up deep RL & Generalization
Supervised Learning (with ImageNet):
- Large-scale
- Emphasizes diversity
- Evaluated on generalization
RL:
- Small-scale
- Emphasizes mastery
- Evaluated on performance
Supervision
Where does supervision come from?
- For Atari games, it is natural for supervision (reward) to come from the game score
- But there are other forms of supervision besides rewards that we can potentially incorporate
- Demonstrations
- Mülling et al. (2013). Learning to Select and Generalize Striking Movements in Robot Table Tennis
- Language
- Andreas et al. (2018). Learning with latent language
- Human preferences
- Christiano et al. (2017). Deep reinforcement learning from human preferences
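
As a concrete illustration of the preference-based supervision above, here is a minimal sketch (assumed MLP reward model, synthetic segment pairs and labels, hypothetical names) of the core idea: fit a reward model so that the segment the human preferred gets the higher predicted return, using a Bradley-Terry / logistic loss, then hand the learned reward to a standard RL algorithm.

```python
# Minimal preference-based reward learning sketch: synthetic comparisons, illustrative sizes.
import random

import torch
import torch.nn as nn

obs_dim, T = 4, 20
reward_model = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each comparison: two trajectory segments (T x obs_dim) and a label
# (1.0 if the human preferred segment A, 0.0 if they preferred segment B).
comparisons = [(torch.randn(T, obs_dim), torch.randn(T, obs_dim),
                torch.tensor(float(random.random() < 0.5))) for _ in range(256)]

for epoch in range(10):
    for seg_a, seg_b, label in comparisons:
        ret_a = reward_model(seg_a).sum()    # predicted return of segment A
        ret_b = reward_model(seg_b).sum()    # predicted return of segment B
        # Bradley-Terry model: P(A preferred) = sigmoid(ret_a - ret_b)
        loss = nn.functional.binary_cross_entropy_with_logits(ret_a - ret_b, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# The learned reward_model is then used as the reward signal for an ordinary RL loop.
```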
Rethinking the problem formulation
- How should we define a control problem?
- What is the data?
- What is the goal?
- What is the supervision?
- may not be the same as the goal...
- Think about the assumptions that fit your problem setting!
- Don’t assume the basic RL problem is set in stone
Some perspectives
- RL as an Engineering Tool
- “Anything you can simulate you can control”
- Before: characterize, simulate, control
- Now: characterize, simulate, run RL
- Powerful inversion engine ⇒ but still needs to simulate!
- RL as the Real World
- How do we engineer a system that can deal with the unexpected?
- Minimal external supervision about what to do
- Unexpected situations that require adaptation
- Must discover solutions autonomously
- Humans are extremely good at this; current AI systems are extremely bad at it
- RL in theory can do this, and nothing else can
- But we rarely study this kind of setting in RL research
- “easy universe”
- Success = high reward
- closed world, rules known
- lots of simulation
- “Can RL algorithms optimize really well?”
- “hard universe”
- Success = survival (good enough control)
- open world, everything must come from data
- no simulation (rules unknown)
- “Can RL generalize and adapt?”
- Some questions that come up
- How do we tell RL algorithms what we want them to do?
- How can we learn fully autonomously in continual environments?
- How to remain robust as the world changes around us?
- What is the right way to generalize using experience & prior data?
- What’s the right way to bootstrap exploration with prior experience?
- Can we run fully autonomously?
- RL as “Universal” Learning
- "We need machine learning for one reason and one reason only – that’s to produce adaptable and complex decisions.”
- Can we learn from offline data without well-defined tasks?
- Inspired by large language models learning from internet texts
- self-supervised learning
- Chebotar, Hausman, Lu, Xiao, Kalashnikov, Varley, Irpan, Eysenbach, Julian, Finn, Levine (2021). Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills.
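
One core mechanism behind this line of work is hindsight goal relabeling: unlabeled offline trajectories are turned into goal-conditioned transitions with self-supervised rewards, without any externally defined task. The sketch below illustrates that idea only; the data format and function names are hypothetical, not the paper's interface.

```python
# Hindsight goal relabeling sketch: hypothetical data format, toy trajectory.
import random

import torch

def relabel(trajectory, num_goals=4):
    """trajectory: list of (obs, action, next_obs) tensors from unlabeled offline data."""
    relabeled = []
    for t, (obs, act, next_obs) in enumerate(trajectory):
        future = trajectory[t:]                          # goals come from states actually reached later
        for _ in range(num_goals):
            goal = random.choice(future)[2]              # a future next_obs serves as the goal
            reached = torch.allclose(next_obs, goal)     # did this step reach the relabeled goal?
            reward = 1.0 if reached else 0.0             # sparse, self-supervised reward
            relabeled.append((obs, act, next_obs, goal, reward))
    return relabeled

# Usage on a toy trajectory of random states and actions:
obs_dim, act_dim, T = 4, 2, 10
traj = [(torch.randn(obs_dim), torch.randn(act_dim), torch.randn(obs_dim)) for _ in range(T)]
dataset = relabel(traj)   # feed into any goal-conditioned offline RL / Q-learning method
```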