Prove that our RL algorithms will work perfectly every time
Usually not possible with current deep RL methods, which are often not even guaranteed to converge
Understand how errors are affected by problem parameters
Do larger discounts work better than smaller ones?
If we want half the error, do we need 2x the samples, 4x, or something else?
Usually we use precise theory to get imprecise qualitative conclusions about how various factors influence the performance of RL algorithms under strong assumptions, and we try to make the assumptions reasonable enough that these conclusions are likely to apply to real problems (though they are not guaranteed to)
Exploration
Analyzing the performance of RL is greatly complicated by exploration - how likely are we to find potentially sparse rewards?
Theoretical guarantees typically address worst case performance ⇒ but worst case exploration is extremely hard
If we “abstract away” exploration, how many samples do we need to effectively learn a model or value function that results in good performance?
“generative model” assumption: assume we can sample from $P(s' \mid s,a)$ for any $(s,a)$
“oracle exploration” assumption: for every $(s,a)$, sample $s' \sim P(s' \mid s,a)$ $N$ times
Basic Sample Complexity Analysis
Reference: *Reinforcement Learning: Theory and Algorithms*, Agarwal, Jiang, Kakade, Sun.
“Oracle Exploration”: for every $(s,a)$, sample $s' \sim P(s' \mid s,a)$ $N$ times
Simple “model-based” algorithm
Count the transitions: $\hat{P}(s' \mid s,a) = \frac{\#(s,a,s')}{N}$
Given $\pi$, use $\hat{P}$ to estimate $\hat{Q}^\pi$
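As a concrete illustration, here is a minimal NumPy sketch of this pipeline (the MDP, sizes, and constants are all made up for the demo, not from the lecture): draw $N$ next-state samples per $(s,a)$ from a generative model and form the count-based estimate $\hat{P}$.

```python
import numpy as np

# Oracle exploration + count-based model estimate (illustrative sketch).
rng = np.random.default_rng(0)
S, A, N = 5, 2, 1000            # |S|, |A|, samples per (s, a)

# Ground-truth MDP dynamics (random, just for the demo)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over s'

# For every (s, a), sample s' ~ P(. | s, a) exactly N times, then count:
# P_hat(s' | s, a) = #(s, a, s') / N
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        samples = rng.choice(S, size=N, p=P[s, a])
        P_hat[s, a] = np.bincount(samples, minlength=S) / N

# Worst-case L1 model error over (s, a); shrinks like O(sqrt(|S| / N))
print("max L1 error:", np.abs(P_hat - P).sum(axis=-1).max())
```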
We want to measure worst case performance
How close is $\hat{Q}^\pi$ to $Q^\pi$?
How close is $\hat{Q}^*$ (the optimal Q-function under $\hat{P}$) to $Q^*$ (the optimal Q-function under the real $P$)?
How good is the resulting policy?
Concentration Inequalities
How fast our estimate of a random variable's mean converges to the true underlying value (in terms of the number of samples)
Suppose $X_1, \ldots, X_n$ are a sequence of independent, identically distributed (i.i.d.) random variables with mean $\mu$. Let $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. Suppose that $X_i \in [b_-, b_+]$ with probability 1. Then
$$P\left(\bar{X}_n \ge \mu + \epsilon\right) \le \exp\left(-\frac{2n\epsilon^2}{(b_+ - b_-)^2}\right), \qquad P\left(\bar{X}_n \le \mu - \epsilon\right) \le \exp\left(-\frac{2n\epsilon^2}{(b_+ - b_-)^2}\right)$$
Therefore, if we estimate $\mu$ with $n$ samples, the probability we're off by more than $\epsilon$ is at most $2\exp\left(-\frac{2n\epsilon^2}{(b_+ - b_-)^2}\right)$ (since we can be off on either side). So if we want this probability to be $\delta$, set $\delta = 2\exp\left(-\frac{2n\epsilon^2}{(b_+ - b_-)^2}\right)$ and solve for $\epsilon$:
$$\epsilon = (b_+ - b_-)\sqrt{\frac{\log(2/\delta)}{2n}}$$
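A quick numeric sanity check of this bound (all constants here are arbitrary choices for the demo):

```python
import numpy as np

# Compare P(|mean - mu| > eps) estimated by simulation with the Hoeffding
# bound 2 * exp(-2 n eps^2 / (b+ - b-)^2). Bernoulli samples, so b- = 0, b+ = 1.
rng = np.random.default_rng(1)
mu, n, eps, trials = 0.3, 100, 0.1, 100_000

means = rng.binomial(n, mu, size=trials) / n        # one sample mean per trial
empirical = np.mean(np.abs(means - mu) > eps)
bound = 2 * np.exp(-2 * n * eps**2)                 # (b+ - b-) = 1 here

print(f"empirical: {empirical:.4f}   Hoeffding bound: {bound:.4f}")
# The bound holds, though it is typically quite conservative.
```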
Concentration for Discrete Distributions
Let $z$ be a discrete random variable that takes values in $\{1, 2, \ldots, d\}$, distributed according to $q$. We can write $q$ as a vector $q = [\Pr(z = j)]_{j=1}^d$. Assume we have $N$ i.i.d. samples, and that our empirical estimate of $q$ is $[\hat{q}]_j = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{z_i = j\}$.
We have that $\forall \epsilon > 0$,
$$P\left(\|\hat{q} - q\|_2 \ge \frac{1}{\sqrt{N}} + \epsilon\right) \le e^{-N\epsilon^2}$$
Which implies (using $\|x\|_1 \le \sqrt{d}\,\|x\|_2$):
$$P\left(\|\hat{q} - q\|_1 \ge \sqrt{d}\left(\frac{1}{\sqrt{N}} + \epsilon\right)\right) \le e^{-N\epsilon^2}$$
Let $\delta = e^{-N\epsilon^2}$, i.e., $\epsilon = \sqrt{\log(1/\delta)/N}$.
Then, with probability at least $1 - \delta$,
$$\|\hat{q} - q\|_1 \le \sqrt{d}\left(\frac{1}{\sqrt{N}} + \sqrt{\frac{\log(1/\delta)}{N}}\right) \le c\sqrt{\frac{d\log(1/\delta)}{N}}$$
for some constant $c$.
Using these concentration inequalities, we see that in the “Basic Sample Complexity Analysis”, the state transition estimation error (with $d = |S|$) satisfies, with probability at least $1 - \delta$,
$$\|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\|_1 \le c\sqrt{\frac{|S|\log(1/\delta)}{N}}$$
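A small empirical check of this $1/\sqrt{N}$ rate (a sketch; $d$, $N$, and the random $q$ are arbitrary):

```python
import numpy as np

# Empirical L1 error of the count estimate q_hat for a d-way discrete
# distribution: should shrink roughly like sqrt(d / N).
rng = np.random.default_rng(2)
d = 20
q = rng.dirichlet(np.ones(d))                       # true distribution

for N in [100, 400, 1600, 6400]:
    errs = []
    for _ in range(200):                            # average over repetitions
        counts = np.bincount(rng.choice(d, size=N, p=q), minlength=d)
        errs.append(np.abs(counts / N - q).sum())   # ||q_hat - q||_1
    print(f"N={N:5d}   mean L1 error = {np.mean(errs):.3f}")
# Quadrupling N roughly halves the error (the 1/sqrt(N) rate).
```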
Now: relate the error in $\hat{P}$ to the error in $\hat{Q}^\pi$
Recall the Bellman equation for a fixed policy:
$$Q^\pi(s,a) = r(s,a) + \gamma E_{s' \sim P(\cdot \mid s,a)}\left[V^\pi(s')\right]$$
So if we write out the transition function $P$, the Q-function $Q^\pi$, reward function $r$, and value function $V^\pi$ as matrices/vectors ($Q^\pi, r \in \mathbb{R}^{|S||A|}$, $P \in \mathbb{R}^{|S||A| \times |S|}$, $V^\pi \in \mathbb{R}^{|S|}$):
$$Q^\pi = r + \gamma P V^\pi$$
Remember that the V function is just an expectation over the Q function (a sum over actions), so we can also write it as an expectation: $V^\pi = \Pi Q^\pi$, where $\Pi \in \mathbb{R}^{|S| \times |S||A|}$ is a probability matrix with $\Pi_{s,(s,a)} = \pi(a \mid s)$.
With this (and $P^\pi = P\Pi$, the transition matrix on state-action pairs induced by $\pi$):
$$Q^\pi = r + \gamma P \Pi Q^\pi = r + \gamma P^\pi Q^\pi$$
So
$$Q^\pi = (I - \gamma P^\pi)^{-1} r$$
And it turns out the matrix $(I - \gamma P^\pi)$ is always invertible: $P^\pi$ is a stochastic matrix, so its eigenvalues have magnitude at most 1, and with $\gamma < 1$ no eigenvalue of $\gamma P^\pi$ can equal 1.
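In code, policy evaluation in this matrix form is a single linear solve. A small illustrative sketch with a random MDP and random policy (the $\Pi$ construction follows the definition above):

```python
import numpy as np

# Policy evaluation via the matrix identity Q^pi = (I - gamma P^pi)^{-1} r,
# with P^pi = P @ Pi. Random MDP and policy, purely illustrative.
rng = np.random.default_rng(3)
S, A, gamma = 4, 2, 0.9
SA = S * A

P = rng.dirichlet(np.ones(S), size=(S, A)).reshape(SA, S)   # rows = (s, a) pairs
r = rng.uniform(0, 1, size=SA)
pi = rng.dirichlet(np.ones(A), size=S)                      # pi(a | s)

# Pi in R^{|S| x |S||A|} with Pi[s, (s, a)] = pi(a | s)
Pi = np.zeros((S, SA))
for s in range(S):
    Pi[s, s * A : (s + 1) * A] = pi[s]

P_pi = P @ Pi                                               # (|S||A|, |S||A|)
Q_pi = np.linalg.solve(np.eye(SA) - gamma * P_pi, r)        # exact Q^pi
V_pi = Pi @ Q_pi                                            # V^pi = Pi Q^pi
print("Q^pi:", Q_pi.round(3), "\nV^pi:", V_pi.round(3))
```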
Simulation Lemma:
$$Q^\pi - \hat{Q}^\pi = \gamma (I - \gamma \hat{P}^\pi)^{-1}(P - \hat{P})V^\pi$$
Proof:
$$Q^\pi - \hat{Q}^\pi = Q^\pi - (I - \gamma \hat{P}^\pi)^{-1} r = (I - \gamma \hat{P}^\pi)^{-1}\left[(I - \gamma \hat{P}^\pi)Q^\pi - r\right]$$
Substituting $r = (I - \gamma P^\pi)Q^\pi$:
$$= (I - \gamma \hat{P}^\pi)^{-1}\left[(I - \gamma \hat{P}^\pi) - (I - \gamma P^\pi)\right]Q^\pi = \gamma (I - \gamma \hat{P}^\pi)^{-1}(P^\pi - \hat{P}^\pi)Q^\pi$$
And since $(P^\pi - \hat{P}^\pi)Q^\pi = (P - \hat{P})\Pi Q^\pi = (P - \hat{P})V^\pi$, this equals $\gamma(I - \gamma \hat{P}^\pi)^{-1}(P - \hat{P})V^\pi$. $\blacksquare$
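The lemma can be checked numerically. A sketch (random true model $P$, deliberately wrong model $\hat{P}$, uniform policy; all sizes illustrative):

```python
import numpy as np

# Check: Q^pi - Q_hat^pi == gamma (I - gamma P_hat^pi)^{-1} (P - P_hat) V^pi
rng = np.random.default_rng(4)
S, A, gamma = 4, 2, 0.9
SA = S * A

P = rng.dirichlet(np.ones(S), size=(S, A)).reshape(SA, S)       # true model
P_hat = rng.dirichlet(np.ones(S), size=(S, A)).reshape(SA, S)   # wrong model
r = rng.uniform(0, 1, size=SA)
Pi = np.kron(np.eye(S), np.ones((1, A)) / A)                    # uniform policy

Q = np.linalg.solve(np.eye(SA) - gamma * P @ Pi, r)             # Q^pi
Q_hat = np.linalg.solve(np.eye(SA) - gamma * P_hat @ Pi, r)     # Q_hat^pi
V = Pi @ Q                                                      # V^pi

lhs = Q - Q_hat
rhs = gamma * np.linalg.solve(np.eye(SA) - gamma * P_hat @ Pi, (P - P_hat) @ V)
print("simulation lemma holds:", np.allclose(lhs, rhs))         # -> True
```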
Another Lemma:
Given $P^\pi$ and any vector $v \in \mathbb{R}^{|S||A|}$, we have
$$\left\|(I - \gamma P^\pi)^{-1} v\right\|_\infty \le \frac{\|v\|_\infty}{1 - \gamma}$$
The “Q function” corresponding to “reward” $v$ is at most $(1-\gamma)^{-1}$ times larger.
Where does this $(1-\gamma)^{-1}$ come from? The geometric series: $\sum_{t=0}^\infty \gamma^t c = \frac{c}{1-\gamma}$
Let $(I - \gamma P^\pi)^{-1} = \sum_{t=0}^\infty \gamma^t (P^\pi)^t$ (the Neumann series, which converges since $\gamma < 1$).
Then
$$\left\|(I - \gamma P^\pi)^{-1} v\right\|_\infty \le \sum_{t=0}^\infty \gamma^t \left\|(P^\pi)^t v\right\|_\infty \le \sum_{t=0}^\infty \gamma^t \|v\|_\infty = \frac{\|v\|_\infty}{1 - \gamma}$$
using that each $(P^\pi)^t$ is stochastic and therefore cannot increase the infinity norm.
Putting the lemmas together,
$$\|Q^\pi - \hat{Q}^\pi\|_\infty = \gamma\left\|(I - \gamma \hat{P}^\pi)^{-1}(P - \hat{P})V^\pi\right\|_\infty \le \frac{\gamma}{1-\gamma}\left\|(P - \hat{P})V^\pi\right\|_\infty
$$
Here we used the second lemma with the special case $v = \gamma(P - \hat{P})V^\pi$. By Hölder's inequality,
$$\left\|(P - \hat{P})V^\pi\right\|_\infty \le \max_{s,a}\|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\|_1 \, \|V^\pi\|_\infty$$
We can bound $\|V^\pi\|_\infty \le \frac{R_{\max}}{1-\gamma}$ (rewards are at most $R_{\max}$ per step, so another geometric series).
With this bound on $\|V^\pi\|_\infty$, and with the previous concentration bound on $\|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\|_1$ (applied with a union bound over all $(s,a)$, which only changes the constant inside the log), we conclude that with probability at least $1-\delta$:
$$\|Q^\pi - \hat{Q}^\pi\|_\infty \le \frac{c\,\gamma R_{\max}}{(1-\gamma)^2}\sqrt{\frac{|S|\log(1/\delta)}{N}}$$
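We can also see the predicted $1/\sqrt{N}$ scaling empirically. A sketch with a random MDP and uniform policy (constants illustrative):

```python
import numpy as np

# Check the 1/sqrt(N) scaling of ||Q^pi - Q_hat^pi||_inf.
rng = np.random.default_rng(6)
S, A, gamma = 5, 2, 0.9
SA = S * A

P = rng.dirichlet(np.ones(S), size=(S, A)).reshape(SA, S)
r = rng.uniform(0, 1, size=SA)
Pi = np.kron(np.eye(S), np.ones((1, A)) / A)        # uniform policy

def q_pi(model):
    """Exact policy evaluation under a given transition model."""
    return np.linalg.solve(np.eye(SA) - gamma * model @ Pi, r)

Q_true = q_pi(P)
for N in [100, 400, 1600]:
    counts = np.array([np.bincount(rng.choice(S, size=N, p=row), minlength=S)
                       for row in P])                # N samples per (s, a)
    err = np.abs(q_pi(counts / N) - Q_true).max()
    print(f"N={N:5d}   ||Q^pi - Q_hat^pi||_inf = {err:.4f}")
# Quadrupling N should roughly halve the error.
```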
What about $\|Q^* - \hat{Q}^*\|_\infty$?
Supremum: the least upper bound of a function (the largest value the function can approach). One useful identity:
$$\left|\sup_x f(x) - \sup_x g(x)\right| \le \sup_x |f(x) - g(x)|$$
And since $Q^*(s,a) = \sup_\pi Q^\pi(s,a)$, the identity gives
$$\|Q^* - \hat{Q}^*\|_\infty \le \sup_\pi \|Q^\pi - \hat{Q}^\pi\|_\infty$$
Our bound on $\|Q^\pi - \hat{Q}^\pi\|_\infty$ holds for all $\pi$ simultaneously (the concentration of $\hat{P}$ does not depend on $\pi$), so $\|Q^* - \hat{Q}^*\|_\infty$ satisfies the same bound.
What about $\|Q^* - Q^{\hat{\pi}^*}\|_\infty$, where $\hat{\pi}^*$ is the policy that is greedy with respect to $\hat{Q}^*$? A similar argument, combined with the standard greedy-policy suboptimality bound $V^*(s) - V^{\hat{\pi}^*}(s) \le \frac{2}{1-\gamma}\|\hat{Q}^* - Q^*\|_\infty$, gives the same rate with an extra $(1-\gamma)^{-1}$ factor.
Fitted Q-Iteration
Abstract model of exact Q-iteration: $Q^{k+1} \leftarrow T Q^k$, where $T$ is the Bellman operator $(TQ)(s,a) = r(s,a) + \gamma E_{s' \sim P(\cdot \mid s,a)}[\max_{a'} Q(s',a')]$
Abstract model of approximate fitted Q-iteration: $\hat{Q}^{k+1} \leftarrow \arg\min_{Q \in \mathcal{Q}} \|Q - \hat{T}\hat{Q}^k\|$ (fit a function in our class $\mathcal{Q}$ to the approximate Bellman target)
Where $\hat{T}$ is the approximate Bellman operator: $(\hat{T}Q)(s,a) = \hat{r}(s,a) + \gamma E_{s' \sim \hat{P}(\cdot \mid s,a)}[\max_{a'} Q(s',a')]$
Note: we will not actually be computing $\hat{P}$ and $\hat{r}$; they are just the implicit effect of averaging together transitions in the data (different samples contributing different gradients)
We will use the infinity norm here because there is no convergence guarantee when using the L2 norm
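For concreteness, here is a minimal tabular sketch of this scheme (sizes and learning rate are arbitrary; the "fit" step is a simple soft update standing in for a real function approximator, so this is an illustration, not the general algorithm):

```python
import numpy as np

# Tabular "fitted" Q-iteration with sampled Bellman targets: we never build
# P_hat explicitly; averaging over resampled targets plays that role.
rng = np.random.default_rng(5)
S, A, gamma, lr = 5, 2, 0.9, 0.5

P = rng.dirichlet(np.ones(S), size=(S, A))          # true dynamics
r = rng.uniform(0, 1, size=(S, A))                  # rewards (known here)

Q = np.zeros((S, A))
for k in range(500):
    # one fresh sampled transition per (s, a): the source of sampling error
    s_next = np.array([[rng.choice(S, p=P[s, a]) for a in range(A)]
                       for s in range(S)])
    target = r + gamma * Q[s_next].max(axis=-1)     # hat-T applied via samples
    Q += lr * (target - Q)                          # imperfect fit: approx error

# Compare against exact Q-iteration with the true Bellman operator T
Q_star = np.zeros((S, A))
for _ in range(1000):
    Q_star = r + gamma * P @ Q_star.max(axis=-1)
print("||Q - Q*||_inf:", np.abs(Q - Q_star).max().round(3))
```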
Errors come from:
$T \neq \hat{T}$: sampling error
$\hat{Q}^{k+1} \neq \hat{T}\hat{Q}^k$: approximation error
Sampling Error:
Note: for any $Q$,
$$\|\hat{T}Q - TQ\|_\infty \le \max_{s,a}|\hat{r}(s,a) - r(s,a)| + \gamma \max_{s,a}\left|E_{\hat{P}}[\max_{a'}Q(s',a')] - E_P[\max_{a'}Q(s',a')]\right|$$
so we bound the two terms separately.
$|\hat{r}(s,a) - r(s,a)|$
Estimation error of the mean of a bounded random variable: use Hoeffding's inequality!
$$|\hat{r}(s,a) - r(s,a)| \le 2R_{\max}\sqrt{\frac{\log(1/\delta)}{2N}}$$
$\left|E_{\hat{P}}[\max_{a'}Q(s',a')] - E_P[\max_{a'}Q(s',a')]\right|$
$= \left|\sum_{s'}\left(\hat{P}(s' \mid s,a) - P(s' \mid s,a)\right)\max_{a'}Q(s',a')\right|$
$\le \sum_{s'}\left|\hat{P}(s' \mid s,a) - P(s' \mid s,a)\right|\,\max_{s',a'}|Q(s',a')|$
$= \|\hat{P}(\cdot \mid s,a) - P(\cdot \mid s,a)\|_1\,\|Q\|_\infty$
$\le c\,\|Q\|_\infty\sqrt{\frac{\log(1/\delta)}{N}}$ (with $c$ absorbing the $\sqrt{|S|}$ factor from the $L_1$ concentration bound)
So, with probability at least $1-\delta$,
$$\|\hat{T}Q - TQ\|_\infty \le 2R_{\max}\sqrt{\frac{\log(1/\delta)}{2N}} + \gamma\,c\,\|Q\|_\infty\sqrt{\frac{\log(1/\delta)}{N}}$$
And since $\|Q\|_\infty \le \frac{R_{\max}}{1-\gamma}$ (assuming the iterates stay in the range of valid Q-functions), the whole sampling error is $O\left(\frac{R_{\max}}{1-\gamma}\sqrt{\frac{\log(1/\delta)}{N}}\right)$.
Approximation Error:
Assume a bounded per-iteration error: $\|\hat{Q}^{k+1} - T\hat{Q}^k\|_\infty \le \epsilon_k$
This is a strong assumption!
Putting it together: since $T$ is a $\gamma$-contraction in the infinity norm,
$$\|\hat{Q}^{k+1} - Q^*\|_\infty \le \|\hat{Q}^{k+1} - T\hat{Q}^k\|_\infty + \|T\hat{Q}^k - TQ^*\|_\infty \le \epsilon_k + \gamma\|\hat{Q}^k - Q^*\|_\infty$$
Unrolling this recursion over $k$:
$$\limsup_{k \to \infty}\|\hat{Q}^k - Q^*\|_\infty \le \frac{\max_k \epsilon_k}{1-\gamma}$$
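A tiny numeric illustration of this recursion (arbitrary constants): the error contracts geometrically down to the $\epsilon/(1-\gamma)$ floor.

```python
# Unroll ||Q^{k+1} - Q*|| <= eps + gamma * ||Q^k - Q*||: the error contracts
# geometrically toward the eps / (1 - gamma) floor. Arbitrary constants.
gamma, eps = 0.9, 0.05
err = 10.0
for k in range(100):
    err = eps + gamma * err
print(f"error after 100 steps: {err:.3f}   floor eps/(1-gamma): {eps / (1 - gamma):.3f}")
```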
Note: from the previous proofs, the sampling and approximation errors $\epsilon_k$ are themselves $O((1-\gamma)^{-1})$, so the overall bound picks up yet another factor of the horizon.
More advanced results can be derived with $p$-norms under some state distribution (rather than the worst-case infinity norm)