Latent Variable Model
$p_\theta(x \mid z)$ ⇒ a latent variable model (e.g., a VAE) that transforms a latent sample $z$ into $x$. We have data points $x$, and we want to model the distribution of $x$ with the help of a latent variable $z$.
How do we train latent variable models?
- Model: $p_\theta(x)$
- Data: $\mathcal{D} = \{x_1, x_2, x_3, \dots, x_N\}$
- MLE fit: $\theta \leftarrow \arg\max_\theta \frac{1}{N}\sum_i \log p_\theta(x_i)$
- Marginal likelihood: $p(x) = \int p(x \mid z)\, p(z)\, dz$
- $\theta \leftarrow \arg\max_\theta \frac{1}{N}\sum_i \log \left( \int p_\theta(x_i \mid z)\, p(z)\, dz \right)$ — the integral is completely intractable (see the sketch after this list)
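Below is a minimal sketch (NumPy, using a hypothetical toy linear-Gaussian decoder $p(x \mid z) = \mathcal{N}(Wz, I)$ with prior $p(z) = \mathcal{N}(0, I)$; all names and sizes are made up for illustration) of what a naive Monte Carlo estimate of this integral looks like. It only works because the toy latent space is 2-dimensional; for a deep, high-dimensional model, most samples from $p(z)$ contribute essentially nothing to $p(x_i \mid z)$, which is why we need a smarter estimator.

```python
# A naive Monte Carlo estimate of log p(x) = log E_{z~p(z)}[p(x|z)]
# for a hypothetical toy linear-Gaussian decoder p(x|z) = N(Wz, I), p(z) = N(0, I).
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 2, 5
W = rng.normal(size=(d_x, d_z))                  # hypothetical decoder weights

def log_gaussian(x, mean):
    # log N(x; mean, I), summed over dimensions
    return -0.5 * np.sum((x - mean) ** 2 + np.log(2 * np.pi), axis=-1)

def naive_log_marginal(x, n_samples=100_000):
    # log p(x) ≈ log (1/M) Σ_j p(x|z_j),  z_j ~ p(z)
    z = rng.normal(size=(n_samples, d_z))        # sample from the prior
    log_terms = log_gaussian(x, z @ W.T)         # log p(x|z_j)
    return np.logaddexp.reduce(log_terms) - np.log(n_samples)

x = W @ rng.normal(size=d_z) + rng.normal(size=d_x)   # one synthetic data point
print(naive_log_marginal(x))   # feasible in 2-D, hopeless for deep latent models
```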
Estimating the log-likelihood
🧙🏽‍♂️ Use the expected log-likelihood
$$\theta \leftarrow \arg\max_\theta \frac{1}{N}\sum_i E_{z \sim p(z \mid x_i)}\big[\log p_\theta(x_i, z)\big]$$
Intuition:
- Guess the most likely $z$ given $x_i$, and pretend it is the right one
- But there are many possible values of $z$, so use the distribution $p(z \mid x_i)$
- But how do we calculate $p(z \mid x_i)$?
- Approximate it with $q_i(z) = \mathcal{N}(\mu_i, \sigma_i)$
- Note that each data point $x_i$ gets its own $q_i$
- With any $q_i(z)$, we can construct a lower bound on $\log p(x_i)$
- By maximizing this bound (assuming the bound is tight), we push up on $\log p(x_i)$
$$\log p(x_i) = \log \int_z p(x_i \mid z)\, p(z) = \log \int_z p(x_i \mid z)\, p(z)\, \frac{q_i(z)}{q_i(z)} = \log E_{z \sim q_i(z)}\left[\frac{p(x_i \mid z)\, p(z)}{q_i(z)}\right]$$
Note:
Jensen's inequality (for a concave function):
$$\log E[y] \geq E[\log y]$$
So:
$$
\begin{aligned}
\log p(x_i) &= \log E_{z \sim q_i(z)}\left[\frac{p(x_i \mid z)\, p(z)}{q_i(z)}\right] \\
&\geq E_{z \sim q_i(z)}\left[\log \frac{p(x_i \mid z)\, p(z)}{q_i(z)}\right] \\
&= E_{z \sim q_i(z)}\big[\log p(x_i \mid z) + \log p(z)\big] - E_{z \sim q_i(z)}\big[\log q_i(z)\big] \\
&= E_{z \sim q_i(z)}\big[\log p(x_i \mid z) + \log p(z)\big] + \mathcal{H}(q_i)
\end{aligned}
$$
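As a sanity check, this bound can be estimated by sampling from $q_i$. Here is a minimal sketch (NumPy, same hypothetical toy linear-Gaussian decoder as above; the particular $\mu_i, \sigma_i$ are arbitrary) of a Monte Carlo estimate of $\mathcal{L}_i$:

```python
# Monte Carlo estimate of the lower bound L_i for a chosen diagonal-Gaussian q_i(z),
# under the same hypothetical toy model p(x|z) = N(Wz, I), p(z) = N(0, I).
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 2, 5
W = rng.normal(size=(d_x, d_z))
x_i = W @ rng.normal(size=d_z) + rng.normal(size=d_x)

def log_gaussian(x, mean):
    return -0.5 * np.sum((x - mean) ** 2 + np.log(2 * np.pi), axis=-1)

def elbo(x, mu_i, sigma_i, n_samples=10_000):
    z = mu_i + sigma_i * rng.normal(size=(n_samples, d_z))   # z ~ q_i(z)
    log_p_x_given_z = log_gaussian(x, z @ W.T)               # log p(x|z)
    log_p_z = log_gaussian(z, 0.0)                           # log p(z), standard normal
    entropy_q = 0.5 * np.sum(np.log(2 * np.pi * np.e * sigma_i ** 2))   # H(q_i)
    return np.mean(log_p_x_given_z + log_p_z) + entropy_q

# Any q_i gives a lower bound on log p(x_i); a better q_i gives a tighter bound.
print(elbo(x_i, mu_i=np.zeros(d_z), sigma_i=np.ones(d_z)))
```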
Note about entropy:
$$\mathcal{H}(p) = -E_{x \sim p(x)}\big[\log p(x)\big] = -\int_x p(x) \log p(x)\, dx$$
Intuitions:
- How random is the random variable?
- How large is the log probability in expectation under itself?
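For reference (a standard closed form, useful because the $q_i(z)$ above is a diagonal Gaussian):
$$\mathcal{H}\big(\mathcal{N}(\mu, \sigma^2)\big) = \frac{1}{2}\log\big(2\pi e\, \sigma^2\big), \qquad \mathcal{H}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma_1^2, \dots, \sigma_d^2))\big) = \sum_{k=1}^{d} \frac{1}{2}\log\big(2\pi e\, \sigma_k^2\big)$$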
Also note about KL-Divergence:
$$D_{KL}(q \,\|\, p) = E_{x \sim q(x)}\left[\log \frac{q(x)}{p(x)}\right] = E_{x \sim q(x)}\big[\log q(x)\big] - E_{x \sim q(x)}\big[\log p(x)\big] = -E_{x \sim q(x)}\big[\log p(x)\big] - \mathcal{H}(q)$$
Intuitions:
- How different are two distributions?
- How small is the expected log probability of one distribution under another, minus entropy?
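As a concrete example (a standard closed form that reappears in the VAE objective below), the KL divergence between a diagonal Gaussian and a standard normal prior is:
$$D_{KL}\big(\mathcal{N}(\mu, \sigma^2) \,\big\|\, \mathcal{N}(0, 1)\big) = \frac{1}{2}\big(\sigma^2 + \mu^2 - 1 - \log \sigma^2\big)$$
summed over dimensions in the diagonal case.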
So the variational approximation
$$\log p(x_i) \geq \underbrace{E_{z \sim q_i(z)}\big[\log p(x_i \mid z) + \log p(z)\big] + \mathcal{H}(q_i)}_{\mathcal{L}_i(p,\, q_i)}$$
So what makes a good $q_i(z)$?
- $q_i(z)$ should approximate $p(z \mid x_i)$
- Compare them in terms of the KL divergence $D_{KL}(q_i(z) \,\|\, p(z \mid x_i))$
Why?
$$
\begin{aligned}
D_{KL}\big(q_i(z) \,\|\, p(z \mid x_i)\big) &= E_{z \sim q_i(z)}\left[\log \frac{q_i(z)}{p(z \mid x_i)}\right] = E_{z \sim q_i(z)}\left[\log \frac{q_i(z)\, p(x_i)}{p(x_i, z)}\right] \\
&= -E_{z \sim q_i(z)}\big[\log p(x_i \mid z) + \log p(z)\big] + E_{z \sim q_i(z)}\big[\log q_i(z)\big] + E_{z \sim q_i(z)}\big[\log p(x_i)\big] \\
&= -E_{z \sim q_i(z)}\big[\log p(x_i \mid z) + \log p(z)\big] - \mathcal{H}(q_i) + \log p(x_i) \\
&= -\mathcal{L}_i(p, q_i) + \log p(x_i)
\end{aligned}
$$
Also:
$$\log p(x_i) = D_{KL}\big(q_i(z) \,\|\, p(z \mid x_i)\big) + \mathcal{L}_i(p, q_i)$$
In fact, minimizing the KL divergence has the effect of tightening the bound!

But how do we improve $q_i$?
- We can let $q_i = \mathcal{N}(\mu_i, \sigma_i)$ and use the gradients $\nabla_{\mu_i} \mathcal{L}_i(p, q_i)$ and $\nabla_{\sigma_i} \mathcal{L}_i(p, q_i)$
- But that is too many parameters! We would have $|\theta| + (|\mu_i| + |\sigma_i|) \times N$ parameters
- So can we instead use $q_i(z) = q_\phi(z \mid x_i) \approx p(z \mid x_i)$?
- Amortized Variational Inference!
Amortized Variational Inference

$$\mathcal{L}_i = \underbrace{E_{z \sim q_\phi(z \mid x_i)}\big[\log p_\theta(x_i \mid z) + \log p(z)\big]}_{J(\phi)\, =\, E_{z \sim q_\phi(z \mid x_i)}[r(x_i,\, z)]} + \mathcal{H}\big(q_\phi(z \mid x_i)\big)$$
We can directly calculate the derivative of the entropy term, but the first term $J(\phi)$, where $r(x_i, z) = \log p_\theta(x_i \mid z) + \log p(z)$, seems a bit problematic.
But this form looks a lot like a policy gradient!
$$\nabla_\phi J(\phi) \approx \frac{1}{M} \sum_j \nabla_\phi \log q_\phi(z_j \mid x_i)\, r(x_i, z_j)$$
What's wrong with this gradient?
- It is a perfectly viable approach, but not the best one
- It tends to suffer from high variance (see the sketch below)
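Here is a minimal sketch (PyTorch, with hypothetical toy encoder/decoder networks and unit-variance Gaussians; names and sizes are illustrative) of this score-function (REINFORCE-style) estimator for $\nabla_\phi J(\phi)$. The sample $z$ and the "reward" $r(x_i, z)$ are treated as constants, and the gradient flows only through $\log q_\phi(z \mid x_i)$:

```python
# Score-function (policy-gradient-style) estimator of grad_phi J(phi),
# with r(x, z) = log p_theta(x|z) + log p(z) treated as a fixed reward.
import math
import torch
import torch.nn as nn

d_x, d_z, M = 5, 2, 64
LOG2PI = math.log(2 * math.pi)

encoder = nn.Sequential(nn.Linear(d_x, 32), nn.Tanh(), nn.Linear(32, 2 * d_z))  # -> (mu, log_sigma)
decoder = nn.Sequential(nn.Linear(d_z, 32), nn.Tanh(), nn.Linear(32, d_x))      # mean of p(x|z)

def r(x, z):
    # r(x, z) = log p_theta(x|z) + log p(z), both unit-variance Gaussians
    log_p_x_given_z = -0.5 * (((x - decoder(z)) ** 2).sum(-1) + d_x * LOG2PI)
    log_p_z = -0.5 * ((z ** 2).sum(-1) + d_z * LOG2PI)
    return log_p_x_given_z + log_p_z

x_i = torch.randn(d_x)                                  # one synthetic data point
mu, log_sigma = encoder(x_i).chunk(2, dim=-1)
q = torch.distributions.Normal(mu, log_sigma.exp())

z = q.sample((M,))                                      # no gradient through the sample
reward = r(x_i, z).detach()                             # treated as a constant "reward"
surrogate = (q.log_prob(z).sum(-1) * reward).mean()     # (1/M) Σ_j log q_phi(z_j|x_i) r(x_i, z_j)
surrogate.backward()                                    # fills gradients of encoder params only
```

Because the gradient only sees $r$ through a scalar weight on $\log q_\phi$, the estimate is noisy, which is exactly the high-variance problem noted above.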
The reparameterization trick
⚠️ In RL we cannot use this trick because we cannot compute gradients through the transition dynamics, but in variational inference we can.
$$q_\phi(z \mid x) = \mathcal{N}\big(\mu_\phi(x), \sigma_\phi(x)\big), \qquad z = \mu_\phi(x) + \epsilon\, \sigma_\phi(x), \quad \epsilon \sim \mathcal{N}(0, 1)$$
This makes $z$ a deterministic function of a noise variable $\epsilon$ that is independent of $\phi$.
So now:
$$J(\phi) = E_{z \sim q_\phi(z \mid x_i)}\big[r(x_i, z)\big] = E_{\epsilon \sim \mathcal{N}(0, 1)}\big[r\big(x_i,\, \mu_\phi(x_i) + \epsilon\, \sigma_\phi(x_i)\big)\big]$$
And now to estimate $\nabla_\phi J(\phi)$:
- Sample $\epsilon_1, \dots, \epsilon_M$ from $\mathcal{N}(0, 1)$ (even a single sample works well)
- $\epsilon$ is treated as a constant when computing the gradient
- $\nabla_\phi J(\phi) \approx \frac{1}{M} \sum_j \nabla_\phi\, r\big(x_i,\, \mu_\phi(x_i) + \epsilon_j\, \sigma_\phi(x_i)\big)$
This is useful because the gradient now uses the derivative of $r$ itself, as sketched below.
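A minimal sketch (PyTorch, same hypothetical toy networks as above) of the reparameterized estimator; note that, unlike the score-function version, the backward pass differentiates through the decoder inside $r$:

```python
# Reparameterization trick: z = mu_phi(x) + eps * sigma_phi(x), eps ~ N(0, 1),
# so grad_phi J(phi) flows through r itself instead of log q_phi.
import math
import torch
import torch.nn as nn

d_x, d_z, M = 5, 2, 1                                    # even a single sample works well
LOG2PI = math.log(2 * math.pi)

encoder = nn.Sequential(nn.Linear(d_x, 32), nn.Tanh(), nn.Linear(32, 2 * d_z))
decoder = nn.Sequential(nn.Linear(d_z, 32), nn.Tanh(), nn.Linear(32, d_x))

def r(x, z):
    # r(x, z) = log p_theta(x|z) + log p(z), both unit-variance Gaussians
    log_p_x_given_z = -0.5 * (((x - decoder(z)) ** 2).sum(-1) + d_x * LOG2PI)
    log_p_z = -0.5 * ((z ** 2).sum(-1) + d_z * LOG2PI)
    return log_p_x_given_z + log_p_z

x_i = torch.randn(d_x)
mu, log_sigma = encoder(x_i).chunk(2, dim=-1)
eps = torch.randn(M, d_z)                                # eps is a constant w.r.t. phi
z = mu + eps * log_sigma.exp()                           # differentiable in mu, sigma
J = r(x_i, z).mean()                                     # (1/M) Σ_j r(x_i, mu + eps_j * sigma)
J.backward()                                             # gradient uses the derivative of r
```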
Another Perspective on $\mathcal{L}_i$
$$
\begin{aligned}
\mathcal{L}_i &= E_{z \sim q_\phi(z \mid x_i)}\big[\log p_\theta(x_i \mid z) + \log p(z)\big] + \mathcal{H}\big(q_\phi(z \mid x_i)\big) \\
&= E_{z \sim q_\phi(z \mid x_i)}\big[\log p_\theta(x_i \mid z)\big] + \underbrace{E_{z \sim q_\phi(z \mid x_i)}\big[\log p(z)\big] + \mathcal{H}\big(q_\phi(z \mid x_i)\big)}_{-D_{KL}(q_\phi(z \mid x_i)\,\|\,p(z))} \\
&= E_{z \sim q_\phi(z \mid x_i)}\big[\log p_\theta(x_i \mid z)\big] - D_{KL}\big(q_\phi(z \mid x_i) \,\|\, p(z)\big) \\
&= E_{\epsilon \sim \mathcal{N}(0, 1)}\big[\log p_\theta\big(x_i \mid \mu_\phi(x_i) + \epsilon\, \sigma_\phi(x_i)\big)\big] - D_{KL}\big(q_\phi(z \mid x_i) \,\|\, p(z)\big) \\
&\approx \log p_\theta\big(x_i \mid \mu_\phi(x_i) + \epsilon\, \sigma_\phi(x_i)\big) - D_{KL}\big(q_\phi(z \mid x_i) \,\|\, p(z)\big)
\end{aligned}
$$
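A minimal sketch (PyTorch, hypothetical toy networks, assuming a unit-variance Gaussian decoder) of this second form of $\mathcal{L}_i$: a single-sample reparameterized reconstruction term plus the closed-form Gaussian KL from earlier:

```python
# Single-sample ELBO in the "reconstruction minus KL" form, with the
# closed-form KL between a diagonal Gaussian and the standard normal prior.
import math
import torch
import torch.nn as nn

d_x, d_z = 5, 2
LOG2PI = math.log(2 * math.pi)
encoder = nn.Sequential(nn.Linear(d_x, 32), nn.Tanh(), nn.Linear(32, 2 * d_z))
decoder = nn.Sequential(nn.Linear(d_z, 32), nn.Tanh(), nn.Linear(32, d_x))

def elbo(x):
    mu, log_sigma = encoder(x).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * log_sigma.exp()                    # reparameterized sample
    recon = -0.5 * (((x - decoder(z)) ** 2).sum(-1) + d_x * LOG2PI)    # log p_theta(x|z)
    # D_KL(N(mu, sigma^2) || N(0, 1)) = 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2), per dimension
    kl = 0.5 * (log_sigma.exp() ** 2 + mu ** 2 - 1 - 2 * log_sigma).sum(-1)
    return recon - kl

print(elbo(torch.randn(d_x)))
```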
Reparameterization trick vs. Policy Gradient
- Policy Gradient
- Can handle both discrete and continuous latent variables
- High variance, requires multiple samples & small learning rates
- Reparameterization Trick
- Only continuous latent variables
- Simple to implement & Low variance
Example Models
VAE: Variational Autoencoder
$$\max_{\theta, \phi} \frac{1}{N} \sum_i \log p_\theta\big(x_i \mid \mu_\phi(x_i) + \epsilon\, \sigma_\phi(x_i)\big) - D_{KL}\big(q_\phi(z \mid x_i) \,\|\, p(z)\big)$$
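A minimal training sketch (PyTorch, hypothetical toy networks and synthetic placeholder data; a real VAE would use a likelihood matched to the data, e.g. Bernoulli for binary images) of this objective, optimizing $\theta$ and $\phi$ jointly on the single-sample ELBO:

```python
# Jointly train theta (decoder) and phi (encoder) by maximizing the average
# single-sample ELBO over minibatches; the data here is a synthetic placeholder.
import math
import torch
import torch.nn as nn

d_x, d_z = 5, 2
LOG2PI = math.log(2 * math.pi)
encoder = nn.Sequential(nn.Linear(d_x, 32), nn.Tanh(), nn.Linear(32, 2 * d_z))   # q_phi(z|x)
decoder = nn.Sequential(nn.Linear(d_z, 32), nn.Tanh(), nn.Linear(32, d_x))       # p_theta(x|z)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

data = torch.randn(1024, d_x)                               # placeholder for a real dataset

for step in range(1000):
    x = data[torch.randint(0, len(data), (64,))]            # minibatch of x_i
    mu, log_sigma = encoder(x).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * log_sigma.exp()         # reparameterized z ~ q_phi(z|x)
    recon = -0.5 * (((x - decoder(z)) ** 2).sum(-1) + d_x * LOG2PI)
    kl = 0.5 * (log_sigma.exp() ** 2 + mu ** 2 - 1 - 2 * log_sigma).sum(-1)
    loss = -(recon - kl).mean()                             # maximize ELBO = minimize -ELBO
    opt.zero_grad()
    loss.backward()
    opt.step()
```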

Conditional Models
Generates $y$ given $x, z$
$$\mathcal{L}_i = E_{z \sim q_\phi(z \mid x_i, y_i)}\big[\log p_\theta(y_i \mid x_i, z) + \log p(z \mid x_i)\big] + \mathcal{H}\big(q_\phi(z \mid x_i, y_i)\big)$$
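A minimal sketch (PyTorch, hypothetical toy networks; for simplicity the conditional prior $p(z \mid x)$ is taken to be a standard normal here, although a learned prior network is also common) of this conditional ELBO:

```python
# Conditional ELBO: the encoder conditions on (x, y), the decoder predicts y
# from (x, z); the conditional prior p(z|x) is simplified to N(0, I) here.
import math
import torch
import torch.nn as nn

d_x, d_y, d_z = 5, 3, 2
LOG2PI = math.log(2 * math.pi)
encoder = nn.Sequential(nn.Linear(d_x + d_y, 32), nn.Tanh(), nn.Linear(32, 2 * d_z))  # q_phi(z|x, y)
decoder = nn.Sequential(nn.Linear(d_x + d_z, 32), nn.Tanh(), nn.Linear(32, d_y))      # p_theta(y|x, z)

def conditional_elbo(x, y):
    mu, log_sigma = encoder(torch.cat([x, y], dim=-1)).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * log_sigma.exp()                       # reparameterized sample
    y_mean = decoder(torch.cat([x, z], dim=-1))
    recon = -0.5 * (((y - y_mean) ** 2).sum(-1) + d_y * LOG2PI)           # log p_theta(y|x, z)
    kl = 0.5 * (log_sigma.exp() ** 2 + mu ** 2 - 1 - 2 * log_sigma).sum(-1)
    return recon - kl

print(conditional_elbo(torch.randn(d_x), torch.randn(d_y)))
```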