Auto-Encoding Variational Bayes

In the last post I introduced probabilistic programming. The biggest problem with this idea is finding an efficient approximation of the posterior for arbitrary probabilistic models. Auto-Encoding Variational Bayes (AEVB) is a great step in the right direction.

Consider a dataset X \in \mathbb{R}^{D \times N}. It consists of N i.i.d. continuous samples of dimension D. The data is assumed to be generated by a random process involving an unobserved continuous random variable z. The generation process works as follows:

First, the latent variable z is sampled from the prior distribution p_\theta(z).

Second, x is sampled from the conditional distribution p_\theta(x|z).
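This two-step process can be sketched in a few lines of NumPy. The standard normal prior and the linear-Gaussian decoder below are toy assumptions for illustration, not part of the general model:

```python
import numpy as np

rng = np.random.default_rng(0)

D, N = 2, 5  # data dimension and number of samples (toy values)

# Step 1: draw the latent variable z from the prior p_theta(z),
# assumed here to be a standard normal.
z = rng.standard_normal(N)

# Step 2: draw x from the conditional p_theta(x|z); as a toy choice,
# the decoder is a linear map (weights W are part of theta) with
# Gaussian observation noise.
W = rng.standard_normal((D, 1))
x = W @ z[None, :] + 0.1 * rng.standard_normal((D, N))

print(x.shape)  # (2, 5)
```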

The parameters of the distributions are summarized in \theta. It is important to note that the distributions are differentiable w.r.t. the parameters \theta and the latent variables z. The posterior p_\theta(z|x) is in general intractable. This is the case even for moderately complicated likelihood functions, e.g. a neural network with a nonlinear hidden layer. We therefore approximate the posterior with a recognition model q_\phi(z|x). The unobserved variable z has an interpretation as a latent representation or code. Therefore, q_\phi(z|x) can be seen as a probabilistic encoder and p_\theta(x|z) as a probabilistic decoder.

Our goal is to maximize the marginal likelihood p_\theta(x), keeping this approximation in mind.

\underset{\theta, \phi}{\text{max }}p_\theta(x)

Since all samples x are i.i.d and the logarithm is monotonically increasing, the problem can be rewritten as:

\underset{\theta, \phi}{\text{max }}\sum_{i=1}^{N}\log p_\theta(x^{(i)})

It can be shown that the marginal likelihood of the datapoint x^{(i)} has a lower bound, often referred to as the evidence lower bound (ELBO).

\log p_\theta(x^{(i)})\geq \mathcal{L}(\theta,\phi;x^{(i)})


\mathcal{L}(\theta,\phi;x^{(i)})=-D_{KL}(q_\phi(z|x^{(i)}) \| p_\theta(z))+\mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)]
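For completeness, the bound follows from Jensen's inequality applied to the marginal likelihood:

\log p_\theta(x^{(i)}) = \log \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\frac{p_\theta(x^{(i)},z)}{q_\phi(z|x^{(i)})}\right] \geq \mathbb{E}_{q_\phi(z|x^{(i)})}\left[\log \frac{p_\theta(x^{(i)},z)}{q_\phi(z|x^{(i)})}\right]

Writing p_\theta(x^{(i)},z)=p_\theta(x^{(i)}|z)\,p_\theta(z) and splitting the logarithm yields exactly the two terms of \mathcal{L}.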

D_{KL} is the Kullback-Leibler divergence, a measure of how much one probability distribution diverges from another. This loss function has connections to an auto-encoder: the Kullback-Leibler divergence acts as a regularizer, and the second term is an expected negative reconstruction error.
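For the common choice of a diagonal Gaussian encoder and a standard normal prior, the KL term has a closed form. A minimal NumPy sketch (the function name is mine):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), the closed form
    for a diagonal Gaussian against a standard normal prior."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# The divergence vanishes exactly when q already equals the prior,
# and grows as the encoder output drifts away from it.
print(kl_to_standard_normal(np.zeros(2), np.zeros(2)))          # zero
print(kl_to_standard_normal(np.array([1.0, -1.0]), np.zeros(2)))  # positive
```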

The loss function can be optimized by gradient ascent. Therefore, we need the gradient of \mathcal{L} with respect to \theta and \phi. This is not directly possible, since z can only be obtained by sampling from q_\phi(z|x^{(i)}), and we cannot propagate gradients through the sampling operation. The idea is to reparameterize z, so that it becomes a deterministic function with an auxiliary noise variable \epsilon as input.


\epsilon\sim p(\epsilon)

For example, in the Gaussian case z\sim \mathcal{N}(\mu,\sigma^2), a valid reparameterization is

z=\mu+\sigma \epsilon

\epsilon \sim \mathcal{N}(0,1).

\mu and \sigma are parameterized functions that depend on x^{(i)}, for example neural networks. Neural networks have the advantage of being very flexible.
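As a quick sanity check of the reparameterization trick, here is a minimal NumPy sketch. The fixed \mu and \sigma below stand in for encoder outputs, which in a real model would come from a neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the encoder outputs mu(x) and sigma(x).
mu, sigma = 0.5, 2.0

# Reparameterization: sample the auxiliary noise eps ~ N(0, 1), then
# apply the deterministic transform z = mu + sigma * eps.
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

# The samples follow N(mu, sigma^2), but gradients w.r.t. mu and sigma
# can now flow through the deterministic transform.
print(z.mean(), z.std())  # close to 0.5 and 2.0
```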

Today there are libraries available that differentiate a given function automatically, e.g. Theano or TensorFlow. That is very useful, since taking the derivatives manually is a time-consuming and error-prone task.

You can have a look at my implementation of AEVB. The code is available on GitHub. I have tested it with the MNIST dataset, which consists of images of handwritten digits from zero to nine. The parameters were chosen fairly arbitrarily and were not optimized. The optimization algorithm is plain old stochastic gradient descent (SGD). Furthermore, I had some problems with the stability of the algorithm, but I could nevertheless produce some good-looking results.
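The SGD update itself is simple. A minimal sketch of one gradient-ascent step on the ELBO, assuming the gradient comes from an autodiff library; all names here are hypothetical:

```python
import numpy as np

def sgd_step(params, grads, learning_rate=0.01):
    """One step of gradient *ascent*, since we maximize the lower bound.
    params and grads are matching lists of NumPy arrays; in practice
    grads would come from autodiff (e.g. Theano or TensorFlow)."""
    return [p + learning_rate * g for p, g in zip(params, grads)]

# Toy parameters and gradients to illustrate the update rule.
params = [np.zeros(3), np.ones(2)]
grads = [np.ones(3), np.full(2, -0.5)]
params = sgd_step(params, grads)
```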

Figure 1: 2D feature space of the MNIST images

AEVB can be used as a dimensionality reduction algorithm. In Figure 1 you can see a 2D feature space. Different colors correspond to different digits. That means AEVB manages to cluster the same digits and separate different digits, to a certain degree.

Figure 2: Randomly generated numbers

We can also generate new samples from the model, as shown in Figure 2. The images are similar to actual MNIST images, but a little blurry.

Overall I am sure that the algorithm could be improved by optimizing the hyperparameters and using better optimizers such as AdaGrad, RMSProp, or Adam. The main point of this implementation was to better understand the ideas and principles of AEVB and variational inference in general. That is something I have definitely achieved.
