In the last post I introduced probabilistic programming. The biggest problem with this idea is finding an efficient approximation of the posterior for arbitrary probabilistic models. Auto-Encoding Variational Bayes (AEVB) is a great step in the right direction.
Consider a dataset $X = \{x^{(i)}\}_{i=1}^N$. It consists of $N$ i.i.d. continuous samples $x^{(i)}$ of dimension D. The data is assumed to be generated by a random process from an unobserved continuous random variable $z$. The generation process looks as follows:
First, the latent variable $z^{(i)}$ is sampled from the prior distribution $p_\theta(z)$.
Second, $x^{(i)}$ is sampled from the conditional distribution $p_\theta(x|z)$.
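The two sampling steps can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the text: a standard normal prior, a linear Gaussian conditional distribution, and the hypothetical names `sample_datapoint`, `W`, `b` and `noise_std`.

```python
import random

def sample_datapoint(weights, bias, noise_std, rng):
    """Ancestral sampling: draw z from the prior, then x given z."""
    latent_dim = len(weights[0])
    # Step 1: z ~ p(z), assumed here to be a standard normal N(0, I)
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    # Step 2: x ~ p(x | z), assumed here to be N(W z + b, noise_std^2 I)
    x = []
    for row, offset in zip(weights, bias):
        mean = sum(w * z_j for w, z_j in zip(row, z)) + offset
        x.append(rng.gauss(mean, noise_std))
    return z, x

rng = random.Random(0)
W = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]  # hypothetical 3x2 weights (D = 3)
b = [0.0, 0.0, 1.0]                        # hypothetical bias
dataset = [sample_datapoint(W, b, 0.1, rng)[1] for _ in range(5)]
```

In AEVB the conditional distribution would typically be a neural network rather than a fixed linear map, but the order of the two steps stays the same.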
The parameters of the distributions are summarized in $\theta$. It is important to note that the distributions are differentiable w.r.t. the parameters $\theta$ and the latent variables $z$. The posterior $p_\theta(z|x)$ is in general intractable. This is the case even for moderately complicated likelihood functions, e.g. a neural network with a nonlinear hidden layer. We approximate the posterior with a recognition model $q_\phi(z|x)$. The unobserved variable $z$ has an interpretation as a latent representation or code. Therefore, $q_\phi(z|x)$ can be seen as a probabilistic encoder and $p_\theta(x|z)$ as a probabilistic decoder.
Our goal is to maximize the marginal likelihood $p_\theta(x^{(1)}, \dots, x^{(N)})$, with the approximation $q_\phi(z|x)$ in mind.
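Since all samples are i.i.d. and the logarithm is monotonically increasing, the problem can be rewritten as:
$$\max_\theta \log p_\theta(x^{(1)}, \dots, x^{(N)}) = \max_\theta \sum_{i=1}^N \log p_\theta(x^{(i)})$$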
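It can be shown that the marginal log-likelihood of the datapoint $x^{(i)}$ has a lower bound $\mathcal{L}(\theta, \phi; x^{(i)})$. It is often referred to as the evidence lower bound (ELBO).
$$\log p_\theta(x^{(i)}) \geq \mathcal{L}(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)) + \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)]$$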
$D_{KL}$ is the Kullback-Leibler divergence. It measures how much one probability distribution differs from another. This loss function has connections to an auto-encoder: the Kullback-Leibler divergence acts as a regularizer and the second term is an expected negative reconstruction error.
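To make the regularizer concrete: if the encoder is a Gaussian with diagonal covariance and the prior is a standard normal (both are assumptions for this sketch, the text above does not fix them), the Kullback-Leibler term has a well-known closed form. The function name `kl_to_standard_normal` is made up for illustration.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.

    Assumes a diagonal Gaussian encoder and a standard normal prior.
    """
    return -0.5 * sum(
        1.0 + math.log(s * s) - m * m - s * s
        for m, s in zip(mu, sigma)
    )

# The divergence is zero when the encoder output equals the prior ...
kl_same = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# ... and grows as the mean moves away from zero.
kl_shifted = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])
```

The term pulls the encoder output towards the prior, which is why it acts as a regularizer on the latent code.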
The loss function can be optimized by gradient ascent. Therefore, we need to know the gradient of $\mathcal{L}$ with respect to $\theta$ and $\phi$. This is not directly possible, since $z$ can only be obtained by sampling from $q_\phi(z|x)$ and we cannot calculate gradients through the sampling step. The idea now is to reparameterize $z$, so that it becomes a deterministic function $z = g_\phi(\epsilon, x)$ with the auxiliary variable $\epsilon \sim p(\epsilon)$ as input.
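For example, in the Gaussian case $z \sim \mathcal{N}(\mu, \sigma^2)$, a valid reparameterization is
$$z = \mu + \sigma \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)$$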
$\mu$ and $\sigma$ are parameterized functions of $x$, for example neural networks with parameters $\phi$. Neural networks have the advantage of being very flexible.
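A small sketch shows why the Gaussian reparameterization $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ helps. Once the randomness sits in $\epsilon$, the chain rule gives Monte Carlo estimates of the gradients with respect to $\mu$ and $\sigma$. The test function $f(z) = z^2$ and the name `reparam_gradients` are assumptions chosen for illustration. For this $f$ the exact gradients of $E[f(z)] = \mu^2 + \sigma^2$ are $2\mu$ and $2\sigma$, so the estimates can be checked against them.

```python
import random

def reparam_gradients(mu, sigma, f_prime, num_samples, rng):
    """Estimate d/dmu and d/dsigma of E[f(z)] with z = mu + sigma * eps."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # auxiliary noise, independent of mu and sigma
        z = mu + sigma * eps        # deterministic in mu and sigma given eps
        df_dz = f_prime(z)
        grad_mu += df_dz            # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

rng = random.Random(0)
# f(z) = z^2, so f'(z) = 2z; the exact gradients are 2*mu = 2.0 and 2*sigma = 1.0
g_mu, g_sigma = reparam_gradients(1.0, 0.5, lambda z: 2.0 * z, 20000, rng)
```

In a real implementation an automatic differentiation library would do the chain rule step, and $f$ would be the reconstruction term of the lower bound.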
Today there are libraries available that compute the derivatives of a given function automatically, e.g. Theano or TensorFlow. That is very useful, since deriving gradients manually is a very time-consuming and error-prone task.
You can have a look at my implementation of AEVB. The code is available on GitHub. I have tested it with the MNIST dataset, which consists of images of handwritten digits from zero to nine. The parameters were chosen rather arbitrarily and were not optimized. The optimization algorithm is plain old stochastic gradient descent (SGD). Furthermore, I had some problems with the stability of the algorithm, but nevertheless I could produce some good-looking results.
AEVB can be used as a dimensionality reduction algorithm. In Figure 2 you can see a 2D feature space. Different colors correspond to different digits. That means AEVB manages to cluster the same digits and separate different digits, to a certain degree.
We can also generate new samples from the model, which can be seen in Figure 2. The images are similar to actual MNIST images, but a little bit blurry.
Overall, I am sure that the algorithm can be enhanced by optimizing the hyperparameters and using better optimizers like AdaGrad, RMSProp or Adam. The main point of this implementation is to better understand the idea and principles of AEVB and variational inference in general. That is something I have definitely achieved.