In the vanilla GAN, we have no control over the features of the generated images. For example, a vanilla GAN trained on a collection of celebrity images can generate new, realistic celebrity pictures. However, we cannot ask it for a celebrity image with specific attributes, such as a male celebrity with a thick beard who is wearing glasses. This post is about a GAN architecture called InfoGAN, which helps us gain such control over the attributes of the generated images.
Although a traditional GAN cannot generate images with desired attributes on demand, it is well known that its noise vector inputs inherently capture some structure of the generated images. The following exercise confirms that such hidden structure exists: arithmetic on the noise vectors of male and female celebrity images turns a male image with glasses into a female image with glasses.
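The arithmetic can be sketched as follows. This is a minimal illustration, not the code behind the exercise: the noise dimension `DIM`, the three groups of vectors, and the helper names are all hypothetical, and the vectors are random placeholders. In practice each group would hold noise vectors whose generated images visibly show the named attributes, and the result `z_new` would be passed to the trained generator.

```python
import random

DIM = 8  # hypothetical noise dimension, chosen small for readability


def vec_add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def vec_sub(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sample_noise():
    """Draw one noise vector from a standard normal distribution."""
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]


random.seed(0)

# Placeholder groups: in a real experiment these are noise vectors picked by
# inspecting the images the generator produces for them.
male_glasses = [sample_noise() for _ in range(3)]
male_plain = [sample_noise() for _ in range(3)]
female_plain = [sample_noise() for _ in range(3)]

# "male with glasses" - "male" + "female" -> expected "female with glasses"
glasses_direction = vec_sub(mean_vector(male_glasses), mean_vector(male_plain))
z_new = vec_add(glasses_direction, mean_vector(female_plain))
```

Averaging a few vectors per group is an assumption made here to reduce the influence of any single sample; a single vector per group would follow the same arithmetic.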
- The objective of InfoGAN is to capture the correlations between the inputs of the generator and the attributes of the generated images in a systematic way.
- This is done by introducing a latent code which is fed as input to the generator along with the noise vector. This latent code is expected to capture the structure of the generated images.
- This is achieved by adding a regularisation term to the GAN loss function: the mutual information between the latent code and the generated image.
- The architecture is shown below. The generator takes two inputs, namely the latent code and the noise vector. The discriminator produces two outputs: a) the probability of the image being real, and b) the likelihood of the latent code given the image.
- The likelihood output of the discriminator is used in the InfoGAN loss function to approximate the mutual information between the latent code and the generated image.
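To make the regularisation term concrete: in the original paper the mutual information I(c; G(z, c)) is not computed directly but bounded from below by E[log Q(c | x)] + H(c), where Q is the likelihood output of the discriminator and H(c) is the entropy of the code prior. The sketch below evaluates that bound for a single categorical code and combines it with a generator loss. The function names, the weight `lam`, and the choice of the non-saturating adversarial loss are assumptions made for illustration, not details taken from this post.

```python
import math


def categorical_entropy(prior):
    """Entropy H(c) of a categorical prior, in nats."""
    return -sum(p * math.log(p) for p in prior if p > 0)


def mutual_info_lower_bound(q_probs, codes, prior):
    """Batch estimate of E[log Q(c | x)] + H(c).

    q_probs[i][k] is the likelihood head's probability that image i was
    generated from code value k; codes[i] is the code value actually used.
    Assumes Q assigns non-zero probability to every code that was used.
    """
    mean_log_q = sum(math.log(q[c]) for q, c in zip(q_probs, codes)) / len(codes)
    return mean_log_q + categorical_entropy(prior)


def infogan_generator_loss(d_fake, q_probs, codes, prior, lam=1.0):
    """Adversarial loss minus lam times the mutual information bound.

    d_fake[i] is the discriminator's probability that generated image i is
    real. The adversarial part is the non-saturating form, -E[log D(G(z, c))].
    """
    adversarial = -sum(math.log(d) for d in d_fake) / len(d_fake)
    return adversarial - lam * mutual_info_lower_bound(q_probs, codes, prior)


# Two-valued code with a uniform prior, batch of two generated images.
prior = [0.5, 0.5]
codes = [0, 1]
perfect_q = [[1.0, 0.0], [0.0, 1.0]]     # Q recovers the code exactly
uninformed_q = [[0.5, 0.5], [0.5, 0.5]]  # Q learns nothing about the code

bound_perfect = mutual_info_lower_bound(perfect_q, codes, prior)
bound_uninformed = mutual_info_lower_bound(uninformed_q, codes, prior)
```

When Q recovers the code exactly, the bound reaches H(c), its maximum; when Q ignores the image, the bound drops to zero. Minimising the generator loss therefore pushes the generator to produce images from which the code can be read back.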
The results of InfoGAN trained on the MNIST handwritten digit dataset with a latent code c of length three are shown below. The three elements of the code are hoped to capture the number, thickness, and rotation of the digit, respectively. Varying each element of the latent code produces a meaningful variation in the generated images.
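A sweep over one element of the code can be sketched as follows. The noise is held fixed while a single code element changes, so any difference between the generated images comes from that element alone. The noise dimension, the value ranges, and the helper names are hypothetical; the trained generator that would consume these inputs is not shown.

```python
import random


def sweep_code(base_code, index, values):
    """Return copies of base_code with element `index` set to each value."""
    codes = []
    for value in values:
        code = list(base_code)
        code[index] = value
        codes.append(code)
    return codes


def generator_input(noise, code):
    """Concatenate the noise vector and the latent code into one input."""
    return list(noise) + list(code)


random.seed(0)
NOISE_DIM = 4  # hypothetical, kept small for readability
noise = [random.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]

# Assumed layout of the length-three code: [number, thickness, rotation].
base_code = [3, 0.0, 0.0]
rotation_values = [-1.0, -0.5, 0.0, 0.5, 1.0]  # assumed range for rotation

# One generator input per rotation value; noise, number, and thickness fixed.
inputs = [generator_input(noise, code)
          for code in sweep_code(base_code, 2, rotation_values)]
```

Feeding each entry of `inputs` to the generator would give one row of the kind of result grid described above, here for the rotation element.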
The images in this post are borrowed from the following blogs and paper, which provide more detailed descriptions of InfoGAN:
3. Original paper: https://arxiv.org/pdf/1606.03657.pdf