The Generative Adversarial Network (GAN) is one of the most interesting and popular classes of generative models in deep learning. GANs are finding applications in a variety of fields ranging from autonomous vehicles to medical imaging. Since their inception in 2014, there have been many improvements to their architectures. This post covers some of these important architectures.

Illustration of celebrity images generated by one of the recent GAN architectures, Progressive GAN

The beginning of GANs

Ian Goodfellow's paper, published in 2014, was the first to introduce the GAN architecture. It consists of two neural networks, a generator and a discriminator, which are trained together while competing against each other. The generator is trained to fool the discriminator by creating fake images that look realistic, and the discriminator is trained not to be fooled by the generator.

Basic architecture and training overview of a GAN

The basic architecture is shown in the figure above. First, the generator takes a noise vector as input and generates an image by upsampling it. The discriminator is fed these generated images along with some real images from the dataset, and it learns to distinguish between them. Through backpropagation, the generator receives feedback from the discriminator on the quality of its images, which it uses to create more realistic ones. For further details, here is a link to a blog post and an implementation.
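To make this concrete, here is a minimal sketch of one training step in PyTorch (not tied to any particular implementation). It assumes a `generator` and a `discriminator` network with their optimizers defined elsewhere, and that the discriminator ends in a sigmoid producing one probability per image:

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_images, noise_dim=100):
    batch_size = real_images.size(0)
    device = real_images.device
    real_labels = torch.ones(batch_size, 1, device=device)   # target for real images
    fake_labels = torch.zeros(batch_size, 1, device=device)  # target for fakes

    # --- Train the discriminator: real images -> 1, generated images -> 0 ---
    d_opt.zero_grad()
    noise = torch.randn(batch_size, noise_dim, device=device)
    fake_images = generator(noise)
    # detach() stops this loss from updating the generator
    d_loss = (F.binary_cross_entropy(discriminator(real_images), real_labels)
              + F.binary_cross_entropy(discriminator(fake_images.detach()), fake_labels))
    d_loss.backward()
    d_opt.step()

    # --- Train the generator: try to make the discriminator say "real" on fakes ---
    g_opt.zero_grad()
    g_loss = F.binary_cross_entropy(discriminator(fake_images), real_labels)
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```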

The evolution of GANs

Following the original GAN paper, there have been a number of papers on GANs that either address some of their limitations or modify their architecture to suit new applications. In this post, we will look at the following three aspects in which GANs were improved:

  1. While GANs generate new realistic images, how do we gain some control over the nature of the images being generated? For example, if we have a labelled set of training images, can we generate an image of a particular label?
  2. GANs are known to suffer from stability issues in training. Further, it is not easy to tell whether a GAN has converged by looking at the loss value, because there does not seem to be a correlation between a low loss and the quality of the generated images. How can these training issues be addressed?
  3. GANs take a noise vector as input and give a new image as output. Can we modify the architecture of GANs to do image-to-image translation instead of noise-to-image translation?

1. Gaining some control over generated images

(a) Conditional GAN: If we have labelled training data, this architecture helps us generate specific images from the generator. In this architecture, we have some conditional information Y that describes some aspect of the data. Most commonly, the class label is given as the conditional information. For example, if we are dealing with the MNIST dataset, Y could describe the digit the image represents. This attribute information is then fed into both the generator and the discriminator. Once training is done, if we want to generate an image of a particular label, we can simply feed the generator the corresponding label as Y. The basic architecture is in the figure below. For further details, refer to this blog post or the original paper. A PyTorch implementation can be found here.

Architecture of Conditional GAN
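As an illustration of how the label Y can be injected, here is a hypothetical sketch of a conditional generator for MNIST-like data: the class label is embedded and concatenated with the noise vector (the layer sizes here are arbitrary, not from any particular paper):

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=100, num_classes=10, img_dim=28 * 28):
        super().__init__()
        # Learn a vector embedding for the class label Y
        self.label_embedding = nn.Embedding(num_classes, num_classes)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + num_classes, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, noise, labels):
        # Condition the generator by concatenating noise and label embedding
        x = torch.cat([noise, self.label_embedding(labels)], dim=1)
        return self.net(x)

# After training, to ask for a specific digit (e.g. a 7):
# img = generator(torch.randn(1, 100), torch.tensor([7]))
```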

(b) InfoGAN: What if we do not have labelled data? Can we still get some control over the generated images? Yes: an architecture called InfoGAN, which can be viewed as an unsupervised counterpart of the conditional GAN, works with unlabelled data. It learns the hidden correlations between the noise inputs of the GAN and the corresponding generated images. This reference post has more specific details about the architecture.
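Very roughly, InfoGAN splits the generator input into unstructured noise z and a structured latent code c, and adds an auxiliary network Q that tries to recover c from the generated image; training maximises a lower bound on the mutual information between c and the image, which ties c to interpretable factors. A toy sketch of that extra loss term for a categorical code, with `q_head` a hypothetical network defined elsewhere:

```python
import torch
import torch.nn.functional as F

def infogan_mutual_info_loss(q_head, fake_images, code_labels):
    # q_head predicts logits over the categorical latent code from the image
    logits = q_head(fake_images)
    # Cross-entropy between predicted and sampled codes gives (the negative of)
    # a variational lower bound on the mutual information; both the generator
    # and Q are trained to minimise it, alongside the usual adversarial loss
    return F.cross_entropy(logits, code_labels)
```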

(c) TL-GAN: The Transparent Latent-space GAN is another interesting architecture that deals with similar techniques; however, it is applicable to labelled data. This blog post has an interesting demo of this GAN, which generates celebrity images with specific required attributes, as shown below.

An interactive demo based on TL-GAN

2. Addressing the training issues of GANs

(a) Wasserstein GAN (WGAN): This paper introduces a new loss function for GANs based on a metric called the Wasserstein distance. Using this loss results in two improvements: (i) the new loss function correlates with image quality, i.e., a lower loss corresponds to better generated images, which was missing in the original version of GANs; (ii) it mitigates training stability issues such as mode collapse. For further details, refer to this blog post.

Loss function plot of WGAN. The lower the loss, the better the image quality.
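To give a flavour of the change, here is a hedged sketch of one WGAN critic update in PyTorch: the binary cross-entropy is replaced by the difference of critic scores, and (in the original paper) the critic weights are clipped to enforce the required Lipschitz constraint:

```python
import torch

def wgan_critic_step(critic, generator, c_opt, real_images, noise_dim=100, clip=0.01):
    c_opt.zero_grad()
    noise = torch.randn(real_images.size(0), noise_dim, device=real_images.device)
    fake_images = generator(noise).detach()
    # Wasserstein critic loss: score fakes low and reals high
    loss = critic(fake_images).mean() - critic(real_images).mean()
    loss.backward()
    c_opt.step()
    # Weight clipping: the original WGAN's blunt way to keep the critic Lipschitz
    for p in critic.parameters():
        p.data.clamp_(-clip, clip)
    return loss.item()
```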

A few variants of the Wasserstein GAN, such as WGAN-GP, have also been proposed, which further improve training stability. In fact, the celebrity images shown at the beginning of this post were based on WGAN-GP.
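WGAN-GP replaces the weight clipping shown above with a penalty on the critic's gradient norm at points interpolated between real and generated samples. A sketch of that penalty term (assuming 4-D image batches of shape B×C×H×W):

```python
import torch

def gradient_penalty(critic, real_images, fake_images, lambda_gp=10.0):
    # Random interpolation between real and fake samples
    alpha = torch.rand(real_images.size(0), 1, 1, 1, device=real_images.device)
    interpolated = (alpha * real_images + (1 - alpha) * fake_images).requires_grad_(True)
    scores = critic(interpolated)
    grads = torch.autograd.grad(outputs=scores, inputs=interpolated,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    # Penalise deviation of the gradient norm from 1 (soft Lipschitz constraint)
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()
```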

(b) Boundary Equilibrium GAN (BEGAN): Roughly speaking, this paper stabilises the training process and improves the quality of generated images by making two significant changes compared to traditional GAN architectures.

(i) It uses an autoencoder as the discriminator instead of a classifier. In this architecture, the objective of the discriminator is to minimise the reconstruction loss of real training data and to maximise the reconstruction loss of generated data passed through its autoencoder. The objective of the generator is to produce data that has a low reconstruction error when passed through the discriminator. While the traditional GAN tries to match the distributions of training and generated data, this network tries to match the distributions of the reconstruction losses (when passed through the discriminator) of training data and generated data. (ii) It introduces a mechanism to balance the learning between the generator and the discriminator to ensure a stable training process.

Boundary Equilibrium GAN architecture

More specific details about the training loss and the balancing mechanism can be found in this blog post, which summarises the architecture.
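As a rough sketch of these two ideas (with `ae_discriminator` a hypothetical autoencoder that returns reconstructions of its input), the per-step losses and the equilibrium update of the balancing weight k might look like this:

```python
import torch

def began_losses(ae_discriminator, real_images, fake_images, k, gamma=0.5, lambda_k=0.001):
    # L1 reconstruction error of the autoencoder discriminator
    def recon_loss(x):
        return (ae_discriminator(x) - x).abs().mean()

    loss_real = recon_loss(real_images)
    loss_fake = recon_loss(fake_images.detach())
    d_loss = loss_real - k * loss_fake   # reconstruct reals well, fakes badly
    g_loss = recon_loss(fake_images)     # generator wants fakes reconstructed well

    # Equilibrium mechanism: nudge k so that loss_fake tracks gamma * loss_real,
    # keeping the generator and discriminator in balance
    k = min(max(k + lambda_k * (gamma * loss_real - loss_fake).item(), 0.0), 1.0)
    return d_loss, g_loss, k
```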

3. GANs for style transfer (image-to-image translation)

While the conventional GAN architecture takes a noise vector as input and converts it into an image, the following GAN architectures take an image in one domain and transform it into an image in another domain. Look at the figures below for a few examples.

Translates a zebra into a horse while retaining everything else in the input image
Generates an image from its segmentation map.

There are primarily two types of architectures for this kind of image-to-image translation, depending on the type of training data available. If we have paired training data, i.e., a set of pairs {(x, y)} where x is an image in the source domain and y is the corresponding image in the target domain, then a GAN architecture called Pix2Pix can be used. If we do not have a paired dataset, architectures like CycleGAN can be used.

(a) Pix2Pix: A conditional GAN based architecture is used, which tries to learn the joint distribution of the paired data. The generator is based on a U-Net architecture. The discriminator, called a PatchGAN discriminator, classifies individual patches of the image as "real vs. fake" instead of classifying the entire image. The authors argue that this helps ensure sharp high-frequency detail in the generated images. Finally, in addition to the usual adversarial loss of GANs, the objective includes an L1 distance term between the generated image and the target image. More specific details can be found in this blog post.
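As a hedged sketch, the generator objective might be assembled as follows, assuming a PatchGAN `discriminator` that outputs a grid of per-patch logits for the concatenated (input, output) pair; the paper weights the L1 term with λ = 100:

```python
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(discriminator, source, generated, target, lambda_l1=100.0):
    # The conditional PatchGAN discriminator scores each patch of the
    # (input, output) pair, concatenated along the channel dimension
    patch_scores = discriminator(torch.cat([source, generated], dim=1))
    adversarial = F.binary_cross_entropy_with_logits(
        patch_scores, torch.ones_like(patch_scores))
    # L1 term pulls the output towards the paired ground-truth image
    l1 = F.l1_loss(generated, target)
    return adversarial + lambda_l1 * l1
```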

(b) CycleGAN: This architecture works even when there is no paired training data. All we need is a set of images in the source domain and a set of images in the target domain; no one-to-one correspondence between the images is required. The key idea is to use two generators, where the first translates a source-domain image into the target domain, and the second tries to reconstruct the source-domain image from the translated image. An L1 reconstruction (cycle-consistency) loss between the source image and its reconstruction is added to the usual adversarial loss of the GANs. The specific details of this architecture can be found in this reference post.
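Here is a sketch of the cycle-consistency term for two hypothetical generators `gen_xy` (source to target) and `gen_yx` (target to source); the adversarial losses for the two discriminators are computed as in a standard GAN and added to this. The paper weights the cycle term with λ = 10:

```python
import torch.nn.functional as F

def cycle_consistency_loss(gen_xy, gen_yx, real_x, real_y, lambda_cyc=10.0):
    # Forward cycle: x -> G(x) -> F(G(x)) should reconstruct x
    forward_cycle = F.l1_loss(gen_yx(gen_xy(real_x)), real_x)
    # Backward cycle: y -> F(y) -> G(F(y)) should reconstruct y
    backward_cycle = F.l1_loss(gen_xy(gen_yx(real_y)), real_y)
    return lambda_cyc * (forward_cycle + backward_cycle)
```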

Further Reading

  1. This blog post contains detailed summaries of different GAN architectures and their evolution history.
  2. This is a nice collection of articles related to GANs and their applications.