Paper: https://arxiv.org/pdf/1812.04948.pdf

Code: Coming soon

Problem Definition

The goal is to generate high-quality synthetic human faces with control over various features (pose, identity, hair, etc.) while also addressing the feature disentanglement problem.

Architecture

[Figure: StyleGAN architecture]

  • There are three sub-networks: the Mapping Network, the Synthesis Network, and the ProGAN discriminator.
  • The Mapping Network transforms the random vector into an intermediate vector whose different elements control different visual features.
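  • The Synthesis Network turns the intermediate vector into an image. At each resolution level, the intermediate vector sets per-channel styles through adaptive instance normalization (AdaIN), and scaled noise is added to each channel. The noise changes the visual expression of the output depending on the resolution level at which it is added.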
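
To make one synthesis level more concrete, here is a minimal NumPy sketch of adding per-channel scaled noise and then applying AdaIN with a style derived from the intermediate vector. It is an illustration under assumptions, not the authors' code: the toy sizes, the random `affine` matrix standing in for a learned affine map, and the helper names `adain` and `add_scaled_noise` are all hypothetical.

```python
import numpy as np

def adain(x, style_scale, style_bias, eps=1e-8):
    """Adaptive instance normalization for a single image.

    x:           feature maps, shape (C, H, W)
    style_scale: per-channel scale derived from w, shape (C,)
    style_bias:  per-channel bias derived from w, shape (C,)
    """
    # Normalize every channel to zero mean and unit standard deviation.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    normalized = (x - mean) / (std + eps)
    # Re-scale and shift each channel with the style.
    return style_scale[:, None, None] * normalized + style_bias[:, None, None]

def add_scaled_noise(x, noise_strength, rng):
    """Add one single-channel noise image, scaled separately per channel."""
    noise = rng.standard_normal((1,) + x.shape[1:])  # shape (1, H, W)
    return x + noise_strength[:, None, None] * noise

rng = np.random.default_rng(0)
C, H, W, W_DIM = 4, 8, 8, 16  # toy sizes chosen for illustration

w = rng.standard_normal(W_DIM)                      # stand-in for the Mapping Network output
affine = 0.1 * rng.standard_normal((2 * C, W_DIM))  # hypothetical learned affine map
style = affine @ w
style_scale, style_bias = 1.0 + style[:C], style[C:]  # scale centred on 1 (assumption)

features = rng.standard_normal((C, H, W))  # stand-in for the output of a convolution
noise_strength = np.full(C, 0.1)           # would be learned per channel in a real model

out = adain(add_scaled_noise(features, noise_strength, rng), style_scale, style_bias)
```

After AdaIN, each channel of `out` has mean `style_bias` and standard deviation close to `|style_scale|`, which is how the intermediate vector controls the statistics of each level.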

Key Points

  • This is an unconditional GAN.
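  • Instead of feeding a random vector directly into the generator, the synthesis network starts from a learned constant input, and the random vector influences the image only through the intermediate vector produced by the Mapping Network.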
  • The StyleGAN generator uses the intermediate vector at each level of the synthesis network, which might cause the network to learn that levels are correlated. To reduce the correlation, the model randomly selects two input vectors and generates an intermediate vector $w$ for each of them. It then trains some of the levels with the first and switches (at a random point) to the second for the remaining levels. The random switch ensures that the network does not learn to rely on a correlation between levels.
  • To limit the generation of uncanny images, the $w$ vector is truncated, forcing it to stay close to the "average" intermediate vector (the truncation trick).
  • To measure feature disentanglement, two new metrics were introduced: (a) Perceptual path length: measures the difference between consecutive images (via their VGG16 embeddings) when interpolating between two random inputs. Drastic changes mean that multiple features have changed together and that they might be entangled. (b) Linear separability: the ability to classify inputs into binary classes, such as male and female. The better the classification, the more separable the features.
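
The level switching and the truncation of $w$ described above can be sketched in a few lines of NumPy. This is only a rough illustration under assumptions: `toy_mapping` is a hypothetical stand-in for the Mapping Network, the sizes and the `psi` values are arbitrary, and the function names `mix_styles` and `truncate` are made up for this example.

```python
import numpy as np

def truncate(w, w_avg, psi=0.7):
    """Pull w toward the average intermediate vector; psi=1 leaves w unchanged."""
    return w_avg + psi * (w - w_avg)

def mix_styles(w1, w2, num_levels, rng):
    """Use w1 for the levels before a random crossover point and w2 afterwards."""
    crossover = int(rng.integers(1, num_levels))  # value in [1, num_levels - 1]
    per_level = np.stack([w1 if i < crossover else w2 for i in range(num_levels)])
    return per_level, crossover

def toy_mapping(z):
    """Hypothetical stand-in for the Mapping Network (not the real MLP)."""
    return np.tanh(z)

rng = np.random.default_rng(0)
W_DIM, NUM_LEVELS = 16, 9  # toy values chosen for illustration

# Estimate the "average" intermediate vector from many random inputs.
w_avg = toy_mapping(rng.standard_normal((1000, W_DIM))).mean(axis=0)

w1 = toy_mapping(rng.standard_normal(W_DIM))
w2 = toy_mapping(rng.standard_normal(W_DIM))

# One row of `styles` per synthesis level: w1 up to the crossover, then w2.
styles, crossover = mix_styles(w1, w2, NUM_LEVELS, rng)

# Halve the distance between w1 and the average vector.
w_truncated = truncate(w1, w_avg, psi=0.5)
```

With `psi=0` every sample collapses to the average vector, and with `psi=1` nothing changes, so `psi` trades variety for staying close to the well-covered region of the intermediate space.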

Comments

  • I think the reason for using the peculiarly high number of linear layers in the Mapping Network is to disentangle the latent space (which reduces uncanniness in the generated images), but the authors do not discuss this much.