Paper: https://arxiv.org/pdf/1805.07030.pdf

Code: https://github.com/matasukef/papers/issues/43

Problem Definition

The goal is to generate captions that describe a given image in different styles. See the figure below for an example.

[Figure: example of the same image captioned in different styles]

Architecture

  • An encoder-decoder model for generating semantically relevant styled captions is proposed.
  • First, the model maps the image to a semantic term representation via the term generator; the language generator then uses these terms to generate a caption in the target style. This is illustrated in Figure 2.
  • The lower left of Figure 2 describes the term generator, which takes an image as input, extracts features using a CNN and then generates an ordered term sequence summarising the image semantics (see the first sketch after this list).
  • The upper right of Figure 2 describes the language generator, which takes the term sequence as input, encodes it with an RNN (Recurrent Neural Network) and then decodes it into natural language in a specific style using an attention-based RNN decoder (see the second sketch below).
  • In particular, the image feature is extracted from the second-to-last layer of the Inception-v3 CNN pre-trained on ImageNet. It passes through a densely connected layer and is then provided as input to an RNN with Gated Recurrent Unit (GRU) cells.
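
As a concrete illustration, below is a minimal PyTorch sketch of how the term generator could be wired together: the Inception-v3 pooled feature passes through a dense layer and initialises a GRU that emits the term sequence. The module structure, layer sizes, and names (`TermGenerator`, `embed_dim`, `hidden_dim`) are illustrative assumptions, not taken from the paper or its code.

```python
import torch
import torch.nn as nn
from torchvision import models


class TermGenerator(nn.Module):
    """Sketch: image -> Inception-v3 feature -> dense layer -> GRU -> ordered semantic terms."""

    def __init__(self, term_vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        # Inception-v3 pre-trained on ImageNet; the final classifier is replaced so the
        # 2048-d pooled feature from the second-to-last layer is returned.
        cnn = models.inception_v3(weights="IMAGENET1K_V1")
        cnn.fc = nn.Identity()
        self.cnn = cnn
        self.img_proj = nn.Linear(2048, hidden_dim)          # densely connected layer
        self.term_embed = nn.Embedding(term_vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, term_vocab_size)

    def forward(self, images, term_inputs):
        # images: (B, 3, 299, 299); term_inputs: (B, T) teacher-forced term ids
        with torch.no_grad():
            self.cnn.eval()                                  # keep the CNN frozen
            feats = self.cnn(images)                         # (B, 2048) pooled feature
        h0 = torch.tanh(self.img_proj(feats)).unsqueeze(0)   # (1, B, H) initial GRU state
        emb = self.term_embed(term_inputs)                   # (B, T, E)
        hidden, _ = self.gru(emb, h0)                        # (B, T, H)
        return self.out(hidden)                              # (B, T, term_vocab_size) logits
```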

[Figure 2: overview of the term generator (lower left) and language generator (upper right)]
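
A companion sketch of the language generator under the same assumptions: a GRU encodes the term sequence, and an attention-based GRU decoder produces the styled caption one word at a time. The bilinear attention form and all dimensions are illustrative; the paper's exact attention mechanism may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageGenerator(nn.Module):
    """Sketch: term sequence -> GRU encoder -> attention-based GRU decoder -> styled caption."""

    def __init__(self, term_vocab_size, word_vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.term_embed = nn.Embedding(term_vocab_size, embed_dim)
        self.word_embed = nn.Embedding(word_vocab_size, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Decoder input is the previous word embedding concatenated with the attention context.
        self.decoder = nn.GRU(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.attn_score = nn.Linear(hidden_dim, hidden_dim)   # bilinear-style attention scores
        self.out = nn.Linear(hidden_dim, word_vocab_size)

    def forward(self, terms, caption_inputs):
        # terms: (B, S) semantic term ids; caption_inputs: (B, T) teacher-forced word ids
        enc_states, enc_last = self.encoder(self.term_embed(terms))   # (B, S, H), (1, B, H)
        dec_hidden = enc_last                                          # initialise decoder state
        logits = []
        for t in range(caption_inputs.size(1)):
            word = self.word_embed(caption_inputs[:, t])               # (B, E)
            query = dec_hidden[-1]                                     # (B, H)
            # Attend over the encoded term sequence.
            scores = torch.bmm(enc_states, self.attn_score(query).unsqueeze(2)).squeeze(2)  # (B, S)
            weights = F.softmax(scores, dim=1)
            context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)                # (B, H)
            step_in = torch.cat([word, context], dim=1).unsqueeze(1)   # (B, 1, E+H)
            dec_out, dec_hidden = self.decoder(step_in, dec_hidden)
            logits.append(self.out(dec_out.squeeze(1)))                # (B, word_vocab_size)
        return torch.stack(logits, dim=1)                              # (B, T, word_vocab_size)
```

At test time the teacher-forced word ids would be replaced by the decoder's own previous predictions (greedy or beam search); that loop is omitted here for brevity.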

Training details

  • Training uses existing image caption datasets containing only factual descriptions, plus a large collection of styled text without aligned images. To make this possible, a two-stage training strategy is developed in which the term generator and the language generator are trained separately.
  • For training the term generator, the ground-truth semantic term sequence for each image is constructed from its ground-truth descriptive captions. The loss function is the mean categorical cross entropy over semantic terms.
  • To create training data for the language generator, each training sentence is mapped to a semantic term sequence following the steps in Section 3.1. The loss function is again the categorical cross entropy (see the sketch after this list).
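
A hedged sketch of how both training stages could compute their losses, assuming padded sequences with a `PAD_ID` that is masked out; the generator calls are shown as comments because the models above are themselves only sketches.

```python
import torch
import torch.nn as nn

PAD_ID = 0                                              # assumed padding token id
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)


def sequence_xent(logits, targets):
    """Mean categorical cross entropy over all non-padding sequence positions.

    logits: (B, T, V) scores from either generator; targets: (B, T) token ids.
    """
    return criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))


# Stage 1: train the term generator on (image, term sequence) pairs built from
# the MSCOCO descriptive captions.
# term_logits = term_generator(images, term_inputs)
# loss_terms = sequence_xent(term_logits, term_targets)

# Stage 2: train the language generator on (term sequence, styled sentence) pairs
# derived from the styled text corpus alone, with no images required.
# word_logits = language_generator(term_seqs, caption_inputs)
# loss_words = sequence_xent(word_logits, caption_targets)
```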

Datasets

  • For training the term generator, the MSCOCO dataset is used, which consists of 82,783 training images and 40,504 validation images, each with 5 descriptive captions.
  • For training the language generator, the dataset consists of 1,567 romance novels from BookCorpus.