The goal is to generate captions describing a given image in different styles; see the figure below for an example.
- An encoder-decoder model for generating semantically relevant styled captions is proposed.
- First, the model maps the image into a semantic term representation via the term generator; the language generator then uses these terms to generate a caption in the target style. This is illustrated in Figure 2.
- The lower left of Figure 2 describes the term generator, which takes an image as input, extracts features using a CNN and then generates an ordered term sequence summarising the image semantics.
- The upper right of Figure 2 describes the language generator, which takes the term sequence as input, encodes it with an RNN (recurrent neural network), and then decodes it into natural language with a specific style using an attention-based RNN decoder.
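The attention step can be sketched with plain dot-product attention: at each decoding step, the decoder state scores every encoder state, and a softmax over the scores yields a weighted context vector. This is an illustrative sketch only; the notes do not specify the exact scoring function the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(decoder_state, encoder_states):
    """Dot-product attention: score each encoder state against the
    current decoder state, softmax the scores, and return the
    weighted-sum context vector."""
    scores = encoder_states @ decoder_state      # (T,) one score per term
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over the T steps
    context = weights @ encoder_states           # (H,) weighted sum
    return context, weights

T, H = 6, 8                                      # toy sizes: 6 terms, hidden 8
encoder_states = rng.normal(size=(T, H))         # RNN encodings of the terms
decoder_state = rng.normal(size=H)               # current decoder hidden state
context, weights = attention(decoder_state, encoder_states)
```

The context vector is then combined with the decoder state to predict the next styled word.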
- In particular, the image feature is extracted from the second-to-last layer of the Inception-v3 CNN pre-trained on ImageNet. This feature passes through a densely connected layer and is then fed as input to an RNN with Gated Recurrent Unit (GRU) cells.
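This pipeline (CNN feature, dense projection, GRU) can be sketched as below. The 2048-dimensional feature size matches the Inception-v3 pooling layer; the hidden size, weight scales, and the choice of seeding the GRU's initial state with the projected feature are toy assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
FEAT, H = 2048, 16          # Inception-v3 pool feature size; toy hidden size

# Dense layer mapping the CNN feature into the GRU hidden space.
W_img = rng.normal(scale=0.01, size=(H, FEAT))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, P):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h)
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h)
    h_tilde = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h))
    return (1 - z) * h + z * h_tilde

P = {k: rng.normal(scale=0.1, size=(H, H))
     for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

image_feat = rng.normal(size=FEAT)   # stand-in for the extracted CNN feature
h0 = np.tanh(W_img @ image_feat)     # projected feature seeds the RNN state
x0 = rng.normal(size=H)              # toy embedding of a start token
h1 = gru_cell(x0, h0, P)             # first decoding step toward a term
```

In the real model, each GRU step emits a distribution over the semantic-term vocabulary, producing the ordered term sequence.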
- Training uses existing image-caption datasets containing only factual descriptions, plus a large corpus of styled text without aligned images. To make this possible, a two-stage training strategy for the term generator and the language generator is developed.
- For training the term generator, the ground truth semantic sequence for each image is constructed from the corresponding ground truth descriptive captions. The loss function used is the mean categorical cross entropy over semantic terms.
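The mean categorical cross entropy over a term sequence is the average negative log-probability assigned to each ground-truth term. A minimal sketch, with a toy 4-term vocabulary:

```python
import numpy as np

def mean_categorical_cross_entropy(probs, targets):
    """Average negative log-probability of the ground-truth term at
    each position of the predicted term sequence.

    probs:   (T, V) predicted distributions over the term vocabulary
    targets: (T,)   ground-truth term indices
    """
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked)))

# Toy example: 3 sequence positions over a 4-term vocabulary.
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.20, 0.60, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
targets = np.array([0, 1, 3])
loss = mean_categorical_cross_entropy(probs, targets)  # ≈ 0.7513
```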
- To create training data for the language generator, each training sentence is mapped to a semantic sequence following the steps in Section 3.1. The loss function is the categorical cross entropy.
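The notes do not reproduce the Section 3.1 mapping rules, so the following is a purely hypothetical placeholder: it lowercases a sentence, drops a few function words, and keeps word order, just to show the shape of a sentence-to-term-sequence transform. The stopword list and rules are invented for illustration.

```python
# Hypothetical stand-in for the Section 3.1 mapping; the real rules
# are not given in these notes.
STOPWORDS = {"a", "an", "the", "is", "are", "on", "in", "with", "and"}

def to_term_sequence(sentence):
    """Toy mapping: lowercase, strip the period, drop function words,
    and keep the remaining words in order as the 'semantic terms'."""
    tokens = sentence.lower().replace(".", "").split()
    return [t for t in tokens if t not in STOPWORDS]

terms = to_term_sequence("A dog is playing with a ball in the park.")
# terms == ["dog", "playing", "ball", "park"]
```

Each (term sequence, original sentence) pair then serves as an input-output training example for the language generator.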