Paper: https://arxiv.org/pdf/1808.06601.pdf

Code: https://github.com/NVIDIA/vid2vid

Problem Definition

The objective is to learn a mapping function that converts a sequence of source images s into an output sequence x̃, such that the conditional distribution of x̃ given s is identical to the conditional distribution of x given s, where x is the sequence of real images corresponding to s.
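
In the paper's notation (with s_1^T the source frames, x_1^T the corresponding real frames, and x̃_1^T the generated frames), the goal can be written as matching conditional distributions:

```latex
% Video-to-video synthesis as conditional distribution matching:
% the generated video should be indistinguishable from the real one
% given the same source sequence.
p\left(\tilde{x}_1^{T} \mid s_1^{T}\right) = p\left(x_1^{T} \mid s_1^{T}\right)
```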

Input: Sequence of source frames

Output: Photorealistic representation of source frames

Architecture

The generator architecture is shown below.

[Figure: vid2vid generator architecture]

The generator takes the current source image and previously generated frames as input and produces an intermediate (hallucinated) frame together with an optical-flow map. The flow map is used to warp the previous frame, and a soft occlusion mask blends the warped frame with the intermediate frame to produce the final frame. The final frame is then fed back as an input when generating the next frame.
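
A minimal sketch of this composition step in PyTorch (assuming the flow is given in pixel offsets and the occlusion mask is a soft map in [0, 1]; the function names are illustrative, not the repo's actual API):

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Warp `frame` (N, C, H, W) with a dense flow field `flow`
    (N, 2, H, W) given in pixel offsets, via bilinear sampling."""
    n, _, h, w = frame.shape
    # Base grid of pixel coordinates, channel 0 = x, channel 1 = y.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow                             # (N, 2, H, W)
    # grid_sample expects sampling locations normalized to [-1, 1].
    cx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    cy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(frame, torch.stack((cx, cy), dim=3),
                         align_corners=True)

def compose_frame(hallucinated, prev_frame, flow, mask):
    """Blend the hallucinated frame with the flow-warped previous frame
    using a soft occlusion mask (mask = 1 means use the hallucinated pixel)."""
    warped_prev = warp(prev_frame, flow)
    return mask * hallucinated + (1.0 - mask) * warped_prev
```
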
Two discriminators are used: a) a conditional image discriminator, which takes source maps paired with output frames and is applied at multiple scales; b) a conditional video discriminator, which takes flow maps and neighbouring frames as input to ensure temporal consistency.
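
As a rough sketch of how this conditioning could be wired, both discriminators can be fed channel-wise concatenations of their inputs (the usual convention for conditional GAN discriminators; the exact tensor layout in the released code may differ):

```python
import torch

def image_disc_input(source_map, output_frame):
    # Condition the image discriminator on the source (e.g. a segmentation
    # map) by concatenating it with the generated frame along channels.
    return torch.cat([source_map, output_frame], dim=1)  # (N, Cs + Cx, H, W)

def video_disc_input(frames, flows):
    # Condition the video discriminator on K consecutive frames plus the
    # K - 1 flow maps between them, stacked along the channel axis.
    return torch.cat(list(frames) + list(flows), dim=1)
```
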

Datasets

  • Cityscapes: videos and their corresponding segmentation masks for each frame.

  • Apolloscape: similar to Cityscapes.

  • FaceForensics: videos of news briefings from different reporters. Face sketches are obtained using standard computer vision techniques.

Key Points

  • Spatio-temporally progressive training: The network is first trained on low-resolution images; a coarse-to-fine generator then produces high-resolution output by building on the low-resolution network's features. Temporally, training starts with a couple of frames and the sequence length is progressively increased. Spatial and temporal growth are alternated.
  • Markov assumption: simplifies the problem and allows frames to be generated sequentially (see the factorization after this list).
  • The estimated optical flow is used to warp the current frame into an estimate of the next frame.
  • To prevent mode collapse, two conditional discriminator networks are used:
    • Conditional image discriminator: ensures that each output frame resembles a real image given the corresponding source image.
    • Conditional video discriminator: ensures that consecutive output frames reproduce the temporal dynamics of a real video given the same optical flow. Paired with the objective function, this discriminator enforces temporal coherence across consecutive frames of the video.
  • Loss function: a multi-task loss is used, combining the image GAN loss, the video GAN loss, and a flow estimation loss (sketched after this list).
  • Foreground-background prior: when using semantic segmentation masks, the image is divided into foreground and background based on semantics. For example, buildings and roads belong to the background, while cars and pedestrians are foreground. Incorporating this prior into the objective function improves performance.
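
The Markov factorization referenced above: each output frame is conditioned only on the past L generated frames and the corresponding source frames (the paper uses L = 2):

```latex
% Sequential factorization under the Markov assumption.
p\left(\tilde{x}_1^{T} \mid s_1^{T}\right)
    = \prod_{t=1}^{T} p\left(\tilde{x}_t \,\middle|\, \tilde{x}_{t-L}^{\,t-1},\; s_{t-L}^{\,t}\right)
```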
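
And the multi-task objective: the generator F is trained against the image discriminator D_I and the video discriminator D_V, with a flow estimation loss L_W weighted by λ_W:

```latex
% Minimax objective combining the image GAN loss, the video GAN loss,
% and a weighted flow estimation loss.
\min_{F} \left( \max_{D_I} \mathcal{L}_I(F, D_I)
              + \max_{D_V} \mathcal{L}_V(F, D_V) \right)
      + \lambda_W \mathcal{L}_W(F)
```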