Problem Definition

The objective is to estimate the depth of each pixel in a given monocular image.

Input: Monocular Image

Output: Right and Left disparity maps
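
The link between the disparity output and the depth objective is the standard rectified-stereo relation depth = f * b / d. A minimal sketch of that conversion follows, assuming the disparity is expressed in pixels; the focal length and baseline values are hypothetical placeholders, not taken from the source.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map in pixels to depth in metres: depth = f * b / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    # Clamp tiny disparities so that far-away pixels do not divide by zero.
    return focal_px * baseline_m / np.maximum(disparity, eps)

# Hypothetical camera: 720 px focal length, 0.54 m baseline.
disp = np.array([[36.0, 72.0], [18.0, 9.0]])
depth = disparity_to_depth(disp, focal_px=720.0, baseline_m=0.54)
```

If a network emits disparity as a fraction of the image width, it would need to be multiplied by the width in pixels before this conversion.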



  • The StackNet consists of two sub-networks: 1) SimpleNet and 2) Residual Net.
  • The SimpleNet has an encoder-decoder architecture and generates 4 pairs of disparity maps at 4 different resolutions.
  • The Residual Net’s inputs are the left image, the disparity map, the reconstructed left image, the reconstructed right image, and the l1 reconstruction error between the left image and the reconstructed left image.
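
The Residual Net inputs listed above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes single-channel images, positive disparities in pixels, a right image reconstructed by sampling the left image at x + d_right, and a left image reconstructed by sampling that result at x - d_left. The names `warp_horizontal` and `residual_net_inputs` are hypothetical.

```python
import numpy as np

def warp_horizontal(image, shift):
    """Sample image at (x + shift[y, x], y) using linear interpolation."""
    h, w = image.shape
    xs = np.arange(w, dtype=np.float64)[None, :] + shift
    xs = np.clip(xs, 0.0, w - 1.0)           # clamp samples to the image border
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * image[rows, x0] + frac * image[rows, x1]

def residual_net_inputs(left, disp_left, disp_right):
    """Stack the five inputs described above along a leading channel axis."""
    # Assumed sign convention: right pixel x corresponds to left pixel x + d.
    right_recon = warp_horizontal(left, disp_right)
    left_recon = warp_horizontal(right_recon, -disp_left)
    l1_error = np.abs(left - left_recon)     # per-pixel l1 reconstruction error
    return np.stack([left, disp_left, left_recon, right_recon, l1_error], axis=0)

# Tiny example: a horizontal intensity ramp with a constant 1 px disparity.
left = np.tile(np.arange(6, dtype=np.float64), (3, 1))
inputs = residual_net_inputs(left, np.ones_like(left), np.ones_like(left))
```

Here the left disparity is used as the single "disparity map" channel, which is an assumption; the notes do not say which map is passed in.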


  • KITTI Vision Benchmark: a dataset with various computer vision benchmarks related to autonomous driving.

  • Cityscapes: a large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes, with high-quality pixel-level annotations of 5 000 frames in addition to a larger set of 20 000 weakly annotated frames.

Key Points

  • The StackNet takes the left monocular image as the input and predicts both right and left disparity maps in a virtual baseline stereo setup.
  • For every new keyframe, the depth of the reconstructed points is initialized using the left disparity. When optimizing in the keyframe window, given a virtual stereo baseline, the estimated points are re-projected onto a virtual right stereo image. With the right disparity, the points are then warped back to the left stereo image.
  • The total photometric error, which combines this photo-consistency term with the temporal multi-view term, is then optimized.
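
The virtual stereo round trip described above can be sketched for a single point. This is a simplified illustration under stated assumptions: a rectified virtual stereo pair, positive disparities in pixels, a right-image pixel x corresponding to left-image pixel x + d, and a plain intensity difference as the residual (no robust norm or weighting). The function names and the camera values are hypothetical.

```python
import numpy as np

def init_depth(disp_left_px, focal_px, baseline_m):
    """Initialize a point's depth from the predicted left disparity."""
    return focal_px * baseline_m / disp_left_px

def virtual_stereo_residual(left, disp_right, x, y, depth, focal_px, baseline_m):
    """Photo-consistency residual of one left-image point at integer (x, y)."""
    cols = np.arange(left.shape[1])
    # 1) Re-project the point into the virtual right image using its depth.
    x_right = x - focal_px * baseline_m / depth
    # 2) Warp back to the left image using the right disparity at that location.
    x_back = x_right + np.interp(x_right, cols, disp_right[y])
    # 3) Compare the intensity at the warped location with the original pixel.
    return np.interp(x_back, cols, left[y]) - left[y, x]

# Hypothetical setup: ramp image, constant 2 px disparity, f * b = 50.
left = np.tile(np.arange(10, dtype=np.float64), (2, 1))
disp_right = np.full_like(left, 2.0)
depth = init_depth(2.0, focal_px=100.0, baseline_m=0.5)
r_good = virtual_stereo_residual(left, disp_right, 5, 0, depth, 100.0, 0.5)
r_bad = virtual_stereo_residual(left, disp_right, 5, 0, depth / 2, 100.0, 0.5)
```

When the depth agrees with the predicted disparities, the point lands back on itself and the residual vanishes; an inconsistent depth lands elsewhere and produces a non-zero residual, which is what the optimization would penalize alongside the temporal multi-view term.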