Paper: https://arxiv.org/pdf/1807.02570.pdf
Problem Definition
The objective is to estimate the depth of each pixel in a given monocular image.
Input: Monocular Image
Output: Right and Left disparity maps
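Since the network outputs disparity while the objective is depth, it may help to see how the two relate. The sketch below uses the standard rectified-stereo relation depth = focal length × baseline / disparity. The function name, the parameter names and the handling of near-zero disparities are assumptions for illustration. It also assumes the disparity is given in pixels; if a network outputs disparity normalized by the image width, it would have to be multiplied by the width first.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disp=1e-6):
    """Convert a disparity map (pixels) to a depth map (metres).

    Hypothetical helper: pixels whose disparity is at or below `min_disp`
    are treated as infinitely far away instead of dividing by zero.
    """
    disp = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disp.shape, np.inf)
    valid = disp > min_disp
    depth[valid] = focal_px * baseline_m / disp[valid]
    return depth
```

For example, with a focal length of 100 px and a baseline of 0.5 m, a disparity of 4 px corresponds to a depth of 12.5 m.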
Architecture
- StackNet consists of two sub-networks: 1) SimpleNet and 2) ResidualNet.
- SimpleNet has an encoder-decoder architecture and generates 4 pairs of disparity maps at 4 different resolutions.
- ResidualNet takes as input the left image, the disparity map, the reconstructed left image, the reconstructed right image, and the l1 reconstruction error between the left image and the reconstructed left image.
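The list of inputs to the second sub-network can be made concrete with a small NumPy sketch. It assumes that the right image is reconstructed by warping the left image with the right disparity, and that the left image is then reconstructed by warping that result back with the left disparity. The function names, the sign conventions of the warps, and the resulting 13-channel layout are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def warp_horizontal(image, disparity):
    """Sample `image` at x - disparity(y, x), interpolating linearly along rows.

    image: (H, W, C) float array; disparity: (H, W) array in pixels.
    Sample positions outside the image are clamped to the border.
    """
    h, w, _ = image.shape
    xs = np.arange(w, dtype=np.float64)[None, :] - disparity
    xs = np.clip(xs, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = (xs - x0)[..., None]
    rows = np.arange(h)[:, None]
    return (1 - frac) * image[rows, x0] + frac * image[rows, x1]

def build_residual_input(left, disp_left, disp_right):
    """Stack the five inputs listed above along the channel axis."""
    # Assumed convention: the right view samples the left image at x + d_right.
    right_recons = warp_horizontal(left, -disp_right)
    # Assumed convention: the left view samples the right view at x - d_left.
    left_recons = warp_horizontal(right_recons, disp_left)
    # Per-channel l1 reconstruction error of the left image.
    l1_error = np.abs(left - left_recons)
    return np.concatenate(
        [left, disp_left[..., None], left_recons, right_recons, l1_error],
        axis=-1,
    )
```

With an RGB image this gives 3 + 1 + 3 + 3 + 3 = 13 channels. When both disparities are zero, the reconstructions equal the input image and the error channels are all zero.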
Datasets
- KITTI Vision Benchmark: a dataset with various computer vision benchmarks related to autonomous driving.
- Cityscapes: a large-scale dataset containing a diverse set of stereo video sequences recorded in street scenes, with high-quality pixel-level annotations of 5,000 frames in addition to a larger set of 20,000 weakly annotated frames.
Key Points
- StackNet takes the left monocular image as input and predicts both the left and right disparity maps, as in a stereo setup with a virtual baseline.
- For every new keyframe, the depth of the reconstructed points is initialised using the left disparity. During optimization over the keyframe window, the estimated points are re-projected onto a virtual right stereo image using the virtual stereo baseline. With the right disparity, the points are then warped back to the left stereo image.
- The resulting virtual stereo photo-consistency term is combined with the temporal multi-view term, and the total photometric error is optimized.
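The re-project-and-warp-back step in the key points can be sketched for a single point on one image row. This is a simplified illustration under stated assumptions: the images are rectified, the virtual right view is displaced purely horizontally, disparities are in pixels, and the function and parameter names are hypothetical. It is not the paper's exact residual, which is optimized jointly with the temporal multi-view term.

```python
import numpy as np

def virtual_stereo_residual(row_intensity, row_disp_right, x, depth, fx, baseline):
    """Photometric residual of one point after a round trip through the virtual right view.

    row_intensity:  1-D intensities of the left image row containing the point.
    row_disp_right: 1-D predicted right disparities (pixels) for the same row.
    x:              horizontal pixel coordinate of the point in the left image.
    depth:          current depth estimate of the point (same unit as baseline).
    """
    xs = np.arange(row_intensity.shape[0], dtype=np.float64)
    # Re-project onto the virtual right image using the current depth estimate.
    x_virtual = x - fx * baseline / depth
    # Warp back to the left image with the predicted right disparity.
    x_back = x_virtual + np.interp(x_virtual, xs, row_disp_right)
    # The residual vanishes when the depth agrees with the predicted disparity.
    return np.interp(x_back, xs, row_intensity) - np.interp(x, xs, row_intensity)
```

If the depth estimate implies the same disparity as the network prediction, the point lands back on its starting pixel and the residual is zero. Otherwise the residual is non-zero wherever the image intensity varies, and reducing it pulls the depth toward the predicted disparity.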