Problem Definition

To estimate the speed of vehicles from given monocular video. An example, where the vehicles are detected and their corresponding speed is identified is shown below.



  • The system consists of three components as shown in the figure: (1) vehicle detection (2) tracking, and (3) speed estimation.
  • In the first step, the vehicle detection and segmentation is done using the well known Mask-RCNN.
  • In the second step, the tracking is done using an algorithm called DEEPSORT, which not only does tracking, but also estimates the pixel velocity with the help of Kalman filters.
  • After obtaining pixel velocities from the Kalman filter in the image domain, the speed estimation approach has two steps:
    • affine rectification to restore affine-invariant properties of the scene, and
    • scale recovery which estimates vehicle speed in the real world.


Training and Data set details

  • Mask-RCNN that is implemented has been pre-trained on the MSCOCO dataset.
  • The DeepSORT algorithm used for vehicle tracking uses a pre-trained CNN, which extracts a set of appearance features for each bounding box detected. The details of the CNN can be found in Figure 3 of the paper.
  • The network is trained on the CompCars dataset which consists of 136,726 images for 163 car makes and 1,716 car models.


This paper achieved 7th position among all participating teams in the Track 1 of NVIDIA AI City Challenge