Approximate biometric measurements are often needed for an effective online shopping experience, for example when buying clothing, eyewear or footwear. In this post we describe our experiments with Mask-RCNN based object segmentation for measuring human feet, with the aim of recommending appropriate footwear.

Table of contents

  1. Introduction
  2. Study
  3. Approach
  4. Dataset Collection
  5. Model
  6. Loss graphs
  7. Prediction
  8. Metrics
  9. Aligning output masks
  10. Methodology for calculating foot measurements
  11. Length and Breadth
  12. Arch type
  13. Results
  14. References


For online shopping, customers need to know their own body measurements before placing an order. For footwear, the most commonly used measurement is the heel-to-toe length. If the customer also knows their foot width and arch type (High, Medium or Low), it is possible to offer better model recommendations.

Recently, deep learning based computer vision has provided a set of flexible tools that we can apply to problems of this kind. We experimented with Mask-RCNN based object segmentation applied to an image of a wet footprint with the intention of capturing the parameters to recommend the right footwear model and size.


For our problem, we need to isolate the foot impression and then measure its attributes. We therefore looked at object detection techniques that combine instance segmentation with classification of the objects in the input image.

Recent architectures that address this kind of problem include RCNN, Fast-RCNN and Faster-RCNN, with the latest in the line being Mask-RCNN from Facebook AI Research (FAIR).

Mask-RCNN extends Faster-RCNN by adding a branch for predicting an object mask parallel to the existing branch for bounding box recognition. It detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. It is designed for pixel-to-pixel alignment between network inputs and outputs.


It can use ResNet (50, 101, 152), ResNeXt (50, 101, 152) or VGG (16, 19) as the backbone network architecture. Mask-RCNN was evaluated on the MS COCO dataset, where at the time of publication it outperformed all existing single-model entries on every task, including the COCO 2016 challenge winners.


Data Collection

BG Data

We collected images of wet foot impressions on different backgrounds (floor, newspaper, A4 sheet, brown and sky-coloured sheets), as shown in Fig-2. We needed a way to capture impressions of feet, as well as an in-image reference for determining the scale of the impression. We wanted to accomplish this using materials that are highly likely to be at hand.

So we chose credit cards or their equivalent, since their dimensions are standardized. To manually annotate the foot impressions and credit cards in our image dataset, we used the VGG Image Annotator (VIA) tool.


Impressions in the images may be slightly rotated with respect to the image's orientation (see Fig-3(a)). If we don’t correct this rotation, we would measure the projected length of the foot rather than its true length. The image shown has been rotated anticlockwise by 2.726 degrees (see Fig-3(b)).

The reference card object may also appear at arbitrary angles. While training we don’t correct these rotations. After prediction, we align both the detected objects according to our requirements. See section Aligning output masks.


Once we collected and annotated our dataset of foot impressions, we divided it into training, validation and testing groups.

DatasetImages size

Reasons for using a small dataset:

To generate the data required for transfer learning on the Mask-RCNN network, we collected footprint images of both the left and right feet of around 60 people. We augmented this dataset with automated transformations to reduce the chance of overfitting. We started with a small dataset, with a view to expanding it at a later stage if that proved necessary.

Applied Augmentations:

  • Rotation (clockwise and anticlockwise)
  • Gaussian blur
  • Brightness
  • Scaling
  • Contrast
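
As a rough illustration of two of the augmentations listed above, the sketch below adjusts brightness and contrast of a grayscale image and rotates annotation vertices so that a polygon mask stays aligned with a rotated image. It is a minimal, dependency-free sketch rather than the pipeline we actually ran: the function names, the mid-grey pivot of 128 and the use of plain Python lists are all assumptions made for illustration.

```python
import math


def adjust_brightness_contrast(pixels, brightness=0.0, contrast=1.0):
    """Return a new grayscale image (list of rows, values 0..255).

    Each pixel p becomes contrast * (p - 128) + 128 + brightness,
    clamped to the valid 0..255 range. The pivot 128 is an assumption.
    """
    adjusted = []
    for row in pixels:
        new_row = []
        for p in row:
            value = round(contrast * (p - 128) + 128 + brightness)
            new_row.append(min(255, max(0, value)))
        adjusted.append(new_row)
    return adjusted


def rotate_points(points, angle_deg, centre):
    """Rotate (x, y) annotation vertices about `centre` by `angle_deg`.

    With image coordinates (y pointing down), a positive angle appears
    clockwise on screen. The same angle must be applied to the image.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = centre
    rotated = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return rotated
```

Geometric augmentations (rotation, scaling) have to be applied to the annotations as well as to the pixels, whereas photometric ones (blur, brightness, contrast) leave the annotations unchanged.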

Loss graphs






Aligning output masks

As described in the Dataset section, we must correct the rotations present in the foot impression and the card. The inputs are individual foot and card object masks that are predicted by the model.

Predicted mask

We computed the PCA (Principal Component Analysis) for the individual masks using OpenCV. The outputs of the PCA give us the centre of the target and the eigenvectors. The first eigenvector always points in the direction of highest variance in the input data and the second eigenvector is always orthogonal to the first eigenvector. The angle between the first eigenvector and the coordinate axis gives us the rotation correction to be made.
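
The orientation estimate described above can be sketched without OpenCV. For a 2D point set, the direction of the first principal component has a closed form in terms of the covariance entries, so the sketch below uses that instead of a general PCA routine. This is a swapped-in, dependency-free equivalent and not the OpenCV call we used; the function name and the convention of measuring the angle against the x-axis are assumptions.

```python
import math


def mask_orientation(mask):
    """Return ((cx, cy), angle_deg) for a binary mask.

    `mask` is a list of rows whose truthy entries are foreground pixels.
    The angle is between the first principal axis and the x-axis,
    in the range (-90, 90] degrees.
    """
    points = [(x, y)
              for y, row in enumerate(mask)
              for x, value in enumerate(row) if value]
    if not points:
        raise ValueError("mask has no foreground pixels")

    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    # Entries of the 2x2 covariance matrix of the pixel coordinates.
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n

    # Direction of the eigenvector with the largest eigenvalue.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), math.degrees(angle)
```

For a foot mask that should end up vertical, the rotation correction under these assumptions would be the difference between the returned angle and 90 degrees.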

Methodology for calculating foot measurements

Length and Breadth

After getting the aligned masks, the target mask is projected onto the x and y axes to find its length and breadth. To find the length, we calculate the difference between the largest and the smallest y-coordinate; the breadth is calculated in the same manner from the x-coordinates. This gives us the length and breadth in pixel dimensions.

The real-world dimensions of a credit card are 85.60 × 53.98 mm. We use this reference to set up the conversion from pixel dimensions to real-world dimensions. Multiplying the foot's pixel dimensions by the realworld2pixel ratio gives the real-world dimensions of the foot impression. We found that the predicted values were close to the ground truth.
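
A minimal sketch of this measurement step, assuming the aligned mask is a list of rows with truthy foreground pixels (the function name is hypothetical):

```python
def mask_extent(mask):
    """Return (length_px, breadth_px) of the foreground in an aligned mask.

    Length is the difference between the largest and smallest foreground
    y-coordinate; breadth is the same difference for x-coordinates.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        raise ValueError("mask has no foreground pixels")
    return max(ys) - min(ys), max(xs) - min(xs)
```

Because this takes coordinate differences, a shape that covers N rows reports a length of N - 1; adding 1 to each value would give the covered pixel count instead.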
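
The scale conversion can be sketched as follows, assuming the long side of the aligned card mask has already been measured in pixels. The function names and the example pixel values are hypothetical; only the 85.60 mm card length comes from the text above.

```python
CARD_LONG_SIDE_MM = 85.60  # long side of a standard credit card


def mm_per_pixel(card_long_side_px, card_long_side_mm=CARD_LONG_SIDE_MM):
    """Real-world millimetres represented by one pixel in the image."""
    if card_long_side_px <= 0:
        raise ValueError("card length in pixels must be positive")
    return card_long_side_mm / card_long_side_px


def to_mm(length_px, scale_mm_per_px):
    """Convert a pixel measurement to millimetres."""
    return length_px * scale_mm_per_px


# Illustrative values only: a card spanning 428 px and a foot spanning 1250 px.
scale = mm_per_pixel(428)
foot_length_mm = to_mm(1250, scale)
```

With these illustrative numbers the scale is about 0.2 mm per pixel, so the foot length comes out at about 250 mm. The conversion is only valid when the card and the foot lie in the same plane at a similar distance from the camera.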


Arch type

To predict the foot arch type, we use the Arch Index formula from KH Su et al.

We take the length from the heel to the ball of the foot as 0.83 × foot length (we fitted a regression model to estimate the heel-to-ball length from a given foot length). This region is then divided into 3 equal parts along its length, and the areas of the sections, from the top, are named A, B and C respectively. This is illustrated in the picture.

The Arch Index (AI) is categorized into three types:

  • High arch (AI ≤ 0.21)
  • Normal arch (0.21 < AI < 0.26) and
  • Low arch (AI ≥ 0.26)
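
The arch classification can be sketched as below. It assumes the commonly used Arch Index definition AI = B / (A + B + C), that the aligned mask has the toes at the top and the heel at the bottom, and that each row is assigned to a section by its centre. These are assumptions for illustration, as are the function names; only the 0.83 factor and the thresholds come from the text above.

```python
def section_areas(mask, ball_fraction=0.83):
    """Pixel areas (A, B, C) of the three equal sections, top to bottom.

    The heel-to-ball region is the bottom `ball_fraction` of the foot
    length. Assumes toes at the top of the mask and heel at the bottom.
    """
    rows = [(y, sum(1 for value in row if value))
            for y, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask has no foreground pixels")

    top, bottom = rows[0][0], rows[-1][0]
    foot_len = bottom - top + 1
    region_len = ball_fraction * foot_len
    start = bottom + 1 - region_len      # y where the ball line sits
    third = region_len / 3.0

    areas = [0, 0, 0]
    for y, count in rows:
        centre = y + 0.5                 # assign each row by its centre
        if centre < start:
            continue                     # toe region, excluded
        index = min(2, int((centre - start) / third))
        areas[index] += count
    return tuple(areas)


def arch_index(a, b, c):
    """Arch Index under the assumed definition AI = B / (A + B + C)."""
    total = a + b + c
    if total <= 0:
        raise ValueError("total area must be positive")
    return b / total


def arch_type(ai):
    """Map an Arch Index value to the categories listed above."""
    if ai <= 0.21:
        return "High"
    if ai < 0.26:
        return "Normal"
    return "Low"
```

A typical use would be `arch_type(arch_index(*section_areas(aligned_foot_mask)))`, where `aligned_foot_mask` is a hypothetical name for the rotation-corrected foot mask.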

We relied on this known approach to measure the arch as our reference.


Foot measurements of given input (foot impression).