This post presents a walkthrough of an object detection process applied to audio/video receiver back panel images. [1]

Introduction

An audio/video receiver (AVR) is a consumer electronics component used in a home theater. Its primary purpose is to receive audio and video signals from a number of sources and process them to drive loudspeakers and a display.

To enable schematics that include components such as AV receivers, we need a mapping of the I/O ports available on the receivers. Deriving this port catalog from the panel image and product manual is a laborious task.

The objective is to build a model which can detect different types of ports in an AV receiver back panel.

Approach

Dataset

A deep learning model needs hundreds of images of an object to train a good detection classifier. To train a robust classifier, the training images should vary in background, lighting conditions, size, and image quality, while still containing the desired objects. The set can also include images where the desired objects are obscured, overlapped with something else, or only halfway in the picture.

We used the Google search engine to download AV receiver back panel images of varied shapes and sizes.

Sample Image:

sample.png

Even though most manufacturers follow the industry standards, we came across multiple instances where the color scheme used to code the connectors was completely different from the usual one.

Sample Image:

sample1.png

We collected 75 images in total. The files range from 2.2 KB to 1.7 MB, with the high-resolution images close to 1928x836. We split the final dataset into training and test datasets.

Train images | Test images
65 | 10
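
The post does not say how the 65/10 split was made. As a minimal sketch, assuming a simple random hold-out, the split could look like the following. The file names and the `split_dataset` helper are hypothetical and only serve the illustration.

```python
import random


def split_dataset(image_paths, n_test, seed=0):
    """Shuffle the paths reproducibly and hold out n_test of them for testing."""
    paths = sorted(image_paths)          # fixed starting order
    random.Random(seed).shuffle(paths)   # deterministic shuffle for a given seed
    return paths[n_test:], paths[:n_test]


# Hypothetical file names standing in for the 75 downloaded panel images.
images = [f"panel_{i:03d}.jpg" for i in range(75)]
train, test = split_dataset(images, n_test=10)
print(len(train), len(test))
```

Fixing the seed keeps the test images the same across experiments, so scores from different model configurations stay comparable.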

We conducted experiments to detect 4 different types of components in the AV receiver back panel:

hdmi.jpg
HDMI
video_component_small.jpg
Component Video
audio_in.png
Audio Input
pos_neg_small.png
Audio Output

After data collection, we used an open source tool to annotate the different ports in the images. The annotation tool creates an XML file for each image, which contains the coordinates of each port.
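
To make the annotation step concrete, here is a small sketch of reading one such XML file. It assumes a Pascal VOC-style layout (`object`, `name`, `bndbox`), which the post does not confirm. The sample XML, the label names, and the `parse_annotation` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in an assumed Pascal VOC-style layout.
SAMPLE_XML = """<annotation>
  <filename>panel_001.jpg</filename>
  <size><width>1928</width><height>836</height></size>
  <object>
    <name>hdmi</name>
    <bndbox><xmin>120</xmin><ymin>60</ymin><xmax>220</xmax><ymax>110</ymax></bndbox>
  </object>
  <object>
    <name>audio_input</name>
    <bndbox><xmin>400</xmin><ymin>300</ymin><xmax>440</xmax><ymax>340</ymax></bndbox>
  </object>
</annotation>"""


def parse_annotation(xml_text):
    """Return (filename, (width, height), boxes) for one annotated image."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        box = {"label": obj.findtext("name")}
        # Pixel coordinates of the box corners.
        for key in ("xmin", "ymin", "xmax", "ymax"):
            box[key] = int(bndbox.findtext(key))
        boxes.append(box)
    return root.findtext("filename"), (width, height), boxes


filename, size, boxes = parse_annotation(SAMPLE_XML)
print(filename, size, [b["label"] for b in boxes])
```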

Generating TFRecords

The annotation XML files are transformed into the TFRecord file format. A TFRecord combines all the labels (bounding boxes) and images for the entire dataset into one file. As the dataset is divided into train and test sets, there is a separate TFRecord file for each. These record files are then fed into our TensorFlow model.
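
The sketch below shows the per-image information that goes into a record, without depending on TensorFlow. The feature keys follow the naming used by the TensorFlow Object Detection API, which stores box coordinates normalized to the 0..1 range. The label map and the `to_feature_dict` helper are assumptions for illustration. In an actual pipeline the values would be wrapped in a `tf.train.Example`, together with the encoded image bytes, and written with a TFRecord writer.

```python
# Hypothetical label map; ids start at 1 because 0 is reserved for background.
LABEL_MAP = {"hdmi": 1, "component_video": 2, "audio_input": 3, "audio_output": 4}


def to_feature_dict(filename, width, height, boxes):
    """Collect labels and normalized box coordinates for one image."""
    return {
        "image/filename": filename,
        "image/width": width,
        "image/height": height,
        # Pixel coordinates are divided by the image size to get 0..1 values.
        "image/object/bbox/xmin": [b["xmin"] / width for b in boxes],
        "image/object/bbox/xmax": [b["xmax"] / width for b in boxes],
        "image/object/bbox/ymin": [b["ymin"] / height for b in boxes],
        "image/object/bbox/ymax": [b["ymax"] / height for b in boxes],
        "image/object/class/text": [b["label"] for b in boxes],
        "image/object/class/label": [LABEL_MAP[b["label"]] for b in boxes],
    }


# One made-up HDMI box on a 1928x836 image.
example_boxes = [{"label": "hdmi", "xmin": 964, "ymin": 209, "xmax": 1446, "ymax": 418}]
features = to_feature_dict("panel_001.jpg", 1928, 836, example_boxes)
print(features["image/object/bbox/xmin"], features["image/object/class/label"])
```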

Model

The TensorFlow model zoo provides many models along with their weights. From our research, we found that Mask-RCNN and Faster-RCNN were among the state-of-the-art models for object detection at the time. We reused the pre-trained weights of these models as a starting point (i.e. transfer learning) for our AV receiver port detection model.

Faster-RCNN: Faster R-CNN uses a CNN feature extractor to extract image features. It then uses a CNN region proposal network to create regions of interest (ROIs). ROI pooling warps each ROI to a fixed dimension. The result is then fed into fully connected layers that make the classification and bounding box predictions.
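
The ROI pooling step can be sketched in plain Python as follows. This is a simplified, single-channel illustration of max pooling a region into a fixed-size grid, not the implementation used by TensorFlow. The `roi_max_pool` helper and the toy feature map are hypothetical.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (y0, x0, y1, x1), end-exclusive, to out_h x out_w."""
    y0, x0, y1, x1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Bin edges along the height; every bin covers at least one row.
        ys = y0 + (i * roi_h) // out_h
        ye = max(ys + 1, y0 + ((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            # Bin edges along the width; every bin covers at least one column.
            xs = x0 + (j * roi_w) // out_w
            xe = max(xs + 1, x0 + ((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map whose value at (row, col) is row * 6 + col.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
whole = roi_max_pool(fmap, (0, 0, 4, 6), 2, 2)   # full map as one ROI
part = roi_max_pool(fmap, (1, 2, 3, 6), 2, 2)    # a smaller ROI
print(whole, part)
```

Both regions come out as 2x2 grids even though they differ in size, which is what lets the fully connected layers accept proposals of any shape.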

Mask-RCNN: Like Faster-RCNN, it generates proposals for regions of the input image that might contain an object. After the ROI pooling, 2 more convolution layers are added to build the mask.

The image below gives a visual representation of both network architectures.

arch.png

Approach Flow Chart

flowchart.png

We experimented with the above-mentioned models, starting from the weights pre-trained on the COCO dataset, which we downloaded from the TensorFlow model zoo.

Model | Configurations | Precision mAP (Test) | Recall (Test) | Iterations | Loss (Train/Test)
Faster RCNN Inception ResNet | Augmentation - random_rotation90, Optimizer - Adam, Stride - 8 | 0.65 | 0.16 | 30k | 0.04 / 0.68
Mask RCNN | Augmentation - ssd_random_crop, Optimizer - momentum_optimizer, Stride - 8 | 0.66 | 0.16 | 30k | 0.37 / 0.6

Though the mAP/AR scores of both models are in the same range, we found that Faster-RCNN generalized better than Mask-RCNN.
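
For readers unfamiliar with these scores, the sketch below shows the building block behind them: matching detections to ground-truth boxes by intersection over union (IoU) and counting precision and recall. It covers a single class at a single IoU threshold, whereas mAP and AR average over classes and further settings, so it is not the evaluation code used for the numbers above. The helper names and sample boxes are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truth, iou_threshold=0.5):
    """Greedily match (box, score) detections to ground truth, best score first."""
    matched = set()
    true_positives = 0
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_j, best_iou = None, iou_threshold
        for j, gt_box in enumerate(ground_truth):
            if j in matched:
                continue  # each ground-truth box can be matched only once
            overlap = iou(box, gt_box)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Made-up example: three annotated ports, three detections, one of them spurious.
gt = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
dets = [((0, 0, 10, 10), 0.9), ((21, 21, 31, 31), 0.8), ((100, 100, 110, 110), 0.7)]
precision, recall = precision_recall(dets, gt)
print(round(precision, 2), round(recall, 2))
```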

Tuning Techniques

We tuned the models by varying the following configuration parameters:

  1. Augmentation
  2. Optimizers
  3. image_resizer
  4. Different strides (8 and 16)
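
As an illustration of the augmentation item, the sketch below shows what a 90 degree rotation such as random_rotation90 has to do: rotate the image and move every bounding box with it. It assumes a counter-clockwise rotation and boxes in normalized (ymin, xmin, ymax, xmax) order. The helpers are hypothetical and are not the TensorFlow implementation.

```python
def rot90_image(pixels):
    """Rotate a 2D list of pixel values 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*pixels)][::-1]


def rot90_box(box):
    """Rotate a normalized (ymin, xmin, ymax, xmax) box 90 degrees counter-clockwise."""
    ymin, xmin, ymax, xmax = box
    # The old x axis becomes the new (flipped) y axis; the old y axis becomes x.
    return (1.0 - xmax, ymin, 1.0 - xmin, ymax)


# A 2x4 toy image with one marked pixel in the top-right corner.
image = [[0, 0, 0, 1],
         [0, 0, 0, 0]]
box = (0.0, 0.75, 0.5, 1.0)      # normalized box around the marked pixel

rotated_image = rot90_image(image)
rotated_box = rot90_box(box)
print(rotated_image, rotated_box)
```

After the rotation the marked pixel sits in the top-left corner of the 4x2 image, and the rotated box covers exactly that cell.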

For our dataset, the following combination gave better results:

  • Augmentation - random_rotation90
  • Optimizer - Adam
  • image_resizer - keep_aspect_ratio_resizer (min - 600, max - 1024)
  • Stride - 8
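
To see what the resizer setting above means for an image of roughly the size mentioned earlier (1928x836), here is a small sketch of the aspect-ratio-preserving arithmetic. It reflects our reading of the resizer's behavior: scale the shorter side to the minimum unless that pushes the longer side past the maximum, in which case scale the longer side to the maximum. The exact rounding in TensorFlow may differ, and the helper name is hypothetical.

```python
def keep_aspect_ratio_size(height, width, min_dimension=600, max_dimension=1024):
    """Return the (height, width) after an aspect-ratio-preserving resize."""
    # First try to bring the shorter side to min_dimension.
    small_scale = min_dimension / min(height, width)
    new_h, new_w = round(height * small_scale), round(width * small_scale)
    if max(new_h, new_w) > max_dimension:
        # The longer side would be too large, so cap it at max_dimension instead.
        large_scale = max_dimension / max(height, width)
        new_h, new_w = round(height * large_scale), round(width * large_scale)
    return new_h, new_w


print(keep_aspect_ratio_size(836, 1928))   # wide panel image: (444, 1024)
print(keep_aspect_ratio_size(600, 800))    # already within bounds: (600, 800)
```

Under these assumptions, a wide back panel image ends up only about 444 pixels tall, which makes small ports smaller still. This may be one reason the finer stride of 8 worked better here, though the post does not test that explanation.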

Loss graphs

tensorboard.png

Prediction

The trained object detection model is evaluated on the test images. Below are sample images with the ports detected by the model.

result1.png
result2.png

Results

results.gif

Conclusion

From our results, we observed that Faster-RCNN performed better than Mask-RCNN, reaching an mAP of 0.65 and an AR of 0.16 in 30k iterations. Though the dataset is small, we were able to achieve decent accuracy scores.

Future Work

  • Detect more port types, sub-components, and types of AV components.
  • Add more training data (generalizability is important, though).
  • Optimizations: hyperparameter tuning and data augmentation.
  • Add image text data along with its bounding boxes (extracted using OCR).
  • Use both object identification and text to improve precision and recall.

References

[1]. Annotation tool: https://github.com/frederictost/images_annotation_programme

[2]. Experiment Results: https://docs.google.com/spreadsheets/d/1TVhWb9ipoYXKEcvciwlKxT8rdm-Nsm8QHKEUdQIGog4/edit#gid=0

[3]. Tensorflow Object Detection API: https://github.com/tensorflow/models/tree/master/research/object_detection

[4]. Faster R-CNN Paper: https://arxiv.org/abs/1506.01497

[5]. Mask R-CNN Paper: https://arxiv.org/pdf/1703.06870.pdf

[6]. Blog: https://3sidedcube.com/guide-retraining-object-detection-models-tensorflow/


This post was published in Jan 2019, but presents work done in Aug 2018.