Audio/Video receiver ports detection using Deep Learning

This post presents a walk through of an object detection process applied to Audio/Video receiver back panel images. 1

Table of contents

  1. Introduction
  2. Approach
    1. Dataset
    2. Generating TF Records
    3. Model
    4. Tuning Techniques
    5. Loss graphs
    6. Prediction
  3. Results
  4. Conclusion
  5. Future Work
  6. References


Introduction

An audio/video receiver (AVR) is a consumer electronics component used in a home theater. Its primary purpose is to receive audio and video signals from a number of sources and process them to drive loudspeakers and a display.

To enable schematics that include components such as AV receivers, we need a mapping of the I/O ports available on the receivers. Deriving this port catalog from the panel image and product manual is a laborious task.

The objective is to build a model that can detect the different types of ports on an AV receiver back panel.



Approach

Dataset

A deep learning model needs hundreds of images of an object to train a good detection classifier. To train a robust classifier, the training images should have a variety of backgrounds and lighting conditions, come in different sizes, and vary in image quality along with the desired objects. There could be some images where the desired objects are obscured, overlapped with something else, or only halfway in the picture.

We used the Google search engine to download AV receiver back panel images of varied shapes and sizes.

Sample Image:


Even though industry standards are followed by most manufacturers, we did come across multiple instances where the color schemes used to code the connectors were totally different from the usual ones.

Sample Image:


We collected a total of 75 images, each ranging from 2.2 KB to 1.7 MB in size, with resolutions close to 1928x836. We split the final dataset into training and test sets.

Train images   Test images
65             10
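The post does not say how the 65/10 split was made; as an illustrative sketch (directory layout, seed, and ratio are assumptions), a reproducible split of the downloaded image files could look like this:

```python
import random
from pathlib import Path

def split_dataset(image_dir, train_ratio=65 / 75, seed=42):
    """Shuffle the image file list and split it into train and test lists."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)  # deterministic shuffle for reproducibility
    cut = round(len(images) * train_ratio)
    return images[:cut], images[cut:]
```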

We conducted experiments to detect 4 different types of components in the AV receiver back panel, namely:

Name                    Image
HDMI                    hdmi.jpg
Video Component         video_component_small.jpg
Audio Component Input   audio_in.png
Audio Output            pos_neg_small.png

After data collection, we used an open source tool to annotate the different ports in the images. The annotation tool creates an XML file corresponding to each image, which contains the coordinate information for each port.

Generating TF Records

The annotation XML files are transformed into the TFRecord file format. A TFRecord combines all the labels (bounding boxes) and images for the entire dataset into one file. As the dataset is divided into train and test sets, there is a separate TFRecord file for each. These record files are then fed into our TensorFlow model.
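Assuming the annotation tool emits Pascal VOC-style XML (an assumption; the post only says the tool produces XML with port coordinates), the first step of TFRecord generation — extracting each label and bounding box from the XML — can be sketched as below. Packing the parsed boxes into `tf.train.Example` records is then handled by the TensorFlow Object Detection API's conversion scripts, which are omitted here.

```python
import xml.etree.ElementTree as ET

def parse_annotation(xml_text):
    """Extract (label, xmin, ymin, xmax, ymax) tuples from a VOC-style XML string."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")  # port class, e.g. "HDMI"
        bb = obj.find("bndbox")       # pixel coordinates of the box
        boxes.append((
            label,
            int(bb.findtext("xmin")), int(bb.findtext("ymin")),
            int(bb.findtext("xmax")), int(bb.findtext("ymax")),
        ))
    return boxes
```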


Model

The TensorFlow model zoo provides many models along with their weights. From our research, we found that Mask R-CNN and Faster R-CNN are among the current state-of-the-art models for object detection. We reused the pretrained weights of these models as a starting point (i.e., transfer learning) for our AV receiver port detection model.

Faster R-CNN: Faster R-CNN uses a CNN feature extractor to extract image features. It then uses a CNN-based region proposal network to create regions of interest (ROIs). ROI pooling warps these regions into a fixed dimension, and the result is fed into fully connected layers for classification and bounding-box prediction.

Mask R-CNN: Similar to Faster R-CNN, it generates proposals for regions of the input image that might contain an object. After ROI pooling, two additional convolution layers are added to build the mask.

The image below gives a better visual representation of the two network architectures.


Approach Flow Chart:


We experimented with the above-mentioned models by downloading weights pretrained on the COCO dataset from the TensorFlow model zoo.

Model                           Configurations                                                           mAP (Test)   Recall (Test)   Iterations   Loss (Train/Test)
Faster R-CNN Inception ResNet   Augmentation: random_rotation90, Optimizer: Adam, Stride: 8              0.65         0.16            30k          0.04 / 0.68
Mask R-CNN                      Augmentation: ssd_random_crop, Optimizer: momentum_optimizer, Stride: 8  0.66         0.16            30k          0.37 / 0.6

Though the mAP/AR scores of both models are in the same range, we found that Faster R-CNN generalized better than Mask R-CNN.
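Both mAP and AR are computed from the intersection-over-union (IoU) between predicted and ground-truth boxes: a detection counts as correct only when its IoU with a ground-truth box exceeds a threshold. As a reference, a minimal IoU implementation looks like this:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Intersection rectangle (clamped to zero width/height if boxes are disjoint)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```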

Tuning Techniques

We tuned the models with the different configuration parameters listed below:

  1. Augmentation
  2. Optimizers
  3. image_resizer
  4. Strides (8 and 16)

For our dataset, we observed the best results with the following combination:

  • Augmentation- random_rotation90
  • Optimizer - Adam
  • image_resizer - keep_aspect_ratio_resizer(min - 600, max- 1024)
  • Stride - 8
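As a rough sketch, this combination maps onto a TensorFlow Object Detection API `pipeline.config` roughly as follows. Field names follow the API's protobuf schema, but the exact structure varies by API version, and the elided values (`...`) are placeholders, not the settings used in our experiments:

```
model {
  faster_rcnn {
    first_stage_features_stride: 8
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    ...
  }
}
train_config {
  optimizer {
    adam_optimizer { ... }
  }
  data_augmentation_options {
    random_rotation90 { }
  }
  ...
}
```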

Loss graphs

Faster R-CNN TensorBoard graphs: tensorboard.png


Prediction

The trained object detection model is evaluated on the test images. Below are sample images with detections produced by the model.






Results

From our results, we observed that Faster R-CNN performed better than Mask R-CNN. It reaches an mAP of 0.65 and an AR of 0.16 in 30k iterations. Though the dataset size is small, we were able to achieve decent accuracy scores.

Future Work

  • Detect more types of ports, sub-components, and types of AV components.
  • Add more training data (generalizability is important, though).
  • Optimizations: hyperparameter tuning and data augmentation.
  • Add image text data along with its bounding boxes (extracted using OCR).
  • Use both object detection and text to improve precision and recall.


References

[1]. Annotation tool:

[2]. Experiment Results:

[3]. Tensorflow Object Detection API:

[4]. Faster R-CNN Paper:

[5]. Mask R-CNN Paper:

[6]. Blog:

  1. This post was published in January 2019 but presents work done in August 2018. [return]
About Reddy Anil Kumar, "Anil"
Data science and engineering team.
About Gaurish Thakkar, "Gaurish"
Data science and engineering team.
About Varun Gagneja, "Varun"
Data science and engineering team.