How to select the perfect CNN Back-bone for Object Detection? — A simple test

Ishan Bhatt · Published in Analytics Vidhya · 5 min read · Aug 4, 2020

Hi! So you’re interested in using a Convolutional Neural Network (CNN) to solve Computer Vision (CV) problems. Carry on if you’re working with a Classification, Detection or Segmentation model. Ideally, the trick I describe (with code) here should work with any Deep Learning (DL) model that uses a Convolutional Feature Extractor (CONV-FE). (A “Feature Extractor” is sometimes also referred to as a “Back-bone” or an “Encoder”.)

With CNNs growing deeper and wider by the day, visualization is often the only way for developers to interpret their models. Whether you’re trying to figure out why your model is not working, validate its performance, or choose the best CONV-FE for your pipeline, visualization is often the closest substitute for a textbook solution.

Today we’re going to see a neat trick to visualize the CONV-FE for any CV task. We’ll use Keras with the TensorFlow backend to illustrate the trick. The trick can be used to “debug” a model that’s not training properly or to “inspect”/“interpret” a trained model.

For new readers, it’s worth mentioning the difference between an architecture and a CONV-FE. SSD, YOLO and F-RCNN are some of the popular Object Detection architectures today. As you know, architectures are templates with some built-in flexibility that lets the user tune them for their use-case. The CONV-FE is one component that most architectures treat as a plug-in. Some of the popular CONV-FEs are VGG-16, ResNet50 and MobileNetV2. The choice of CONV-FE often greatly affects the performance (both speed and accuracy) of a model built with a particular architecture. For this reason, practitioners mention the CONV-FE they used in the model name itself: an F-RCNN(ResNet50) is a model with the architecture described in the F-RCNN paper built using a ResNet50 CONV-FE. In other words, F-RCNN(MobileNetV2) and F-RCNN(VGG-16) follow the same network architecture but use different CONV-FEs, so we can expect different performance from each system. CONV-FEs also have architectures of their own, which is why we have VGG-16 as well as VGG-19, but that’s a story for another day!
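
Since most architectures treat the back-bone as a plug-in, swapping CONV-FEs is usually a one-line change in code. Below is a minimal illustrative sketch of that idea in Keras; it is not part of the original snippets and simply assumes tensorflow.keras.applications is available:

    from tensorflow.keras.applications import VGG16, ResNet50, MobileNetV2

    # Each call returns only the convolutional feature extractor (no classifier head),
    # so any of these can be dropped into the same detection architecture.
    backbones = {
        "VGG-16": VGG16(weights="imagenet", include_top=False),
        "ResNet50": ResNet50(weights="imagenet", include_top=False),
        "MobileNetV2": MobileNetV2(weights="imagenet", include_top=False),
    }

    for name, fe in backbones.items():
        # The last dimension is N, the number of channels in the output feature map:
        # 512 for VGG-16, 2048 for ResNet50, 1280 for MobileNetV2.
        print(name, fe.output_shape)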

Figure 1: Template for a generic Object Detection algorithm. This fits YOLO, F-RCNN, SSD, etc.

Figure 1 shows a high-level architecture of a DL-powered Object Detection system. Well-known architectures such as YOLO, F-RCNN and SSD can be generalized with this template. [W, H, 3] denotes the width and height of the input image and that it has 3 channels (i.e., an RGB input). [w, h, N] denotes the output dimensions of the CONV-FE. Note that N is the number of filters in the last CONV layer of the feature extractor, so it is usually a large number (for VGG-16, N = 512). w and h are almost always smaller than W and H respectively, since CONV blocks are usually followed by Pooling layers in most architectures today.

As mentioned above, the choice of CONV-FE is crucial for achieving the best performance from our CV system. However, the problem lies in the high value of N: it is not possible to visualize a 3-D tensor whose dimensions are all arbitrarily large! So that’s the problem we’re going to attack today. Let’s start without any further ado. We will use Google Colaboratory for simplicity.

Let’s begin by selecting a CONV-FE. How about VGG-16? Let’s try and load it.

Snippet 1: Script to load the CONV-FE
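
The embedded gist is not reproduced above; the following is a minimal sketch of what such a loading script looks like, assuming tensorflow.keras and the variable name conv_fe (the exact script lives in the GitHub repository linked at the end):

    from tensorflow.keras.applications import VGG16

    # Load VGG-16 without its classification head so that only the CONV-FE remains.
    # Fixing input_shape makes the output dimensions [w, h, N] from Figure 1 concrete.
    conv_fe = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    conv_fe.summary()  # last layer block5_pool outputs (None, 7, 7, 512), i.e. w = h = 7, N = 512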

Once the model is loaded, we need some boiler-plate scripting to load the images that we want to try our trick on. I am not describing those steps here since they vary greatly depending on your particular use-case. (They are, however, implemented in the code with comments if you insist.) For now, we just assume that all the required images are stored in a directory called “images” located directly inside our working directory. Let’s now use the trick!

Snippet 2: The code to load the images, run inference on them and visualize the CONV-FE
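
Again, the gist itself is not shown here; the sketch below captures the essence of the trick under a few assumptions of mine (the directory names IMAGE_DIR/RESULT_DIR, matplotlib for plotting, and conv_fe from the previous sketch). The exact code, including the boiler-plate with comments, is in the linked repository:

    import os
    import numpy as np
    import matplotlib.pyplot as plt
    from tensorflow.keras.applications.vgg16 import preprocess_input
    from tensorflow.keras.preprocessing.image import load_img, img_to_array

    IMAGE_DIR, RESULT_DIR = "images", "results"
    os.makedirs(RESULT_DIR, exist_ok=True)

    for fname in sorted(os.listdir(IMAGE_DIR)):
        # Load and preprocess the image exactly as VGG-16 expects it.
        img = load_img(os.path.join(IMAGE_DIR, fname), target_size=(224, 224))
        batch = preprocess_input(np.expand_dims(img_to_array(img), axis=0))

        # Run inference: the output feature map has shape (w, h, N).
        fmap = conv_fe.predict(batch)[0]

        # Aggregate across the N channels (the two aggregation lines discussed below).
        fmap_max = fmap.max(axis=-1)    # pixel-wise maximum over the channels
        fmap_mean = fmap.mean(axis=-1)  # pixel-wise average over the channels

        # Plot three columns: the input image, the max map and the mean map.
        fig, axes = plt.subplots(1, 3, figsize=(12, 4))
        axes[0].imshow(img)
        axes[0].set_title("Input")
        axes[1].imshow(fmap_max)
        axes[1].set_title("Max over channels")
        axes[2].imshow(fmap_mean)
        axes[2].set_title("Mean over channels")
        for ax in axes:
            ax.axis("off")
        fig.savefig(os.path.join(RESULT_DIR, os.path.splitext(fname)[0] + ".png"))
        plt.close(fig)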

You will see output resembling the figures below (these will also be saved in a “results” directory created directly inside the working directory).

Figure 2.1: Results1_Elephant
Figure 2.2: Results2_Car
Figure 2.3: Results3_Man
Figure 2.4: Results4_PCB

The second column shows the pixel-wise maximum that any channel achieves in the output feature map produced by the CONV-FE. The third column shows the average over all the channels. We say that the CONV-FE is suitable for our dataset if the second and third columns in our results show clear feature separation.

As we can see, Results 1 and 2 are pretty nice, but Results 3 and 4 suggest the current model is not really suitable if you have full-screen PCB images or images of Indian men in traditional attire. However, this model seems to be a good choice if your dataset contains pictures of animals in the wild or cars/vehicles.

Now, let’s add a small justification for lines 25 and 26 of Snippet 2, the two aggregation lines. According to Figure 1, the output feature map can have any number (N) of output channels. This is the principal reason for our inability to visualize and compare CONV-FEs directly. We use common aggregation techniques, viz. the pixel-wise maximum and average, to get an overall approximation of the quality of the feature map. I have not aggregated using the pixel-wise median along the channels. Can you reason why in the comments? Can you tell what the output would look like if I added a fourth, median column? (Hint: you can use np.median() to find the median along any axis of any NumPy array.)
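
For the curious, the hinted median map would be just one more aggregation along the channel axis; a minimal sketch, assuming fmap from the earlier sketch:

    import numpy as np

    # fmap has shape (w, h, N); np.median aggregates along the channel axis,
    # giving a fourth map that could be plotted next to the max and mean columns.
    fmap_median = np.median(fmap, axis=-1)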

Github link to full-code: https://github.com/IshanBhattOfficial/Vizualize-CONV-FeatureExtractors

Hope this was helpful!
