Wieland Brendel, Matthias Bethge (2018)
Approximating CNNs with Bag-of-Local Features Models Works Surprisingly Well On ImageNet

Dataset and Model weights: ImageNet 2014

This paper presents an approach to approximating CNNs with bag-of-local-features models by modifying standard DNN architectures (VGG, ResNet-50) so that the resulting models behave similarly to their DNN counterparts. This increases interpretability and reveals which features CNNs attend to in the context of image classification.

### Architecture

• The input image is divided into patches of q $\times$ q pixels.

• All 3 $\times$ 3 convolutions are replaced by 1 $\times$ 1 convolutions (except some layers in ResNet-50), which limits the receptive field of the topmost layer to q $\times$ q pixels.

• A 2048-dimensional feature representation is inferred from each patch, and a linear classifier produces class probabilities for each patch. These per-patch probabilities are averaged across all patches to obtain the image-level class probabilities (logits).
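The patch-level classification and averaging step above can be sketched as follows (a minimal illustration with random weights, not the authors' code; the patch count for a 224 $\times$ 224 input is an assumption):

```python
import numpy as np

# Hypothetical setup: each q x q patch yields a 2048-d feature vector,
# and a shared linear classifier maps it to 1000 class logits.
rng = np.random.default_rng(0)
n_patches = 24 * 24                     # assumed patch grid for a 224x224 input
features = rng.standard_normal((n_patches, 2048))
W = rng.standard_normal((2048, 1000)) * 0.01
b = np.zeros(1000)

patch_logits = features @ W + b          # class evidence per patch
image_logits = patch_logits.mean(axis=0) # average over all patches

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(image_logits)            # image-level class probabilities
print(probs.shape)                       # (1000,)
```

Because the classifier is linear and shared across patches, the image-level logits are simply the mean of the per-patch class evidence, which is what makes per-patch contributions easy to inspect.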

### Key Points

• ImageNet can be solved using only a collection of small image features. Long-range spatial relationships like object shape or the relation between object parts can be completely neglected and are unnecessary to solve the task.

• Local image patches carry enough evidence for the network to make its decision, and this patch-based mechanism makes the decision-making process more transparent.

• Standard CNNs and BagNets are sensitive to similar features, which means even their misclassifications are similar.

• Decisions are invariant to spatial shuffling of image features.

• Modifications of different image parts act approximately independently: their combined effect on the total class evidence is close to the sum of their individual effects.

• These results support the observation that standard CNNs have a texture bias, relying more on local textures than on global object shape.