Wieland Brendel, Matthias Bethge (2018)
Approximating CNNs with Bag-of-Local Features Models Works Surprisingly Well On ImageNet

Paper: https://openreview.net/pdf?id=SkfMWhAqYQ

Dataset and Model weights: ImageNet 2014

This paper presents an approach to approximate CNNs with bag-of-local-features models (BagNets) by modifying standard DNN architectures (VGG, ResNet-50) so that the resulting models behave similarly to their unmodified DNN counterparts. This increases interpretability and makes it observable which features CNNs attend to in image classification.

Architecture

  • The input image is divided into patches of q $\times$ q pixels.

  • All 3 $\times$ 3 convolutions are replaced by 1 $\times$ 1 convolutions (except in some layers of ResNet-50), which limits the receptive field of the topmost layer to q $\times$ q pixels.

  • A 2048-dimensional feature representation is inferred for each patch, and a shared linear classifier maps it to per-patch class evidence (logits). The logits are then averaged across all patches to obtain the image-level class probabilities.
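The pipeline above (patches → per-patch features → per-patch logits → average) can be sketched in a few lines. This is a toy illustration, not the paper's code: the random linear "feature extractor" and classifier, the tiny image size, and the non-overlapping patch grid are all simplifying assumptions (the paper uses q ∈ {9, 17, 33}, 2048-dimensional features, 1000 ImageNet classes, and densely overlapping patches).

```python
import random

random.seed(0)

Q = 4        # patch size q (paper: 9, 17, or 33)
IMG = 8      # toy image side length, divisible by Q here
FEAT = 16    # toy feature dimension (paper: 2048)
CLASSES = 3  # toy number of classes (ImageNet: 1000)

# Hypothetical fixed random weights standing in for the learned
# patch feature extractor and the linear classifier.
W_feat = [[random.uniform(-1, 1) for _ in range(Q * Q)] for _ in range(FEAT)]
W_cls = [[random.uniform(-1, 1) for _ in range(FEAT)] for _ in range(CLASSES)]

def patch_logits(patch):
    """Map one flattened q*q patch to class logits via features."""
    feats = [sum(w * x for w, x in zip(row, patch)) for row in W_feat]
    return [sum(w * f for w, f in zip(row, feats)) for row in W_cls]

def bagnet_predict(image):
    """Average per-patch logits over all non-overlapping q x q patches."""
    logits = [0.0] * CLASSES
    n_patches = 0
    for i in range(0, IMG, Q):
        for j in range(0, IMG, Q):
            patch = [image[i + di][j + dj] for di in range(Q) for dj in range(Q)]
            for k, v in enumerate(patch_logits(patch)):
                logits[k] += v
            n_patches += 1
    return [v / n_patches for v in logits]

image = [[random.uniform(0, 1) for _ in range(IMG)] for _ in range(IMG)]
print(bagnet_predict(image))  # one logit per class
```

Note that no patch ever sees pixels outside its own q $\times$ q window, which is exactly what the 1 $\times$ 1 convolutions enforce in the real architecture.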

Key Points

  • ImageNet can be solved using only a collection of small image features. Long-range spatial relationships like object shape or the relation between object parts can be completely neglected and are unnecessary to solve the task.

  • Locally correlated patches are sufficient for the network to make its decision, and this mechanism makes the decision-making process transparent.

  • Standard CNNs and BagNets are sensitive to similar features; as a result, even their misclassifications are similar.

  • Decisions are invariant against spatial shuffling of image features.

  • Modifications to different image parts act largely independently: their effects on the total class evidence add up approximately linearly.

  • Standard CNNs exhibit a texture bias.
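The spatial-shuffling invariance noted above follows directly from the averaging step: image-level logits are a mean over per-patch logits, so permuting patch locations cannot change them. A minimal sketch with made-up per-patch logits (the numbers are illustrative, not from the paper):

```python
import random

random.seed(1)

CLASSES = 3
N_PATCHES = 16

# Hypothetical per-patch class logits for one image.
per_patch = [[random.uniform(-2, 2) for _ in range(CLASSES)]
             for _ in range(N_PATCHES)]

def image_logits(patch_logits):
    """Image-level logits = mean of per-patch logits (order-independent)."""
    n = len(patch_logits)
    return [sum(p[k] for p in patch_logits) / n for k in range(CLASSES)]

original = image_logits(per_patch)

shuffled = per_patch[:]
random.shuffle(shuffled)  # spatially shuffle the patches

# The averaged logits are identical up to floating-point noise.
assert all(abs(a - b) < 1e-9
           for a, b in zip(original, image_logits(shuffled)))
```

This permutation invariance is precisely what makes BagNets "bag-of-features" models: spatial arrangement carries no information at the image level.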


Comments

  • From this paper we can see that capturing long-range relationships with deep nets is not necessary, as CNNs mostly rely on locally correlated features.

  • This might explain why adversarial perturbations in general are picked up reliably by CNNs no matter where they are placed.

  • CNNs don't capture object-level details the way humans do, because there is no incentive for them to learn these details when the task can be solved using local features alone.

  • This opens up a new direction of designing objective functions that push CNNs to learn the physics of the natural world.