Document image classification is not as well studied as natural image classification. We experimented with different neural network architectures on a document image dataset and discuss our preliminary results in this post.


Document classification is a vital part of any document processing pipeline.  It helps us segregate documents into different groups which need to be processed in different ways. Classification is generally done using only textual data.  We wanted to study whether a given set of documents can be classified based on their visual layout or structural properties without relying on the textual information.

Convolutional neural networks (CNNs) require large datasets to achieve high accuracy. RVL-CDIP is a dataset of 400,000 scanned grayscale document images belonging to 16 different classes. Harley et al. introduced this dataset and presented a baseline performance using CNNs. Tensmeyer et al. studied how different data transformations (augmentations) affect the test set performance of the network. They also tried appending different image features (such as SIFT and SURF) and found that SIFT features performed best. Das et al. proposed a new architecture in which the image is split into multiple regions, each region is passed to a separate VGG-16 network, and the features from all the networks are passed to a meta-classifier that makes the final prediction.


The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images, 40,000 validation images, and 40,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.

Here are the classes in the dataset, and an example from each:



We used VGG-16, ResNet-18 and ResNet-34 networks throughout the experiments. Training from scratch gave us only 71.9% accuracy. As transfer learning has worked well on several computer vision tasks, we used ImageNet-pretrained weights to train our models on this dataset. For the sake of brevity, we present only the ResNet-34 results here.

First, we trained only the last linear layer, freezing all the other layers and using no data augmentation. This gave us an accuracy of 73%. Next, we trained the same model with all layers unfrozen and with the following data augmentations: horizontal and vertical flips, random rotation (+/- 10 degrees), random scaling, and affine warp. The results are shown below.

epoch | training loss | validation loss | accuracy

We trained the model for ten more epochs, and neither the training loss nor the validation loss decreased, so we conclude that the model has converged.

As training progressed, we kept track of the class pairs the model confused most often, and made an interesting observation: the number of samples in the top-10 most confused pairs changed little as training progressed. The table below lists the top-10 most confused pairs after 2 epochs and after 17 epochs. Each tuple has the structure ('original_class', 'predicted_class', number of samples).

Epoch 2            | Epoch 17
('12', '5', 155)   | ('12', '5', 157)
('5', '12', 152)   | ('5', '12', 122)
('5', '10', 120)   | ('9', '6', 98)
('9', '12', 109)   | ('1', '11', 89)
('11', '1', 106)   | ('5', '1', 85)
('1', '11', 104)   | ('11', '10', 83)
('9', '6', 103)    | ('5', '10', 81)
('15', '0', 98)    | ('11', '1', 80)
('11', '10', 97)   | ('9', '12', 77)
('10', '11', 95)   | ('10', '11', 74)

There are three possible causes of this behaviour:

  1. Mislabelled samples
  2. Ambiguous data samples. Visual inspection showed that some samples contain information suggesting they could belong to multiple classes.
  3. Model has reached its limit. The model can no longer explain the variance in the data in its current state.
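The most-confused pairs reported in the table above can be collected from the true and predicted labels of a validation set. The following is a minimal sketch using only the Python standard library; the function name `most_confused` and its arguments are hypothetical, not taken from the original experiments.

```python
from collections import Counter


def most_confused(true_labels, predicted_labels, top_k=10):
    """Return the top_k (original_class, predicted_class, count) tuples.

    Only misclassified samples are counted, so the diagonal of the
    confusion matrix is ignored.
    """
    pairs = Counter(
        (true, pred)
        for true, pred in zip(true_labels, predicted_labels)
        if true != pred
    )
    return [(true, pred, count) for (true, pred), count in pairs.most_common(top_k)]


# Toy example with made-up labels (not data from the experiments).
true = ["12", "12", "5", "5", "5", "1"]
pred = ["5", "5", "12", "5", "10", "1"]
print(most_confused(true, pred, top_k=1))  # [('12', '5', 2)]
```

Running this after each epoch and comparing the lists gives the kind of side-by-side view shown in the table.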

Class activation map visualization

We used Grad-CAM by Selvaraju et al. to visualize which regions of the image the model attends to when making a classification.
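For reference, the core Grad-CAM computation is small: the gradients of the class score with respect to a convolutional layer's feature maps are averaged over the spatial dimensions to give one weight per channel, and the heatmap is the ReLU of the weighted sum of the feature maps. The sketch below assumes NumPy and assumes the activations and gradients have already been captured from the network (for example with framework hooks); the function name `grad_cam` is hypothetical, and the final rescaling to [0, 1] is a display convenience rather than part of the method.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap.

    activations: array of shape (K, H, W), feature maps of one conv layer.
    gradients:   array of shape (K, H, W), gradient of the class score
                 with respect to those feature maps.
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    # One weight per channel: global average of the gradients.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps over the channel axis.
    cam = np.tensordot(weights, activations, axes=([0], [0]))
    # Keep only regions with a positive influence on the class score.
    cam = np.maximum(cam, 0.0)
    # Rescale for display (assumption: not part of the original method).
    max_value = cam.max()
    if max_value > 0:
        cam = cam / max_value
    return cam
```

The resulting heatmap has the spatial size of the chosen feature maps, so it is usually upsampled to the input image size before being overlaid on the document.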

The examples shown below are correct predictions.


The examples shown below are incorrect predictions.


The first image is predicted as a form because it actually is a form that was mislabelled as a letter; it could also be classified as handwritten. The second is predicted as an advertisement because a lot of examples in the advertisement class have pictures with text surrounding them.

Experiments on in-house dataset

We fine-tuned the model trained on the RVL-CDIP dataset on our in-house dataset. The training procedure is the same as above. When trained without data augmentation, the accuracy was 91.5%; when trained with data augmentation, the model converged in 3 epochs and the accuracy was 97.1%.

Recently we came across a paper by Katti et al. They proposed a new 2D representation of documents in which each Unicode character is assigned a color in RGB space, and the pixels occupied by the character in the document are filled with the corresponding color. This representation encodes both textual and spatial information. Katti et al. didn't report any experiments on how different choices of color space affected their results.

So we tried this representation on our in-house dataset. We tried two representations where RGB values are assigned randomly and one representation in grayscale. Surprisingly, this change in representation didn't affect the accuracy at all. All representations resulted in the same accuracy of 97.1%. The Grad-CAM output confirmed that the model is learning similar patterns across the respective classes. Further investigation on larger datasets is needed for this inference on colorspace choices to be conclusive.
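To make the representation concrete, here is a minimal sketch of how a document could be rendered this way, using only the Python standard library. It assumes the characters and their bounding boxes are already available (for example from OCR) as `(char, x0, y0, x1, y1)` tuples; that input format and the function names `char_colors` and `render_chargrid` are hypothetical. Colors are drawn at random with a fixed seed, so two characters are not guaranteed to get distinct colors.

```python
import random


def char_colors(charset, grayscale=False, seed=0):
    """Assign each character a random RGB color (or a gray level)."""
    rng = random.Random(seed)
    colors = {}
    for ch in sorted(set(charset)):
        if grayscale:
            value = rng.randint(1, 255)
            colors[ch] = (value, value, value)
        else:
            # Channels start at 1 so no character matches the black background.
            colors[ch] = tuple(rng.randint(1, 255) for _ in range(3))
    return colors


def render_chargrid(char_boxes, width, height, colors, background=(0, 0, 0)):
    """Fill each character's bounding box with that character's color.

    Returns a height x width grid (list of rows) of RGB tuples.
    Boxes are half-open: x0 <= x < x1 and y0 <= y < y1.
    """
    grid = [[background for _ in range(width)] for _ in range(height)]
    for ch, x0, y0, x1, y1 in char_boxes:
        color = colors[ch]
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                grid[y][x] = color
    return grid
```

Calling `char_colors` with different seeds gives different random RGB assignments, and `grayscale=True` gives a single-channel variant, which mirrors the kind of comparison described above.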

The Grad-CAM heatmap for images belonging to the same class.


Future work

Harley et al. mention that the dataset was originally multi-labelled and that class labels were assigned on the basis of whichever class their model predicted with the highest probability. As mentioned above, there is evidence of mislabelled and ambiguous samples. The present framework doesn't deal with such samples effectively. In future work, we plan to analyze this dataset using a Bayesian framework, which allows the model to express uncertainty about such samples.


A. W. Harley, A. Ufkes, K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," in ICDAR, 2015

Ramprasaath R. Selvaraju et al, "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization," in ICCV, 2017

Tensmeyer et al, "Analysis of Convolutional Neural Networks for Document Image Classification," in IAPR, 2017

Das et al, "Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks," in ICPR, 2018

Katti et al, "Chargrid: Towards Understanding 2D Documents," in EMNLP, 2018