Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis (2016)
Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval


The experiments in this paper show that CNNs trained on natural-image datasets such as ImageNet transfer well to document analysis tasks such as document image classification and document retrieval.

Document classification using AlexNet

  • Considers a classification task on a document image dataset, RVL-CDIP, consisting of 400,000 documents in 16 classes, as shown in the figure above.

  • Uses an AlexNet pre-trained on the ImageNet dataset for transfer learning and achieves an accuracy of about 90 percent on RVL-CDIP.

  • Studies whether enforcing region-specific feature-learning for CNNs is useful for improving document classification by using an ensemble of CNNs as shown below.

  • Specifically, five CNNs are used, which take as input (a) the whole image, (b) the header, (c) the footer, (d) the left body, and (e) the right body of the document.

  • Observes that explicitly enforcing such region-specific features gives no significant improvement over a single CNN that receives the whole image. It thereby concludes that CNNs seem to learn such features implicitly.
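
The five-input setup above can be illustrated with a minimal sketch that splits a page into the whole image, header, footer, left body, and right body. The split fractions (0.25 for header and footer, a half-width split for the body) and the names region_boxes and crop are assumptions made for illustration only; they are not the region sizes used in the paper.

```python
def region_boxes(height, width, header_frac=0.25, footer_frac=0.25):
    """Return (top, bottom, left, right) boxes for the five CNN inputs.

    The fractions are hypothetical defaults, not values from the paper.
    """
    header_end = int(height * header_frac)
    footer_start = height - int(height * footer_frac)
    mid = width // 2
    return {
        "whole": (0, height, 0, width),
        "header": (0, header_end, 0, width),
        "footer": (footer_start, height, 0, width),
        "left_body": (header_end, footer_start, 0, mid),
        "right_body": (header_end, footer_start, mid, width),
    }


def crop(image, box):
    """Crop a 2D image stored as a list of rows."""
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]


# Toy 8x8 "page" whose pixel value encodes its (row, column) position.
image = [[r * 10 + c for c in range(8)] for r in range(8)]
boxes = region_boxes(8, 8)
crops = {name: crop(image, box) for name, box in boxes.items()}
```

In the ensemble described above, each of these crops would be resized and passed to its own CNN; the notes report that this gave no significant gain over the single whole-image network.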


Document Image Retrieval

  • Given a test document, retrieval was performed by computing the Euclidean distance between the test document descriptor and every descriptor
    of the training set.

  • The sorted distances were then used to rank the images of the training data and to return a sorted list of documents for each test query.

  • A retrieved document is considered relevant if it belongs to the same class as the test document. Results obtained for a few test samples are shown below.


  • Query images are shown in the first column, and the top ten retrievals are shown in the following columns in order. Retrievals from the same class are shown with a green border; retrievals from a different class are shown with a red border. Retrievals from other classes are considered incorrect, but they are often good retrievals nonetheless.
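
The retrieval procedure above can be sketched in a few lines: compute the Euclidean distance from the query descriptor to every training descriptor, sort, and count a retrieval as relevant when its class matches the query's class. The toy 2-D descriptors, the class names, and the helpers rank_by_distance and precision_at_k are assumptions for illustration; in the paper the descriptors come from the CNN, and precision at k is used here only as a simple example measure.

```python
import math


def rank_by_distance(query, train_descriptors):
    """Return training indices sorted by Euclidean distance to the query."""
    distances = [math.dist(query, d) for d in train_descriptors]
    return sorted(range(len(train_descriptors)), key=lambda i: distances[i])


def precision_at_k(ranked, train_labels, query_label, k):
    """Fraction of the top-k retrievals that share the query's class."""
    top = ranked[:k]
    if not top:
        return 0.0
    relevant = sum(1 for i in top if train_labels[i] == query_label)
    return relevant / len(top)


# Hypothetical descriptors and labels standing in for CNN features.
train_descriptors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.2]]
train_labels = ["letter", "memo", "invoice", "letter"]

query = [0.0, 0.1]
ranked = rank_by_distance(query, train_descriptors)
top2_precision = precision_at_k(ranked, train_labels, "letter", 2)
```

Here the two nearest training descriptors both belong to the query's class, so they would appear with a green border in the figure described above.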

Datasets used