Anoop Raveendra Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, Jean Baptiste Faddoul (2018)
Chargrid: Towards Understanding 2D Documents
This paper introduces a novel type of text representation that preserves the 2D layout of a document. This is achieved by encoding each document page as a two-dimensional grid of characters. This representation has been used for an information extraction task from invoices and is shown to significantly outperform approaches based on sequential text or document images.
- Each distinct character in the document is mapped to an integer index (54 distinct characters, one-hot encoded) and is shown in its own color in the chargrid representation
- All the pixels that a particular character occupies in the image are replaced by the corresponding color.
- The encoded chargrid image is passed through a semantic segmentation pipeline with an encoder-decoder style architecture as shown in the figure
- The decoder performs pixel-wise segmentation of the chargrid image into the required invoice fields
- 5 header fields: (Invoice number, Date, Total Amount, Vendor name, Vendor address)
- 3 line-item fields: (Description, Qty, Line amount)
- 9 classes in total for segmentation, including background
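The character-to-index encoding described in the first two bullets of this list can be sketched as follows. This is a minimal illustration and not the authors' code: the `CharBox` input format (one bounding box per OCR character), the helper names `build_vocab`, `make_chargrid` and `one_hot`, and the choice to map unknown characters to the background index are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical OCR output format: (character, x0, y0, x1, y1),
# with x1 and y1 exclusive, in pixel coordinates.
CharBox = Tuple[str, int, int, int, int]

BACKGROUND = 0  # index reserved for pixels not covered by any character


def build_vocab(chars: str) -> Dict[str, int]:
    """Map each distinct character to an integer index starting at 1."""
    return {ch: i + 1 for i, ch in enumerate(sorted(set(chars)))}


def make_chargrid(boxes: List[CharBox], vocab: Dict[str, int],
                  height: int, width: int) -> List[List[int]]:
    """Fill every pixel covered by a character box with that character's index."""
    grid = [[BACKGROUND] * width for _ in range(height)]
    for ch, x0, y0, x1, y1 in boxes:
        # Assumption: characters outside the vocabulary fall back to background.
        idx = vocab.get(ch, BACKGROUND)
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                grid[y][x] = idx
    return grid


def one_hot(grid: List[List[int]], num_classes: int) -> List[List[List[int]]]:
    """Turn the integer grid into a height x width x num_classes one-hot tensor."""
    return [[[1 if cell == k else 0 for k in range(num_classes)]
             for cell in row] for row in grid]


vocab = build_vocab("AB")
boxes = [("A", 0, 0, 2, 2), ("B", 2, 0, 4, 2)]
grid = make_chargrid(boxes, vocab, height=3, width=4)
encoded = one_hot(grid, num_classes=len(vocab) + 1)
```

In this toy example the top two rows of `grid` hold the indices for `A` and `B`, and the last row stays at the background index. The one-hot tensor `encoded` is the kind of input a segmentation network would consume.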
The evaluation metric measures how much work the extraction system saves compared with manual extraction of fields from invoices. Specifically, it counts the number of insertions, deletions, and modifications required to manually correct the fields extracted by the proposed invoice extraction system.
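A rough sketch of such a work-saved score is given below. It assumes the edit count is normalised as 1 - (insertions + deletions + modifications) / N, where N is the number of ground-truth field instances; that normalisation is not stated in these notes. The multiset matching rule in the hypothetical `count_edits` helper is also a simplification chosen for illustration.

```python
from collections import Counter
from typing import List, Tuple


def count_edits(predicted: List[str], truth: List[str]) -> Tuple[int, int, int]:
    """Return (insertions, deletions, modifications) needed to turn the
    predicted field values into the ground-truth values.

    Simplifying assumption: exact matches cost nothing, each leftover
    prediction is paired with a leftover ground-truth value as one
    modification, and whatever remains is a deletion or an insertion.
    """
    pred_left = Counter(predicted) - Counter(truth)   # predictions with no exact match
    true_left = Counter(truth) - Counter(predicted)   # ground truth not yet covered
    n_pred = sum(pred_left.values())
    n_true = sum(true_left.values())
    modifications = min(n_pred, n_true)
    deletions = n_pred - modifications
    insertions = n_true - modifications
    return insertions, deletions, modifications


def work_saved(predicted: List[str], truth: List[str]) -> float:
    """Score of 1.0 means no manual correction; it can go negative when
    fixing the output costs more edits than typing the fields from scratch."""
    if not truth:
        raise ValueError("ground truth must contain at least one field instance")
    ins, dele, mod = count_edits(predicted, truth)
    return 1 - (ins + dele + mod) / len(truth)


predicted = ["INV-1", "2018-01-01", "99.0"]
truth = ["INV-1", "2018-01-02", "99.0", "ACME"]
score = work_saved(predicted, truth)
```

Here one predicted date has to be modified and the vendor name has to be inserted, so two edits are counted against four ground-truth instances.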
The table below compares the performance of models that use only the text extracted by OCR (sequential), models that use only the scanned document without OCR (image-only), and the chargrid representation.
Chargrid performs significantly better than the sequential model on line-item fields such as description, quantity, and line amount, since these fields require visual features.
- Private dataset with 12000 invoice documents
- The invoices in the dataset do not share a single format; they span a large number of different formats.
- There are at most 6 invoices for any given format