Introduction

The extraction of relevant information from unstructured documents is one of the most important tasks in Natural Language Processing (NLP). Named Entity Recognition (NER) is a specific task of NLP which aims at extracting entities like Person, Location, Date etc. There are a number of applications of NER and it can be helpful to solve problems related to domains like Medical, Education, Legal etc. Some of the most common use cases are listed below:

  • Classifying digital contents
  • Efficient search algorithms
  • Content recommendation
  • Customer support

Approaches to entity extraction span a broad range from linguistic rule based to machine learning based, and some hybrid as well. NER is a widely studied topic in text mining research, and many new challenges are seen in domain-specific applications, such as biomedical, retail, legal, insurance etc.

The dictionary based method is a common technique as it play a key role in understanding the text. Most dictionary based NER systems focus on: (1) integrating and normalizing different biomedical databases to improve the quality of the dictionary to be used; (2) improving matching strategies that are more suitable for biomedical terminologies; and (3) making filtering rules for post processing to refine the matching results or to adjust the boundary of entities. Many information extraction systems had a dictionary matching module to perform preliminary detection of named entities. People have tried image segmentation based approaches as well like chargrid (Katti et al., 2018) or BERTgrid (Denk et al., 2019).

Applying machine learning techniques generally obtains superior performance for NER task. The automated learning process can induce patterns for recognizing entities and rules for pre and post processing. Generally speaking, there are two categories of machine learning based methods: one treats NER as a classification task, while the other treats NER as a sequence labeling task. The sequence labeling task can be approached by Hidden Markov Model (HMM), Conditional Random Field (CRF) , or a combination of different models. Since proposed in Lafferty et al., 2001, CRF has been applied to many sequence labeling tasks.

Problem statement

Extracting important information from documents as such as invoices and receipts, as shown in image below, by using layout, structure and content of the text.

Most of the important information in invoices will be in the form of key value pairs and tables. The keys can be on the left or on top of the values like in tables. The tag of a word might depend on the features of the words above and below it. The linear features won't be able to utilise these relevant features. Most of the techniques focused on entity extraction using CRF try to leverage the linear features, but in case of invoices, these methods misses some of the extremely important features. So, we have introduce the spatial features to make use of these extremely important features. For example, the most important feature for client name (Vishwas Anand) and seller name (Lunchbox) are  Delivery To and Ordered from respectively as shown in the image above.

Data preparation

We tagged around 1400 documents of which 1000 were used for training and 400 for validation.

For tagging, we used  Labelimg which helps us to assign a class to a rectangular region and gives an xml file for the same.

Then, we used Tesseract to get words and their bounding boxes in .tsv format.

Using the xml files and their corresponding tsv files, we assigned a particular class to each word i.e. all words inside a particular rectangle area will be assigned to one class.

The images used in the blog are dummy images and are similar to the dataset used for training the model. We have tagged 15 classes in the documents and datasets include around 39 different templates.

Implementation

We have used linear-chain crf with spatial features. Before moving to the implementation, let's explore what is CRF, how it works, and what is the importance of spatial features in extracting named entities from invoices.

What CRF is and how it works?

Definition: Let G be a factor graph over X and Y . Then (X,Y) is a conditional random field if for any value x of X, the distribution $p(\mathbf{Y} | \mathbf{X})$ factorizes according to G.

It is a class of Discriminative models used for prediction tasks. The idea is similar to logistic regression where we try to model $p(\mathbf{y} | \mathbf{x})$, but with a trick. Rather than using one weight vector per class, a single set of weights is shared across all the classes. The idea is to define a set of feature functions that are non zero only for a single class. To build the conditional field, we next assign each feature function a set of weights (lambda values), which the algorithm is going to learn. The model equation is described below:

$ p_{\theta}(y | x)=\frac{\exp \left(\sum_{j} w_{j} F_{j}(x, y)\right)}{\sum_{y^{\prime}} \exp \left(\sum_{j} w_{j} F_{j}\left(x, y^{\prime}\right)\right)} $, where $ F_{j}(x, y)=\sum_{i=1}^{L} f_{j}\left(y_{i-1}, y_{i}, x, i\right) $

Spatial features and their importance

An invoice is a type of report designed specifically to make it easy for a human to understand. is not constructed of coherent text lines, but mostly contains tables and small text blocks. For most important information, there is some hint associated. It might be the words to its left or to its right or it might be words above and below it. To capture the words above and below, we have introduced spatial features. It helps us utilize the dependency of a class to its surrounding words.

Let's move to the steps we followed to solve this problem:

  • First we cleaned the dataframes by removing the rows, if:
    • Tesseract confidence is less than 30%
    • Height of a word is greater than four times the average height of words or less than half of average height
    • Word width is greater than one third the image width or rows with empty strings or null values.
  • Initially, even after sorting the words by top and left, the words were out of sequence. It is because the difference in the heights of characters i.e. 'i' and 'f' have different heights or it might be because of distortions in the image. So, we assigned line numbers which involves two steps:
    • Get the connection between words based on lowest distance and lowest angle between them and for each word, store the next connected word, horizontal closest distance and line_number as horizontally_closest_features.
    • Sort it on the basis of top and left and give same line number to all the connected words by iterating the dataframe.
df = df.sort_values(by=["top", "left"])
line_no = 1
for idx, row in df.iterrows():
    if df.at[idx, "line_no"] >= 1: # ignore the assigned lines
        continue
    horizontally_closest_word = row["horizontally_closest_word"]
    df.at[idx, "line_no"] = line_no
    while( len(horizontally_closest_word) > 0 ) # ignore words to the left of current word
        # assign line number
        df.at[horizontally_closest_features[-1], "line_no"] = line_no
        horizontally_closest_word = df.at[horizontally_closest_features[-1], "horizontally_closest_word"]
    line_no += 1
df.sort_values(by=["line_no", "left"])

  • We then get Spacy features like labels, pos tags and whether a word is present in vocabulary or not using en_core_web_lg model of Spacy.
  • Then we extracted spatial features. To do that, we take the center of each word as origin and take a bounding box around the word with width as width of the invoice image and height almost thrice the height of the word, both above and below the origin. We then divide the bounding box into six quadrants each with angle 60° as shown in the image below and store the nearest word and its features for each quadrant i.e. distance and angle from center, pos tag, label for each quadrant form the current word.
  • We then get Spacy features like labels, pos tags and whether a word is present in vocabulary or not using en_core_web_lg model of Spacy.
  • Then we extracted spatial features. To do that, we take the center of each word as origin and take a bounding box around the word with width as width of the invoice image and height almost thrice the height of the word, both above and below the origin. We then divide the bounding box into six quadrants each with angle 60° as shown in the image below and store the nearest word and its features for each quadrant i.e. distance and angle from center, pos tag, label for each quadrant form the current word.

We then created features for each word as crfsuite takes dictionary of features as input for each word. The features used are listed below:

  • Simple features: It includes basic features like if there is digit in the word or not, whether it starts with capital letter or small letter, whether there is a special character in it, pattern of first for characters etc.

  • RegEx features: It includes features like weather its a part of a date, price, phone number, email etc.

  • Linear features: It includes features of the previous three and next three words to the current word. Here we we have taken words that has the same line numbers as the current word.

  • Spatial features: These features includes surrounding words and their features for each word.

We then trained the model using a library called sklearn crfsuite which uses Limited-memory BFGS (L-BFGS)  optimization algorithm and used Sklearn RandomizedSearchCV to find best c1 (the coefficient for L1 regularization) and c2 (the coefficient for L2 regularization). The F1 score without spatial features is 84.6% but with spatial features it increases to 90.3%.

The flat classification report for model with spatial features looks like:

Conclusion

Invoice-NER has more difficulties than normal NER because of the fact that invoices are not constructed of coherent text lines, but mostly contain tables and small text blocks. So, regular NLP algorithms won't work.

In this blog, we have explained how linear-chain crf with spatial features can be used to extract named entities from invoices. The linear-chain crf models the dependencies between tags of the previous and next words. However, in case of invoices, the tags of the words which are above and below a word are also dependent. The 2D-CRF like Grid CRF might capture these dependencies. One of the interesting direction will be trying it.

There are other approaches that might be worth trying:

  1. Instead of manually creating features, we can use LSTM with positional embedding (which might able to figure out some features which we haven't figured out yet) and use CRF on top of that
  2. Segmentation-based approach like Chargrid or BERTgrid

References

An Introduction to Conditional Random Fields

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

Chargrid: Towards Understanding 2D Documents

BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding