In this post, we discuss the 2-dimensional CRF (Conditional Random Field) based machine learning model that we have built for the task of named entity recognition from invoice document images. This blog post builds upon our previous blog on linear-chain CRF.
Given an image of an invoice document, the aim is to predict certain entities such as the invoice number, client address, phone number, total amount, tax, etc.
How is 2D CRF different from linear chain CRF?
- The main difference between the 2D CRF and the linear-chain CRF is that in the 2D CRF, the tag of a given word is assumed to depend on the tags of neighbouring words in all directions (for example, as shown in the figure above, the tag of a given word is assumed to depend on six surrounding words, the closest ones in each quadrant). In a linear-chain CRF, by contrast, the tag of a word is assumed to depend only on the tags of the words immediately to its left and right.
- With respect to the implementation, except for a few changes in the data preparation part, the pipeline used for the 2D CRF is similar to the linear-chain CRF pipeline described in our previous article on linear-chain CRF.
Data preparation for 2D CRF
A document is represented using the following information.
Node Features: Each word in the document is represented using certain features like is_digit, is_alphanumeric, is_date, word_pattern, word_lower, etc. Detailed descriptions of these features can be found in our linear-chain CRF blog. It may be noted that any string feature is converted into a binary vector using one-hot encoding.
If there are $N$ words in a document and each word has $K_v$ features, a NumPy array of dimension $N \times K_v$ is used to specify the word features.
Edge List: In addition to node features, for each document we have to specify a list of edges connecting neighbouring words. In our model, every node is given $6$ edges connecting it to its surrounding words, as shown in the above figure. In contrast, in a linear-chain CRF each word has only two edges, connecting it to the previous word and the next word.
The edge list is given in the form of a list of tuples, where each tuple (i, j) contains the indices of the two nodes connected by an edge.
Edge Features: In addition to node features, we can also assign features to every edge in the edge list, such as the distance between the two words connected by the edge and the direction in which the edge points. The distance feature might indicate how important the tag of a neighbour is in determining the tag of a given word; for example, surrounding words that are farther away might have less influence on the tag of the given word. Likewise, the direction in which a surrounding word lies (for example, above the given word or to its left) might be useful in understanding the relation between the words.
If there are $M$ edges in a document and each edge has $K_e$ features, a NumPy array of dimension $M \times K_e$ is used to specify the edge features.
Tag list: The tag (class) of each word in the document is given as a list.
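As a concrete illustration, the four pieces of information above can be assembled for a hypothetical four-word document as follows. The words, feature values, and tags here are made up for illustration; the real feature set is the one described in our linear-chain CRF blog.

```python
import numpy as np

# Hypothetical toy document with N = 4 words.
words = ["Invoice", "No", "12345", "2021-05-01"]

# Node features (K_v = 2 here, just is_digit and is_date for illustration);
# string features would first be one-hot encoded into binary columns.
node_features = np.array([
    [0, 0],  # "Invoice":    not a digit, not a date
    [0, 0],  # "No"
    [1, 0],  # "12345":      is_digit
    [0, 1],  # "2021-05-01": is_date
], dtype=float)

# Edge list: (i, j) index pairs connecting neighbouring words.
edges = np.array([[0, 1], [1, 2], [2, 3]])

# Edge features (K_e = 2 here): e.g. a normalised distance between the two
# words and a direction flag (0 = horizontal neighbour, 1 = vertical).
edge_features = np.array([
    [0.1, 0],
    [0.2, 0],
    [0.1, 0],
], dtype=float)

# Tag list: one class index per word (e.g. 0 = other, 1 = invoice_number, ...).
tags = np.array([0, 0, 1, 0])

print(node_features.shape)  # (4, 2), i.e. N x K_v
print(edge_features.shape)  # (3, 2), i.e. M x K_e
```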
We have used a package called PyStruct for implementing the 2D CRF. It supports mainly two variants of 2D CRF models: (i) GraphCRF and (ii) EdgeFeatureCRF. Depending on the nature of the problem, we should select one of these two models.
GraphCRF: This model assumes that all the edges are homogeneous (similar) and hence does not consider any features for the edges. In other words, a document is represented using only the node features and the edge list.
The number of learnable parameters of this model is $K_v C + C^2$, where $C$ is the number of classes (the named entities we consider) and $K_v$ is the number of node features. Here, the $K_v C$ parameters capture the contributions of the $K_v$ node features to the $C$ different classes. The $C^2$ parameters capture how likely it is for a given pair of classes to be assigned to the two ends of an edge.
EdgeFeatureCRF: This model treats each edge distinctly and hence also uses the edge features to represent a document. The number of learnable parameters of this model is $K_v C + K_e C^2$. Further, there is an option to specify whether a given edge feature is symmetric; for example, the distance feature of an edge is symmetric.
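To make the two parameter counts concrete, here is a small sketch of the formulas above. The value $K_v = 16712$ is an assumption on our part, chosen only because it is consistent with the parameter count reported for our GraphCRF experiment later in this post ($C = 20$ tags).

```python
def graph_crf_params(K_v, C):
    # K_v*C node-feature weights + C^2 pairwise (edge) weights
    return K_v * C + C ** 2

def edge_feature_crf_params(K_v, K_e, C):
    # K_v*C node-feature weights + K_e*C^2 pairwise weights,
    # one C x C block per edge feature
    return K_v * C + K_e * C ** 2

# Assumed K_v = 16712 one-hot node features, C = 20 classes:
print(graph_crf_params(16712, 20))  # 334640
```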
PyStruct supports different learning algorithms to train the CRF models; OneSlackSSVM, SubgradientSSVM, and NSlackSSVM are a few of them. While the technical details of these algorithms are beyond the scope of this blog, the following guidelines can help in choosing one.
SubgradientSSVM : Good for many datapoints, fast inference. Usually worse than FrankWolfeSSVM, but takes less memory. Good for obtaining reasonable solutions fast.
NSlackSSVM : Good for very few datapoints (hundreds), slow inference. Good for obtaining high precision solutions. This algorithm also supports selecting batch size while training.
OneSlackSSVM : Good for mid-size data sets (thousands), slow inference. Good for obtaining high precision solutions.
FrankWolfeSSVM : Good for fast inference, large datasets. Good for obtaining reasonable solutions fast.
We have considered 100 documents of which 70 are used for training and 30 for testing. We have used the GraphCRF model with NSlackSSVM learner. The details of the performance are as follows:
- Number of tags considered = 20
- Number of documents=100 (70 train, 30 test)
- F1 Score: test - 0.94, train - 0.95
- Data preparation time: 1.38 seconds
- Training time: 20 hours
- Prediction time: 3 minutes
- Number of learnable parameters: 334640
When we compared the linear-chain CRF with the GraphCRF, we observed that the linear-chain CRF gave a better F1 score of 0.97.