The objective of the paper is to extract structured information from unstructured documents such as invoices and receipts.

### Document Modeling

Each document is modeled as a graph of text segments, where each segment consists of its position and the text within it. Nodes represent text segments, and edges represent visual dependencies between two segments, such as relative shape and distance.

A document $\mathcal{D}$ is a tuple $(T, E)$, where $T = \left\{ t_1, t_2, \cdots, t_n \right\}$ is a set of $n$ text boxes (nodes), $R = \left\{ r_{ij} \right\}$ is the set of edge embeddings, and $E \subseteq T \times R \times T$ is a set of directed edges of the form $(t_i, r_{ij}, t_j)$, where $t_i, t_j \in T$ and $r_{ij} \in R$. The graph is fully connected: every node is connected to every other node.
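Since the graph is fully connected, the edge set is simply every ordered pair of distinct nodes. A minimal sketch (the function name and node representation are illustrative, not from the paper):

```python
from itertools import permutations

def build_edges(nodes):
    """Directed edge list for a fully connected document graph.

    nodes: a list of text boxes; returns every ordered index pair
    (i, j) with i != j, i.e. n * (n - 1) directed edges.
    """
    return list(permutations(range(len(nodes)), 2))
```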

#### Feature Extraction

For each node $t_i$, a node embedding $\mathbf{t}_i$ is computed with a single-layer Bi-LSTM that extracts features from the text content of the segment.

The edge embedding between nodes $t_i$ and $t_j$ is defined as

$$\mathbf{r}_{ij} = \left[ x_{ij},\, y_{ij},\, \frac{w_i}{h_i},\, \frac{h_j}{h_i},\, \frac{w_j}{h_i} \right]$$

where $x_{ij}$ and $y_{ij}$ are the horizontal and vertical distances between the two text boxes, and $w_i$ and $h_i$ are the width and height of the corresponding box. The last three values are the aspect ratio of node $t_i$ and the width and height of node $t_j$ relative to $h_i$.
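The edge embedding above can be sketched directly from two bounding boxes. This is a minimal illustration; the box format `(x, y, w, h)` and the choice to normalize the distances by $h_i$ are assumptions for the sketch:

```python
import numpy as np

def edge_embedding(box_i, box_j):
    """Edge embedding r_ij = [x_ij, y_ij, w_i/h_i, h_j/h_i, w_j/h_i].

    Each box is (x, y, w, h): top-left corner, width, height.
    Distances are normalized by the source box height h_i (an
    assumption here, so the features are scale-invariant).
    """
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    x_ij = (xj - xi) / hi   # horizontal distance
    y_ij = (yj - yi) / hi   # vertical distance
    return np.array([x_ij, y_ij, wi / hi, hj / hi, wj / hi])
```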

#### Graph Convolution

The convolution is defined on node-edge-node triplets $(t_i, r_{ij}, t_j)$ rather than on individual nodes. For node $t_i$ and each neighbour $t_j$, the visual and textual features are combined into a neighbourhood representation using a multi-layer perceptron (MLP):

$$\mathbf{h}_{ij} = g\left( \mathbf{t}_i, \mathbf{r}_{ij}, \mathbf{t}_j \right) = \operatorname{MLP}\left( \left[ \mathbf{t}_i \,\|\, \mathbf{r}_{ij} \,\|\, \mathbf{t}_j \right] \right)$$

where $\|$ denotes concatenation.
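A minimal numpy sketch of the triplet MLP. The embedding sizes, the single hidden layer, and the ReLU activation are assumptions for illustration; the paper's exact MLP architecture is not restated here:

```python
import numpy as np

rng = np.random.default_rng(0)

D_NODE, D_EDGE, D_HID = 8, 5, 16  # illustrative sizes, not from the paper

# One-layer MLP with ReLU over the concatenated triplet.
W1 = rng.standard_normal((D_HID, 2 * D_NODE + D_EDGE))
b1 = np.zeros(D_HID)

def triplet_features(t_i, r_ij, t_j):
    """h_ij = MLP([t_i || r_ij || t_j]) for one node-edge-node triplet."""
    x = np.concatenate([t_i, r_ij, t_j])
    return np.maximum(W1 @ x + b1, 0.0)  # ReLU
```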

The graph convolution is defined using a self-attention mechanism. Since the graph is fully connected, the output hidden representation of each node is computed by attending to all the other nodes.

The output embedding $\mathbf { t } _ { i } ^ { \prime }$ of the layer for node $t_{i}$ is computed by

$$\mathbf{t}_i^{\prime} = \sigma \left( \sum_{j \in \{1, \cdots, n\}} \alpha_{ij} \mathbf{h}_{ij} \right)$$

where $\alpha_{ij}$ are the attention coefficients and $\sigma$ is an activation function.

The attention coefficients are computed as

$$\alpha_{ij} = \frac{\exp\left( \operatorname{LeakyReLU}\left( \mathbf{w}_a^T \mathbf{h}_{ij} \right) \right)}{\sum_{j' \in \{1, \cdots, n\}} \exp\left( \operatorname{LeakyReLU}\left( \mathbf{w}_a^T \mathbf{h}_{ij'} \right) \right)}$$
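The two equations above (softmax attention over all neighbours, then the weighted sum) can be sketched as a single layer. The LeakyReLU slope and the choice of tanh for $\sigma$ are assumptions for the sketch:

```python
import numpy as np

def attention_layer(H, w_a):
    """Self-attention graph convolution over a fully connected graph.

    H:   (n, n, d) array with H[i, j] = h_ij from the triplet MLP.
    w_a: (d,) shared attention vector.
    Returns the (n, d) matrix of output node embeddings t'_i.
    """
    scores = H @ w_a                                      # (n, n): w_a^T h_ij
    scores = np.where(scores > 0, scores, 0.01 * scores)  # LeakyReLU
    scores -= scores.max(axis=1, keepdims=True)           # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)             # softmax over j
    out = np.einsum("ij,ijd->id", alpha, H)               # sum_j alpha_ij h_ij
    return np.tanh(out)                                   # sigma: tanh, as an example
```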

The node embedding $\mathbf{t}_i^{\prime}$ is concatenated with the word2vec embeddings of the tokens in the node, and the result is fed to a Bi-LSTM-CRF network for entity extraction.

#### Comments

- This work shows a novel way to represent documents as graphs in which visual and textual features combine naturally. Graph convolutions then fuse the multimodal features efficiently for downstream tasks.
- Adding metadata information to the features might boost the performance.
- Contextual language models such as BERT, ELMo, or ULMFiT could be used instead of word2vec for the token embeddings.