TabIO: Extraction of tables from unstructured documents and images with machine learning and graphs

1. Introduction

AgileDD has developed an approach, released as the open source project “TabIO”, for extracting tabular data from unstructured documents such as PDFs and images. In this approach, machine learning is applied to classify and model several parts of the table and the document, and those parts are tied together to model and extract tables using a dynamic decoding algorithm. The method was developed by AgileDD to adhere to the following principles:

  1. The approach must learn continuously from errors and human feedback

  2. The approach must be flexible to adapt to new types of documents if it does not perform well on those documents “out of the box”

  3. The approach must accept multiple types of cues, such as text layout and textual content, and in the future, even lines and figures.

  4. Extracting tables in isolation is sometimes not enough; there can be significant advantage in viewing each table in the context of the whole document.

2. Fundamental Concepts of the Approach

TabIO utilizes deep learning in combination with a Bayesian philosophy of combining prior and posterior probability scores, similar to automatic speech recognition. To introduce the table extraction method developed in this project, it helps to look at a Bayesian approach for finding the best hypotheses from observations. The Maximum A Posteriori (MAP) method picks the hypothesis H that maximizes the probability P(H|O), where O is the observation. This is typically modeled as a combination of priors and likelihoods:

H* = argmax_H P(H|O) = argmax_H P(O|H) · P(H)

where P(O|H) is the likelihood of the observation given the hypothesis and P(H) is the prior probability of the hypothesis.

Typically, in complex machine learning applications such as automatic speech recognition, priors and likelihoods are modeled separately, with the likelihoods of observations generated as outputs of machine or deep learning models. Documents, like speech or language, are sequential data: they contain sequences of lines, tabular or otherwise. The concept of processing documents line by line was inspired by Liu 2008 (see the reference at the end).

2.1 Line Classification

Several types of lines need to be modeled in documents: for example, dense lines that appear in paragraphs, and different types of sparse lines that appear in tables, titles, headings, footers, etc. Table 1 below shows the different labels we have identified to model documents. Lines are generally marked as double column or single column for the purpose of initial labeling when training the line classification models. In the extraction phase, single or double columns may be identified up front, and the rest of the processing may proceed independently of whether the text appears in a single-column or double-column setting.

Table 1: Taxonomy of line labels used to initiate the TabIO training of the models.

2.2 Priors

Prior probabilities, or simply “priors”, represent how likely a sequence is to occur in a document. We model H as a sequence of line classes and O as the actual observed lines. Certain sequences of lines occur with higher probability than others. For example, the sequence in Figure 1(a) is more likely to occur than the sequence in Figure 1(b). The sample in Figure 1(a) is captured from an actual document, with the line labels marked to the right of the image, while the sample in Figure 1(b) was created manually by jumbling up the lines. The significance of priors is that even if machine learning misclassifies a few lines in a document, a low prior probability, as in the case of Figure 1(b), will reduce the overall probability of such a hypothesis. While the example in Figure 1(b) is artificially generated, it can happen in real situations that machine learning infers a line in the middle of a paragraph to be a sparse table line; but in the absence of more table lines, the prior probability of a sequence of dense lines containing only one table line will be small. Instead, a lower-likelihood sequence in which all the lines are classified as dense lines will dominate because of its higher prior.

Prior probabilities are generally generated as bigrams or trigrams, or alternatively, they may be generated using a deep neural network trained on the sequences of line classes observed in the training data. Trigram probabilities give the probability of a class given the previous two classes; bigram probabilities give the probability of a class given one previous class. In this project, trigram probabilities are computed from data and applied in decoding.
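As an illustration, trigram priors of this kind can be estimated by counting label trigrams in the training sequences and normalizing by the corresponding bigram counts. This is only a minimal sketch: the label names below are illustrative, not TabIO’s actual taxonomy, and no smoothing is applied.

```python
from collections import Counter

def trigram_priors(sequences):
    """Estimate P(c3 | c1, c2) from training label sequences.

    `sequences` is a list of lists of line labels, one list per
    document. Start-of-sequence padding lets the first two lines
    also receive trigram probabilities.
    """
    tri = Counter()
    bi = Counter()
    for seq in sequences:
        padded = ["<s>", "<s>"] + seq
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1
    # normalize each trigram count by its history (bigram) count
    return {k: v / bi[(k[0], k[1])] for k, v in tri.items()}

priors = trigram_priors([
    ["dense", "dense", "dense", "sparse_table", "sparse_table"],
    ["dense", "dense", "sparse_table", "sparse_table", "dense"],
])
```

In a real system these counts would come from the labeled line sequences of the training documents, and unseen trigrams would be smoothed rather than left with zero probability.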

a) A highly likely sequence of lines and corresponding labels. Note that the single/double column distinction is removed for simplicity.

b) A highly unlikely sequence of lines and corresponding labels. Note that the single/double column distinction is removed for simplicity.

Figure 1: Highly likely and highly unlikely sequences of line types in documents

2.3 Likelihoods

Likelihoods can be thought of as probabilities of seeing an observation given a hypothesis. In this project, the hypotheses are sequences of line labels. The observations are the lines themselves, but before machine learning based likelihoods can be computed, raw lines need to be mapped to feature vectors. These feature vectors are then the inputs to machine learning based classifiers, and the probabilistic outputs of those classifiers are used as likelihoods. In Liu 2008, features consisted of several hand-designed measurements made on the lines, including text layout and orthographic features. To refrain from hand-designing features, automatically generated features have been utilized in this project, including:

  1. Deep learning based layout features for line-wise classification

  2. Lexical and orthographic features extracted using normalized word or character counts, for example, TF-IDF transform

Figures 2(a) and 2(b) show examples of likelihood computation for a sparse table line and a dense paragraph line, respectively. Low and high likelihoods are shown for the different classes, as produced by a well-trained machine learning model.

a) Likelihoods generated for a sparse line from a table.

b) Likelihoods generated from a dense line from a paragraph in a document

Figure 2: Sample likelihoods generated from machine/deep learning based classifiers.

2.4 Hypothesis sequences and Viterbi

Each line in a document is assigned a probability for each of the classes by the machine/deep learning classifiers. Therefore, a large number of hypothesis sequences are possible for a set of lines in a document. The best hypothesis sequence is the one with the highest product of likelihoods and priors for the classes in that sequence. Figure 3(a) shows a correct sequence in which the likelihoods generated by the machine learning system are high, as expected from reasonably accurate classifiers.

But we could also consider another hypothesis sequence in which some of the lines are incorrectly labeled. For an accurate machine learning classifier, the likelihoods are low where the line label outputs are wrong. This case is shown in Figure 3(b). It should be noted that in Figure 3(b), while some of the likelihoods generated by machine learning are low, the priors are still high. This is because the sequence of labels in Figure 3(b) is still plausible in documents. This is the opposite of the case described in Section 2.2 above, where low prior probabilities reduced the overall score of a sequence.

As described above, there can be a large number of hypotheses: if there are N classes and L lines, then there are N^L possible hypothesis sequences. Yet not all hypothesis sequences need to be searched through. Likelihoods and priors are combined and searched optimally using graphs. Viterbi decoding allows the hypothesis selection process to be modeled as a graph search problem in which each path through the graph is a hypothesis. During this graph search, paths with low overall scores (products of priors and likelihoods) are purged and only high-scoring search paths are kept.
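The search just described can be sketched as follows. For brevity the sketch uses bigram priors (TabIO uses trigrams, which simply enlarges the per-step state to a pair of classes), works in log space to avoid underflow, and uses illustrative class names and probabilities rather than TabIO’s actual values.

```python
import math

def viterbi(likelihoods, priors, classes):
    """Find the label sequence maximizing the product of priors and likelihoods.

    likelihoods: one dict per line, mapping class -> P(line | class).
    priors: dict mapping (prev_class, class) -> transition probability.
    """
    # scores[c] = best log-probability of any path ending in class c
    scores = {c: math.log(likelihoods[0].get(c, 1e-12)) for c in classes}
    back = []
    for obs in likelihoods[1:]:
        new_scores, pointers = {}, {}
        for c in classes:
            best_prev = max(
                classes,
                key=lambda p: scores[p] + math.log(priors.get((p, c), 1e-12)),
            )
            pointers[c] = best_prev
            new_scores[c] = (
                scores[best_prev]
                + math.log(priors.get((best_prev, c), 1e-12))
                + math.log(obs.get(c, 1e-12))
            )
        back.append(pointers)
        scores = new_scores
    # backtrack from the best final class
    last = max(scores, key=scores.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# A lone "table" line surrounded by dense lines: the classifier weakly
# prefers "table" on line 2, but the priors favor an all-dense path.
likelihoods_example = [
    {"dense": 0.9, "table": 0.1},
    {"dense": 0.4, "table": 0.6},
    {"dense": 0.9, "table": 0.1},
]
priors_example = {
    ("dense", "dense"): 0.9, ("dense", "table"): 0.1,
    ("table", "table"): 0.8, ("table", "dense"): 0.2,
}
path = viterbi(likelihoods_example, priors_example, ["dense", "table"])
```

The example mirrors the situation described in the Priors section: the spurious mid-paragraph “table” likelihood is overridden by the priors, and the decoded path labels all three lines as dense.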

a) A case of correct sequence hypothesis such that both priors and likelihoods are high.

b) A case of incorrect sequence hypothesis such that priors are high but likelihoods are lower than case (a)

Figure 3: Hypothesis sequences for correct and incorrect class labels.

The sequential prior probabilities and the likelihoods from machine learning are combined with the Viterbi algorithm for sequential decoding.

2.5 Overall Flow Diagram for Training

A flow diagram of the overall procedure for training the table extraction system is presented in Figure 4 below. An input document for training is first processed by an optical character recognition (OCR) system, although it should be noted that OCR is not included in the open source release. AgileDD’s proprietary iQC platform was used to generate OCR in this project, but users may apply other OCR technologies as long as the OCR text output is overlaid in a PDF file. Plain text output from OCR is not suitable for use with TabIO. The following steps are undertaken after the OCR has been applied to the document:

  1. Character locations: Location and size of each character is extracted as an X,Y coordinate on each page of the document using the PDFBOX tool

  2. Line tokenization: Lines are identified by grouping the characters according to their X,Y locations on the page, as described in the next section

  3. Integration of image labels and line tokenization: A label is assigned to each line according to the line location and the label from the LabelImg labeling

  4. Feature extraction for column detection: Vision-based layout features are extracted for each line as detailed in a section below

  5. Feature extraction for line classification: Vision-based layout features and lexical features are extracted for each line within each column

  6. Training of column detector deep learning model: A convolutional neural network is trained to detect one-column versus two-column layout for each line

  7. Training of line classification deep learning model: A convolutional neural network is trained to classify each line into one of the desired line categories
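Step 2 above, line tokenization, can be sketched as a simple grouping of characters by vertical position. This is a minimal illustration, not TabIO’s exact heuristic: the `(x, y, text)` tuple format and the pixel tolerance are assumptions, standing in for the character coordinates that PDFBOX provides.

```python
def tokenize_lines(chars, y_tol=2.0):
    """Group characters into text lines by vertical position.

    `chars` is a list of (x, y, text) tuples. Characters whose y
    coordinates fall within `y_tol` pixels of the previous character
    (scanning top to bottom) are placed on the same line.
    """
    lines = []
    for ch in sorted(chars, key=lambda c: (c[1], c[0])):
        if lines and abs(lines[-1][-1][1] - ch[1]) <= y_tol:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # join characters left-to-right within each line
    return ["".join(c[2] for c in sorted(line, key=lambda c: c[0]))
            for line in lines]

chars = [(10, 100, "H"), (18, 101, "i"),
         (10, 120, "B"), (18, 119, "y"), (26, 120, "e")]
lines = tokenize_lines(chars)
```

Note that slight y jitter between characters on the same baseline (as in the example) is absorbed by the tolerance, while a new baseline starts a new line.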

Figure 4: Overall training architecture

2.6 Integration of image labels and line tokenization

LabelImg helps label regions on images rather than labeling each line. The goal in this project was to build classifiers for lines; therefore, this process adds an additional software requirement, since individual text lines still need to be assigned labels. This has been done by utilizing the open source tool PDFBOX, which provides the location in pixels of each character in a text PDF file. By combining the character locations from PDFBOX with the region labels from LabelImg, a label can be assigned to each line. Figure 5 shows an output of this process, where the LabelImg labels are shown in green, and the lines and other text in the document obtained from PDFBOX are shown in red. The red boxes, representing the lines, were created as a union of individual masks around each character, using the coordinates and the sizes of the characters from PDFBOX.
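One simple way to perform this assignment, sketched below under assumed conventions (boxes as `(x0, y0, x1, y1)` tuples, regions as dicts with `"box"` and `"label"` keys), is to give each line the label of the region its bounding box overlaps the most. This is an illustrative heuristic, not taken from TabIO’s code.

```python
def assign_label(line_box, regions):
    """Return the label of the region overlapping `line_box` the most,
    or None when the line falls outside every labeled region."""
    def overlap(a, b):
        # area of the intersection of two (x0, y0, x1, y1) boxes
        w = min(a[2], b[2]) - max(a[0], b[0])
        h = min(a[3], b[3]) - max(a[1], b[1])
        return max(w, 0) * max(h, 0)
    best = max(regions, key=lambda r: overlap(line_box, r["box"]))
    return best["label"] if overlap(line_box, best["box"]) > 0 else None

regions = [
    {"box": (0, 0, 100, 50), "label": "table"},
    {"box": (0, 50, 100, 100), "label": "paragraph"},
]
```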

Figure 5: LabelImg boxed regions shown along with lines extracted from PDFBOX

2.7 Feature extraction

The goal of this project was to be able to apply both text layout features and the lexical and orthographic features in the classification of table lines. The extraction of layout and lexical features are described in more detail in this section. To calculate the layout features of each line, the following procedure is followed:

(a) A set of N lines above and N lines below are selected for use in feature extraction in addition to the line under analysis, making it a total of 2N+1 lines.

(b) By taking the locations of the characters from the PDFBOX output and their sizes, a rectangular mask is applied on each character to completely cover the character

(c) An image is generated that is composed of the masked characters

(d) Background artifacts, such as table rules and figure contents, are included in this image. These artifacts are necessary to distinguish table lines from lines that appear inside figures.

(e) The final image used for machine learning is an image that contains masked characters and the background artifacts, as shown below

Figure 6: Layout feature extraction for a paragraph line on the left and for a table line on the right. The background table separating lines are shown in red.
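Steps (b) through (e) can be sketched as rendering the character bounding boxes as filled rectangles on top of the background artifacts. The `(x, y, w, h)` box format and grayscale rendering are assumptions for illustration, not TabIO’s exact pipeline.

```python
import numpy as np

def layout_image(page_shape, char_boxes, background=None):
    """Render character boxes as filled rectangles on a grayscale image.

    page_shape: (height, width) of the output image.
    char_boxes: list of (x, y, w, h) character bounding boxes.
    background: optional array of background artifacts (table rules,
    figure content) that the character masks are drawn over.
    """
    img = (np.zeros(page_shape, dtype=np.uint8)
           if background is None else background.copy())
    for x, y, w, h in char_boxes:
        img[y:y + h, x:x + w] = 255  # mask completely covers the character
    return img
```

In practice the window of 2N+1 lines described in step (a) would be cropped from such a page image to form the input for the line under analysis.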

To generate the lexical and orthographic features, a standard word count method followed by normalization is applied: the TF-IDF transform. The following steps are taken to calculate the lexical and orthographic features:

(a) A TF-IDF transform is computed for both words and characters

(b) A singular value decomposition (SVD) is applied separately to the word transform and to the character transform

(c) The SVD output for the word transform and the SVD output for the character transform are appended to each other to get the final numerical vector representation of the lexical and orthographic content of the line
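The three steps above can be sketched with scikit-learn, assuming its `TfidfVectorizer` and `TruncatedSVD` as stand-ins for the actual implementation. The component counts and character n-gram range below are illustrative; real settings would use many more components.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def lexical_features(lines, n_word=2, n_char=2):
    """TF-IDF at word and character level, each reduced with SVD,
    then concatenated into one vector per line."""
    word_tfidf = TfidfVectorizer(analyzer="word").fit_transform(lines)
    char_tfidf = TfidfVectorizer(
        analyzer="char", ngram_range=(1, 3)).fit_transform(lines)
    w = TruncatedSVD(n_components=n_word).fit_transform(word_tfidf)
    c = TruncatedSVD(n_components=n_char).fit_transform(char_tfidf)
    # step (c): append the word-level and character-level SVD outputs
    return np.hstack([w, c])

feats = lexical_features([
    "total revenue 1200",
    "net income 300",
    "the quick brown fox",
    "see table below",
])
```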

2.8 Deep Learning Architecture

The deep learning architecture shown in Figure 7 below is applied for line classification. In this architecture, the layout image passes through the usual convolutional layers, similar to convolutional networks used for image classification. The output of the convolutional layers is then appended with the lexical features, i.e., the concatenation of the SVD outputs for the word and character models. Please note that the “label” is known during training, and not known when the network is applied for inference.
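The fusion of convolutional layout features with lexical features can be sketched in PyTorch as below. The layer sizes, image dimensions, and class count are illustrative assumptions, not the actual TabIO architecture.

```python
import torch
import torch.nn as nn

class LineClassifier(nn.Module):
    """CNN over the layout image; its output is concatenated with the
    lexical feature vector before the fully connected classification head."""

    def __init__(self, n_lexical, n_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),  # fixed-size output regardless of image size
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 4 * 4 + n_lexical, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, image, lexical):
        # concatenate layout (conv) features with lexical (SVD) features
        return self.head(torch.cat([self.conv(image), lexical], dim=1))

model = LineClassifier(n_lexical=20, n_classes=7)
logits = model(torch.zeros(2, 1, 64, 256), torch.zeros(2, 20))
```

A softmax over the logits would give the per-class likelihood scores that feed the Viterbi decoding described earlier.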

Figure 7: Deep learning training that combines lexical and layout image features

3 Inference Method

Inference follows a similar process to training, except that the labels are not known and are generated by machine learning and Viterbi decoding. The following steps are undertaken for the inference process on a PDF document that already has OCR applied. The flow diagram of this process is shown in Figure 8.

  1. Character locations: Location and size of each character is extracted as an X,Y coordinate on each page of the document using the PDFBOX tool

  2. Line tokenization: Lines are identified by grouping the characters according to their X,Y locations on the page

  3. Feature extraction for column detection: Vision-based layout features are extracted for each line.

  4. Column detection: Each line is classified as single column or double column. Subsequent processing is then done within each column.

  5. Feature extraction for line classification: Vision-based layout features and lexical features are extracted for each line within each column

  6. Line classification: Each line is assigned a likelihood score for each class using the line classification model

  7. Viterbi decoding: Viterbi algorithm is finally applied to combine likelihood scores with priors in a dynamic graph search

Figure 8: Flowchart of TabIO inference process


Y. Liu, P. Mitra and C. L. Giles, “Identifying Table Boundaries in Digital Documents via Sparse Line Detection”, CIKM 2008
