Capturing data in complex layouts

[Image: screenshot, 2021-10-20]

Text analytics and information extraction systems generally apply machine learning and natural language processing methods to plain text retrieved from documents. When plain text is obtained from PDF files and other layout-aware formats such as MS Office documents, the complex textual layouts, orientations and tabular structures are lost. iQC, on the other hand, builds a uniform internal representation of a document regardless of its input format. This representation preserves the location of every character in the document, whether the characters come from OCR or already exist in a text PDF or MS Office document.
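To make the idea of a position-preserving representation concrete, here is a minimal Python sketch. It is not iQC's actual implementation: the `Char` record, the `layout_chars` helper (a toy stand-in for a real PDF or OCR parser, using a fixed character width) and the `text_in_region` query are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    """One character plus its bounding box (hypothetical schema)."""
    text: str
    page: int
    x0: float
    y0: float
    x1: float
    y1: float
    source: str  # "ocr" or "native", i.e. where the character came from


def layout_chars(line_text, page, x, y, char_width=6.0, line_height=12.0,
                 source="native"):
    """Place the characters of one line on a fixed-width grid.

    A real system would take boxes from a PDF parser or an OCR engine;
    the fixed width here is an assumption to keep the sketch self-contained.
    """
    return [
        Char(ch, page, x + i * char_width, y,
             x + (i + 1) * char_width, y + line_height, source)
        for i, ch in enumerate(line_text)
    ]


def text_in_region(chars, page, x0, y0, x1, y1):
    """Return the text of characters fully inside a region, in reading order."""
    hits = [
        c for c in chars
        if c.page == page
        and c.x0 >= x0 and c.x1 <= x1
        and c.y0 >= y0 and c.y1 <= y1
    ]
    hits.sort(key=lambda c: (c.y0, c.x0))  # top to bottom, then left to right
    return "".join(c.text for c in hits)


# A label and its value sit on the same line but in different columns.
chars = (
    layout_chars("Calories", page=0, x=0.0, y=0.0)
    + layout_chars("250", page=0, x=200.0, y=0.0, source="ocr")
)

label = text_in_region(chars, page=0, x0=0, y0=0, x1=100, y1=12)
value = text_in_region(chars, page=0, x0=200, y0=0, x1=300, y1=12)
print(label, value)
```

Because every character keeps its coordinates, a value can be selected by where it sits on the page rather than by its position in a flattened text stream, and native and OCR characters can be queried the same way.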

Using feedback that users give through its point-and-click interface, iQC automatically builds machine learning and deep learning models that capture data while accounting for complex layouts and text orientations. It applies this layout-aware method to extract textual metadata and tabular data, and to perform page-level and document-level categorization or sentiment analysis. All of these data capture modules are human-in-the-loop systems that learn from subject matter experts. This makes iQC well suited for data capture from the complex documents common in industries such as mining, energy, construction and engineering. iQC learns from users to extract data from both semi-structured textual contexts and full sentences. In the picture above, the value of "Calories" is extracted from a semi-structured context and the value of "Cooking Temperature" from a sentence.