Robust Information Extraction From Unstructured Documents

Namysł, Marcin

Author

Namysł, Marcin

ORCID

https://orcid.org/0000-0001-7066-1726

Thesis type

Dissertation

Date of examination

7 December 2022

Date of publication

3 January 2023

First reviewer

Behnke, Sven

Second reviewer

Bauckhage, Christian

Degree-granting institution

Rheinische Friedrich-Wilhelms-Universität Bonn

Citable links

Handle: https://hdl.handle.net/20.500.11811/10560
URN: https://nbn-resolving.org/urn:nbn:de:hbz:5-69216

Abstract

In computer science, robustness can be thought of as the ability of a system to handle erroneous or nonstandard input during execution. This thesis studies the robustness of the methods that extract structured information from unstructured documents containing human language texts. Unfortunately, these methods usually suffer from various problems that prevent achieving robustness to the nonstandard inputs encountered during system execution in real-world scenarios.
Throughout the thesis, the key components of the information extraction workflow are analyzed, and several novel techniques and enhancements that improve the robustness of this process are presented. First, the thesis presents a deep learning-based text recognition method that can be trained almost exclusively on synthetically generated documents, together with a novel data augmentation technique that improves the accuracy of text recognition on low-quality documents. Moreover, a novel noise-aware training method is introduced that encourages neural network models to build a noise-resistant latent representation of the input. This approach is shown to improve the accuracy of sequence labeling performed on misrecognized and mistyped text. Further improvements in robustness are achieved by applying noisy language modeling to learn a meaningful representation of misrecognized and mistyped natural language tokens. Furthermore, for the restoration of structural information from documents, a holistic table extraction system is presented. It exhibits high recognition accuracy in a scenario where raw documents are used as input and the target information is contained in tables. Finally, this thesis introduces a novel method for evaluating the table recognition process that works in a scenario where the exact location of table objects on a page is not available in the ground-truth annotations.
Experimental results are presented on optical character recognition, named entity recognition, part-of-speech tagging, syntactic chunking, and table recognition and interpretation, demonstrating the advantages and the utility of the presented approaches. Moreover, the code and the resources from most of the experiments have been made publicly available to facilitate future research on improving the robustness of information extraction systems.

Keywords

robustness, information extraction, computational linguistics, natural language processing, NLP, text recognition, optical character recognition, OCR, data augmentation, alpha compositing, synthetic document generation, noise-aware training, sequence labeling, named entity recognition, NER, embeddings, language modeling, OCR errors, misspellings, artificial error generation, empirical error modeling, error correction, unsupervised data generation, noisy language modeling, parallel data generation, table extraction, table recognition, semantic table interpretation, maximum weight matching

Classification (DDC)

004 Computer science

Related publication(s)

https://doi.org/10.1109/ICDAR.2019.00055
https://doi.org/10.18653/v1/2020.acl-main.138
https://doi.org/10.18653/v1/2021.findings-acl.27
https://doi.org/10.5220/0010767600003124
https://doi.org/10.1093/bioinformatics/btab843

Suggested citation

Namysł, Marcin: Robust Information Extraction From Unstructured Documents. - Bonn, 2023. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-69216

BibTeX

@phdthesis{handle:20.500.11811/10560,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-69216},
author = {Namysł, Marcin},
title = {Robust Information Extraction From Unstructured Documents},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2023,
month = jan,
url = {https://hdl.handle.net/20.500.11811/10560}
}

The following terms of use are associated with this resource: