Robust Information Extraction From Unstructured Documents

Namysł, Marcin

dc.contributor.advisor	Behnke, Sven
dc.contributor.author	Namysł, Marcin
dc.date.accessioned	2023-01-03T09:36:44Z
dc.date.available	2023-01-03T09:36:44Z
dc.date.issued	03.01.2023
dc.identifier.uri	https://hdl.handle.net/20.500.11811/10560
dc.description.abstract	In computer science, robustness can be thought of as the ability of a system to handle erroneous or nonstandard input during execution. This thesis studies the robustness of methods that extract structured information from unstructured documents containing human language text. Unfortunately, these methods usually suffer from various problems that prevent them from achieving robustness to the nonstandard inputs encountered in real-world scenarios. Throughout the thesis, the key components of the information extraction workflow are analyzed, and several novel techniques and enhancements that improve the robustness of this process are presented. First, a deep learning-based text recognition method, which can be trained almost exclusively on synthetically generated documents, and a novel data augmentation technique, which improves the accuracy of text recognition on low-quality documents, are presented. Moreover, a novel noise-aware training method that encourages neural network models to build a noise-resistant latent representation of the input is introduced. This approach is shown to improve the accuracy of sequence labeling performed on misrecognized and mistyped text. Further improvements in robustness are achieved by applying noisy language modeling to learn a meaningful representation of misrecognized and mistyped natural language tokens. Furthermore, for the restoration of structural information from documents, a holistic table extraction system is presented. It exhibits high recognition accuracy in a scenario where raw documents are used as input and the target information is contained in tables. Finally, this thesis introduces a novel method for evaluating the table recognition process that works in a scenario where the exact location of table objects on a page is not available in the ground-truth annotations.
Experimental results are presented on optical character recognition, named entity recognition, part-of-speech tagging, syntactic chunking, and table recognition and interpretation, demonstrating the advantages and the utility of the presented approaches. Moreover, the code and the resources from most of the experiments have been made publicly available to facilitate future research on improving the robustness of information extraction systems.	en
dc.language.iso	eng
dc.rights	In Copyright
dc.rights.uri	http://rightsstatements.org/vocab/InC/1.0/
dc.subject	Robustheit
dc.subject	Informationsextraktion
dc.subject	Computerlinguistik
dc.subject	NLP
dc.subject	Texterkennung
dc.subject	optische Zeichenerkennung
dc.subject	OCR
dc.subject	Generierung synthetischer Dokumente
dc.subject	Noise-Aware Training
dc.subject	Sequence Labeling
dc.subject	Eigennamenerkennung
dc.subject	NER
dc.subject	Einbettungen
dc.subject	Sprachmodellierung
dc.subject	OCR-Fehler
dc.subject	Rechtschreibfehler
dc.subject	künstliche Fehlererzeugung
dc.subject	empirische Fehlermodellierung
dc.subject	Fehlerkorrektur
dc.subject	unüberwachte Datengenerierung
dc.subject	parallele Datengenerierung
dc.subject	Tabellenextraktion
dc.subject	Tabellenerkennung
dc.subject	semantische Tabelleninterpretation
dc.subject	robustness
dc.subject	information extraction
dc.subject	natural language processing
dc.subject	text recognition
dc.subject	optical character recognition
dc.subject	data augmentation
dc.subject	alpha compositing
dc.subject	synthetic document generation
dc.subject	named entity recognition
dc.subject	embeddings
dc.subject	OCR errors
dc.subject	misspellings
dc.subject	artificial error generation
dc.subject	empirical error modeling
dc.subject	error correction
dc.subject	unsupervised data generation
dc.subject	noisy language modeling
dc.subject	parallel data generation
dc.subject	table extraction
dc.subject	table recognition
dc.subject	semantic table interpretation
dc.subject	maximum weight matching
dc.subject.ddc	004 Informatik
dc.title	Robust Information Extraction From Unstructured Documents
dc.type	Dissertation oder Habilitation
dc.publisher.name	Universitäts- und Landesbibliothek Bonn
dc.publisher.location	Bonn
dc.rights.accessRights	openAccess
dc.identifier.urn	https://nbn-resolving.org/urn:nbn:de:hbz:5-69216
dc.relation.doi	https://doi.org/10.1109/ICDAR.2019.00055
dc.relation.doi	https://doi.org/10.18653/v1/2020.acl-main.138
dc.relation.doi	https://doi.org/10.18653/v1/2021.findings-acl.27
dc.relation.doi	https://doi.org/10.5220/0010767600003124
dc.relation.doi	https://doi.org/10.1093/bioinformatics/btab843
ulbbn.pubtype	Erstveröffentlichung
ulbbnediss.affiliation.name	Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location	Bonn
ulbbnediss.thesis.level	Dissertation
ulbbnediss.dissID	6921
ulbbnediss.date.accepted	07.12.2022
ulbbnediss.institute	Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaet	Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee	Bauckhage, Christian
ulbbnediss.contributor.orcid	https://orcid.org/0000-0001-7066-1726
ulbbnediss.contributor.gnd	1279843756

Files in this item

Name: 6921.pdf
Size: 5.4MB
Format: PDF

This item appears in the following Collection(s)

E-Dissertationen (4581)