Robust Information Extraction From Unstructured Documents

dc.contributor.advisor: Behnke, Sven
dc.contributor.author: Namysł, Marcin
dc.date.accessioned: 2023-01-03T09:36:44Z
dc.date.available: 2023-01-03T09:36:44Z
dc.date.issued: 2023-01-03
dc.identifier.uri: https://hdl.handle.net/20.500.11811/10560
dc.description.abstract: In computer science, robustness can be thought of as the ability of a system to handle erroneous or nonstandard input during execution. This thesis studies the robustness of methods that extract structured information from unstructured documents containing human language texts. These methods usually suffer from various problems that prevent them from achieving robustness to the nonstandard inputs encountered in real-world scenarios.
Throughout the thesis, the key components of the information extraction workflow are analyzed, and several novel techniques and enhancements that improve the robustness of this process are presented. First, the thesis presents a deep learning-based text recognition method that can be trained almost exclusively on synthetically generated documents, together with a novel data augmentation technique that improves the accuracy of text recognition on low-quality documents. Second, it introduces a novel noise-aware training method that encourages neural network models to build a noise-resistant latent representation of the input. This approach is shown to improve the accuracy of sequence labeling performed on misrecognized and mistyped text. Further improvements in robustness are achieved by applying noisy language modeling to learn a meaningful representation of misrecognized and mistyped natural language tokens. Third, for the restoration of structural information from documents, a holistic table extraction system is presented. It exhibits high recognition accuracy in scenarios where raw documents are used as input and the target information is contained in tables. Finally, the thesis introduces a novel evaluation method for the table recognition process that works in scenarios where the exact location of table objects on a page is not available in the ground-truth annotations.
Experimental results are presented on optical character recognition, named entity recognition, part-of-speech tagging, syntactic chunking, and table recognition and interpretation, demonstrating the advantages and the utility of the presented approaches. Moreover, the code and the resources from most of the experiments have been made publicly available to facilitate future research on improving the robustness of information extraction systems.
en
dc.language.iso: eng
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: computational linguistics
dc.subject: NLP
dc.subject: OCR
dc.subject: noise-aware training
dc.subject: sequence labeling
dc.subject: NER
dc.subject: language modeling
dc.subject: robustness
dc.subject: information extraction
dc.subject: natural language processing
dc.subject: text recognition
dc.subject: optical character recognition
dc.subject: data augmentation
dc.subject: alpha compositing
dc.subject: synthetic document generation
dc.subject: named entity recognition
dc.subject: embeddings
dc.subject: OCR errors
dc.subject: misspellings
dc.subject: artificial error generation
dc.subject: empirical error modeling
dc.subject: error correction
dc.subject: unsupervised data generation
dc.subject: noisy language modeling
dc.subject: parallel data generation
dc.subject: table extraction
dc.subject: table recognition
dc.subject: semantic table interpretation
dc.subject: maximum weight matching
dc.subject.ddc: 004 Computer science
dc.title: Robust Information Extraction From Unstructured Documents
dc.type: Dissertation or Habilitation
dc.publisher.name: Universitäts- und Landesbibliothek Bonn
dc.publisher.location: Bonn
dc.rights.accessRights: openAccess
dc.identifier.urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-69216
dc.relation.doi: https://doi.org/10.1109/ICDAR.2019.00055
dc.relation.doi: https://doi.org/10.18653/v1/2020.acl-main.138
dc.relation.doi: https://doi.org/10.18653/v1/2021.findings-acl.27
dc.relation.doi: https://doi.org/10.5220/0010767600003124
dc.relation.doi: https://doi.org/10.1093/bioinformatics/btab843
ulbbn.pubtype: First publication
ulbbnediss.affiliation.name: Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location: Bonn
ulbbnediss.thesis.level: Dissertation
ulbbnediss.dissID: 6921
ulbbnediss.date.accepted: 2022-12-07
ulbbnediss.institute: Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaet: Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee: Bauckhage, Christian
ulbbnediss.contributor.orcid: https://orcid.org/0000-0001-7066-1726
ulbbnediss.contributor.gnd: 1279843756