Hybrid Representation Learning for Information Extraction

Deußer, Tobias Kurt Stefan

Author

Deußer, Tobias Kurt Stefan

ORCID

https://orcid.org/0000-0003-4685-0847

Type of thesis

Dissertation

Date of examination

6 February 2026

Date of publication

25 March 2026

First reviewer

Sifa, Rafet

Second reviewer

Bauckhage, Christian

Degree-granting institution

Rheinische Friedrich-Wilhelms-Universität Bonn

Citable links

Handle: https://hdl.handle.net/20.500.11811/14010
URN: https://nbn-resolving.org/urn:nbn:de:hbz:5-87999
DOI: https://doi.org/10.48565/bonndoc-823

Abstract

In the contemporary digital era, the exponential increase in unstructured and semi-structured data has made information extraction a cornerstone of modern data-driven research and application. The ability to transform such raw information into structured knowledge is crucial for enabling downstream tasks. While traditional rule-based and statistical approaches to information extraction have demonstrated success in narrow, well-defined tasks, they lack the scalability and adaptability required to address the vastness and variability of present-day data. Conversely, deep neural models, and especially large language models, have shown remarkable capabilities in language understanding, yet they remain constrained by high computational costs and susceptibility to hallucination.
This thesis explores the unification of various symbolic, statistical, and neural paradigms into a cohesive hybrid framework. The central hypothesis is that by combining the strengths of data-driven representation learning with structural, rule-based, and multimodal knowledge, one can achieve information extraction systems that are more accurate, efficient, and reliable than their monolithic counterparts. To test this hypothesis, the thesis investigates a range of hybrid architectures across five key application domains.
In the financial domain, a hybrid contradiction detection framework integrates syntactic pre-training with transformer-based representations and clustering algorithms to identify inconsistencies within large-scale financial reports. For named entity recognition, the iNERD algorithm introduces rule-based constraints to guide large language models, producing syntactically valid, hallucination-free entity extractions. Thereafter, the anonymisation study leverages knowledge distillation to compress the language understanding capabilities of large decoder-only models into lightweight encoder-only architectures, enabling secure and efficient text anonymisation. In relation extraction, this work presents KPI-BERT and the open-source KPI-EDGAR dataset, combining contextual embedding models with recurrent layers and noise-based regularisation to extract key performance indicators from financial documents. Extending beyond text, the final empirical contribution introduces a multimodal dementia detection framework that fuses linguistic and acoustic representations, offering a robust approach to early, non-invasive diagnosis.
Together, these studies provide compelling evidence that hybrid representation learning constitutes an important paradigm for modern information extraction. This research demonstrates that hybrid systems can achieve higher precision, stronger generalisability, and improved efficiency while remaining adaptable to real-world constraints. The findings of this thesis therefore advance the field towards more trustworthy, sustainable, and application-ready artificial intelligence.

Keywords

Machine Learning, Representation Learning, Information Extraction, Natural Language Processing, Contradiction Detection, Named Entity Recognition, Anonymisation, Relation Extraction, Dementia Detection

Classification (DDC)

004 Computer science

Related publication(s)

https://aclanthology.org/2025.coling-industry.20/
https://doi.org/10.1109/ICPR56361.2022.9956191
https://doi.org/10.1109/ICMLA55696.2022.00254
https://doi.org/10.7557/18.6799
https://doi.org/10.1109/ICMLA58977.2023.00274
https://doi.org/10.1109/BigData59044.2023.10386673
https://doi.org/10.1109/BigData62323.2024.10825603

Suggested citation
BibTeX

Deußer, Tobias Kurt Stefan: Hybrid Representation Learning for Information Extraction. - Bonn, 2026. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-87999

@phdthesis{handle:20.500.11811/14010,
urn = {urn:nbn:de:hbz:5-87999},
doi = {10.48565/bonndoc-823},
author = {Deußer, Tobias Kurt Stefan},
title = {Hybrid Representation Learning for Information Extraction},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2026,
month = mar,
url = {https://hdl.handle.net/20.500.11811/14010}
}

The following terms of use apply to this resource: