Pielka, Maren Runa Judith: Linguistically Aware and Augmentation-Driven Methods for Enhancing Natural Language Understanding. - Bonn, 2025. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-84011
@phdthesis{handle:20.500.11811/13247,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-84011},
doi = {https://doi.org/10.48565/bonndoc-611},
author = {{Maren Runa Judith Pielka}},
title = {Linguistically Aware and Augmentation-Driven Methods for Enhancing Natural Language Understanding},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2025,
month = jul,

note = {Current large language models excel at solving numerous complicated tasks, but are primarily optimized for the English language and popular application domains, delivering sub-optimal outputs for low-resource languages and specialized industry use cases. They also rely heavily on large amounts of training data and computing resources. In this work, we aim to tackle those issues by implementing smaller, less resource-intensive models that are trained in an informed way, leveraging linguistic knowledge about semantic and syntactic features.
To this end, we investigate methods for linguistically informed pre-training, providing the model with additional semantic and syntactic knowledge prior to fine-tuning on the downstream task. We specifically consider token-level prediction tasks with high semantic and syntactic relevance, such as Part-of-Speech Tagging and Synset Prediction based on semantic webs. Our experimental results show that smaller models perform on par with larger ones when pre-trained on those tasks, suggesting that this method can contribute to making language modeling more efficient.
Another direction of research is the creation of prototypical training corpora, exploiting both linguistic knowledge and the generative power of large pre-trained language models. We hypothesize that using those prototypical data sets in language model training will help reduce the total amount of data needed, while maintaining similar performance on the downstream task. This conjecture is confirmed by experimental results, underlining the goal of this thesis.
To further test our hypothesis, we evaluate the informed pre-training and data generation approaches in low-resource scenarios, namely on the tasks of Natural Language Inference and Contradiction Detection in German and Arabic. We find that language model performance in those domains can be significantly improved using the aforementioned methods. Machine Translation is also introduced as an effective method to obtain training corpora in under-researched languages.
Finally, we evaluate the effectiveness of those approaches on the basis of three real-world use cases from the financial domain. We specifically look at Causality Detection, Critical Error Detection, and Contradiction Detection in financial reports. In all three cases, our methods provide a significant performance boost, combined with insights into the nature of language in this specific domain.
Overall, this thesis contributes significantly to the field of language modeling research, exploring options to improve current paradigms for specialized scenarios and with a resource-aware objective.},

url = {https://hdl.handle.net/20.500.11811/13247}
}

The following terms of use apply to this resource:

InCopyright