Show simple item record

Linguistically Aware and Augmentation-Driven Methods for Enhancing Natural Language Understanding

dc.contributor.advisor Sifa, Rafet
dc.contributor.author Pielka, Maren Runa Judith
dc.date.accessioned 2025-07-22T05:24:51Z
dc.date.available 2025-07-22T05:24:51Z
dc.date.issued 22.07.2025
dc.identifier.uri https://hdl.handle.net/20.500.11811/13247
dc.description.abstract Current large language models excel at numerous complicated tasks, but are primarily optimized for English and popular application domains, delivering sub-optimal outputs for low-resource languages and specialized industry use cases. They also rely heavily on large amounts of training data and computing resources. In this work, we aim to address these issues by implementing smaller, less resource-intensive models that are trained in an informed way, leveraging linguistic knowledge about semantic and syntactic features.
To this end, we investigate methods for linguistically informed pre-training, providing the model with additional semantic and syntactic knowledge prior to fine-tuning on the downstream task. We specifically consider token-level prediction tasks with high semantic and syntactic relevance, such as Part-of-Speech Tagging and Synset Prediction based on semantic networks. Our experimental results show that smaller models perform on par with larger ones when pre-trained on those tasks, suggesting that this method can contribute to making language modeling more efficient.
Another direction of research is the creation of prototypical training corpora, exploiting both linguistic knowledge and the generative power of large pre-trained language models. We hypothesize that using those prototypical datasets in language model training will reduce the total amount of data needed while maintaining similar performance on the downstream task. Experimental results confirm this conjecture, underlining the goal of this thesis.
To further test our hypothesis, we evaluate the informed pre-training and data generation approaches in low-resource scenarios, namely on the tasks of Natural Language Inference and Contradiction Detection in German and Arabic. We find that language model performance in those domains can be significantly improved using the aforementioned methods. Machine Translation is also introduced as an effective method for obtaining training corpora in under-researched languages.
Finally, we evaluate the effectiveness of those approaches on the basis of three real-world use cases from the financial domain: Causality Detection, Critical Error Detection, and Contradiction Detection in financial reports. In all three cases, our methods provide a significant performance boost, along with insights into the nature of language in this specific domain.
Overall, this thesis makes a significant contribution to the field of language modeling, exploring options to improve current paradigms for specialized scenarios with a resource-aware objective.
dc.language.iso eng
dc.rights In Copyright
dc.rights.uri http://rightsstatements.org/vocab/InC/1.0/
dc.subject NLP
dc.subject LLMs
dc.subject Linguistics
dc.subject Informed ML
dc.subject Generative AI
dc.subject.ddc 004 Computer Science
dc.title Linguistically Aware and Augmentation-Driven Methods for Enhancing Natural Language Understanding
dc.type Dissertation or Habilitation
dc.identifier.doi https://doi.org/10.48565/bonndoc-611
dc.publisher.name Universitäts- und Landesbibliothek Bonn
dc.publisher.location Bonn
dc.rights.accessRights openAccess
dc.identifier.urn https://nbn-resolving.org/urn:nbn:de:hbz:5-84011
dc.relation.doi https://doi.org/10.1109/SSCI44817.2019.9003090
dc.relation.doi https://doi.org/10.1007/978-3-031-28238-6_46
dc.relation.doi https://doi.org/10.7557/18.6799
dc.relation.doi https://doi.org/10.1109/ICMLA55696.2022.00253
dc.relation.doi https://doi.org/10.1109/ICPR48806.2021.9413257
dc.relation.doi https://doi.org/10.1109/SSCI52147.2023.10371891
dc.relation.doi https://doi.org/10.1109/BigData59044.2023.10386499
dc.relation.doi https://doi.org/10.1109/BigData62323.2024.10825863
dc.relation.doi https://doi.org/10.1007/978-3-031-88714-7_12
dc.relation.doi https://doi.org/10.1007/s10579-025-09862-z
dc.relation.url https://aclanthology.org/2020.fnp-1.10/
ulbbn.pubtype First publication
ulbbnediss.affiliation.name Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location Bonn
ulbbnediss.thesis.level Dissertation
ulbbnediss.dissID 8401
ulbbnediss.date.accepted 02.07.2025
ulbbnediss.institute Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaet Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee Flek, Lucie
ulbbnediss.contributor.orcid https://orcid.org/0000-0001-9610-6026

