Show simple item record

Linguistically Aware and Augmentation-Driven Methods for Enhancing Natural Language Understanding

dc.contributor.advisor Sifa, Rafet
dc.contributor.author Pielka, Maren Runa Judith
dc.date.accessioned 2025-07-22T05:24:51Z
dc.date.available 2025-07-22T05:24:51Z
dc.date.issued 22.07.2025
dc.identifier.uri https://hdl.handle.net/20.500.11811/13247
dc.description.abstract Current large language models excel at numerous complicated tasks, but are primarily optimized for English and popular application domains, delivering sub-optimal outputs for low-resource languages and specialized industry use cases. They also rely heavily on large amounts of training data and computing resources. In this work, we aim to address these issues by implementing smaller, less resource-intensive models that are trained in an informed way, leveraging linguistic knowledge about semantic and syntactic features.
To this end, we investigate methods for linguistically informed pre-training, providing the model with additional semantic and syntactic knowledge prior to fine-tuning on the downstream task. We specifically consider token-level prediction tasks with high semantic and syntactic relevance, such as Part-of-Speech Tagging and Synset Prediction based on semantic networks. Our experimental results show that smaller models perform on par with larger ones when pre-trained on those tasks, suggesting that this method can contribute to making language modeling more efficient.
Another direction of research is the creation of prototypical training corpora, exploiting both linguistic knowledge and the generative power of large pre-trained language models. We hypothesize that using those prototypical datasets in language model training will reduce the total amount of data needed while maintaining similar performance on the downstream task. Experimental results confirm this conjecture, underlining the goal of this thesis.
To further test our hypothesis, we evaluate the informed pre-training and data generation approaches in low-resource scenarios, namely on the tasks of Natural Language Inference and Contradiction Detection in German and Arabic. We find that language model performance in those domains can be significantly improved using the aforementioned methods. Machine Translation is also introduced as an effective method for obtaining training corpora in under-researched languages.
Finally, we evaluate the effectiveness of those approaches on the basis of three real-world use cases from the financial domain: Causality Detection, Critical Error Detection, and Contradiction Detection in financial reports. In all three cases, our methods provide a significant performance boost, along with insights into the nature of language in this specific domain.
Overall, this thesis makes a significant contribution to the field of language modeling, exploring options to improve current paradigms for specialized scenarios with a resource-aware objective.
dc.language.iso eng
dc.rights In Copyright
dc.rights.uri http://rightsstatements.org/vocab/InC/1.0/
dc.subject NLP
dc.subject LLMs
dc.subject Linguistics
dc.subject Informed ML
dc.subject Generative AI
dc.subject.ddc 004 Computer Science
dc.title Linguistically Aware and Augmentation-Driven Methods for Enhancing Natural Language Understanding
dc.type Dissertation or Habilitation
dc.identifier.doi https://doi.org/10.48565/bonndoc-611
dc.publisher.name Universitäts- und Landesbibliothek Bonn
dc.publisher.location Bonn
dc.rights.accessRights openAccess
dc.identifier.urn https://nbn-resolving.org/urn:nbn:de:hbz:5-84011
dc.relation.doi https://doi.org/10.1109/SSCI44817.2019.9003090
dc.relation.doi https://doi.org/10.1007/978-3-031-28238-6_46
dc.relation.doi https://doi.org/10.7557/18.6799
dc.relation.doi https://doi.org/10.1109/ICMLA55696.2022.00253
dc.relation.doi https://doi.org/10.1109/ICPR48806.2021.9413257
dc.relation.doi https://doi.org/10.1109/SSCI52147.2023.10371891
dc.relation.doi https://doi.org/10.1109/BigData59044.2023.10386499
dc.relation.doi https://doi.org/10.1109/BigData62323.2024.10825863
dc.relation.doi https://doi.org/10.1007/978-3-031-88714-7_12
dc.relation.doi https://doi.org/10.1007/s10579-025-09862-z
dc.relation.url https://aclanthology.org/2020.fnp-1.10/
ulbbn.pubtype First publication
ulbbnediss.affiliation.name Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location Bonn
ulbbnediss.thesis.level Dissertation
ulbbnediss.dissID 8401
ulbbnediss.date.accepted 02.07.2025
ulbbnediss.institute Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Informatik / Institut für Informatik
ulbbnediss.fakultaet Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee Flek, Lucie
ulbbnediss.contributor.orcid https://orcid.org/0000-0001-9610-6026

