
Passive and model-agnostic sampling for training data development in machine learning regression

dc.contributor.advisor: Garcke, Jochen
dc.contributor.author: Climaco, Paolo
dc.date.accessioned: 2025-04-28T09:20:20Z
dc.date.available: 2025-04-28T09:20:20Z
dc.date.issued: 28.04.2025
dc.identifier.uri: https://hdl.handle.net/20.500.11811/13027
dc.description.abstract: Machine learning (ML) regression plays a major role in advancing scientific progress. ML regression models predict continuous label values from input features by leveraging large labeled datasets to learn the underlying governing mechanisms. However, using large datasets is not always feasible due to computational limitations and high data labeling costs. Such issues often arise in scientific applications, where we can typically label only a limited number of points for training because of expensive numerical simulations or laboratory experiments. The prediction quality of regression models depends strongly on the training data. Consequently, selecting appropriate training sets is essential to ensure accurate predictions.
This work shows that we can improve the prediction performance of ML regression models by selecting suitable training sets. We focus on passive and model-agnostic sampling, that is, selection approaches that rely solely on the data feature representations, do not involve any active learning procedure, and do not assume any specific structure for the regression model. This promotes the reusability of the labeled samples, ensuring that labeling efforts are not wasted on subsets useful only for specific models or tasks. First, we aim to improve the robustness of the models by minimizing their maximum prediction error. We study Farthest Point Sampling (FPS), an existing selection approach that aims to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error of Lipschitz continuous regression models that depends linearly on the training set fill distance. Empirically, we demonstrate that minimizing the training set fill distance by sampling with FPS, and thereby minimizing our derived bound, significantly reduces the maximum prediction error and improves prediction stability in Gaussian kernel regression, outperforming alternative sampling methods.
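To make the fill-distance idea concrete, here is a minimal sketch (not code from the dissertation) of farthest point sampling over a finite pool of feature vectors: each step greedily adds the point farthest from the current selection, and the fill distance of the selection is the largest remaining nearest-selected-point distance. The random pool, the Euclidean metric, and the choice of start index are assumptions of this example.

    import numpy as np

    def farthest_point_sampling(pool, k, start=0):
        # Greedily pick k points: each new point is the pool point farthest
        # (in Euclidean distance) from the points selected so far.
        selected = [start]
        dist = np.linalg.norm(pool - pool[start], axis=1)  # distance to nearest selected point
        for _ in range(k - 1):
            nxt = int(np.argmax(dist))                     # farthest remaining point
            selected.append(nxt)
            dist = np.minimum(dist, np.linalg.norm(pool - pool[nxt], axis=1))
        return selected, float(dist.max())                 # indices and fill distance w.r.t. the pool

    # toy usage: 1000 random 8-dimensional feature vectors, select 50 training points
    pool = np.random.default_rng(0).normal(size=(1000, 8))
    indices, fill = farthest_point_sampling(pool, k=50)
    print(len(indices), fill)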
Next, we focus on improving average prediction performance. We derive an upper bound for the expected prediction error of Lipschitz continuous models that depends linearly on a weighted fill distance of the training set. We propose Density-Aware FPS (DA-FPS), a novel data selection approach. We prove that DA-FPS provides suboptimal minimizers for a data-driven estimation of the weighted fill distance, thereby attempting to minimize our derived bound. We empirically show that using DA-FPS decreases the average absolute prediction error compared to other sampling strategies.
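As an illustration of the weighted quantity only, the sketch below evaluates a weighted fill distance of a given training subset over a finite pool. The crude Gaussian-kernel density estimate used as the weight function and the placeholder training subset are assumptions of this sketch; they are not the data-driven weighting or the DA-FPS selection procedure developed in the dissertation.

    import numpy as np

    def weighted_fill_distance(pool, train_idx, weights):
        # max over the pool of weight(x) * distance from x to its nearest training point
        diffs = pool[:, None, :] - pool[train_idx][None, :, :]
        nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
        return float((weights * nearest).max())

    rng = np.random.default_rng(1)
    pool = rng.normal(size=(500, 4))
    # hypothetical weights: a crude Gaussian-kernel density estimate over the features
    pairwise = np.linalg.norm(pool[:, None, :] - pool[None, :, :], axis=2)
    weights = np.exp(-pairwise**2).mean(axis=1)
    train_idx = list(range(25))                            # placeholder training subset
    print(weighted_fill_distance(pool, train_idx, weights))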
Our experiments focus on molecular property prediction, a crucial application for drug discovery and material design, which originally motivated our research effort. Traditional methods for computing molecular properties are slow. Using ML regression allows for quick predictions, accelerating the exploration of the chemical space and the discovery of new drugs and materials. We empirically validate our findings with four distinct regression models and various datasets.
dc.language.iso: eng
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject.ddc: 510 Mathematics
dc.title: Passive and model-agnostic sampling for training data development in machine learning regression
dc.type: Dissertation or habilitation
dc.publisher.name: Universitäts- und Landesbibliothek Bonn
dc.publisher.location: Bonn
dc.rights.accessRights: openAccess
dc.identifier.urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-82121
ulbbn.pubtype: First publication (Erstveröffentlichung)
ulbbnediss.affiliation.name: Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location: Bonn
ulbbnediss.thesis.level: Dissertation
ulbbnediss.dissID: 8212
ulbbnediss.date.accepted: 24.03.2025
ulbbnediss.institute: Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Mathematik / Institut für Numerische Simulation (INS)
ulbbnediss.fakultaet: Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee: Rumpf, Martin
ulbbnediss.contributor.orcid: https://orcid.org/0000-0003-1280-4930

