
Passive and model-agnostic sampling for training data development in machine learning regression

dc.contributor.advisor: Garcke, Jochen
dc.contributor.author: Climaco, Paolo
dc.date.accessioned: 2025-04-28T09:20:20Z
dc.date.available: 2025-04-28T09:20:20Z
dc.date.issued: 28.04.2025
dc.identifier.uri: https://hdl.handle.net/20.500.11811/13027
dc.description.abstract: Machine learning (ML) regression plays a major role in advancing scientific progress. ML regression models predict continuous label values from input features by leveraging large labeled datasets to learn the underlying governing mechanisms. However, using large datasets is not always feasible due to computational limitations and high data labeling costs. Such issues often arise in scientific applications, where we can typically label only a limited number of points for training because of expensive numerical simulations or laboratory experiments. The prediction quality of regression models depends strongly on the training data. Consequently, selecting appropriate training sets is essential to ensure accurate predictions.
This work shows that we can improve the prediction performance of ML regression models by selecting suitable training sets. We focus on passive and model-agnostic sampling, that is, selection approaches that rely solely on the data feature representations, do not involve any active learning procedure, and do not assume any specific structure for the regression model. This promotes the reusability of the labeled samples, ensuring that labeling efforts are not wasted on subsets useful only for specific models or tasks. First, we aim to improve the robustness of the models by minimizing their maximum prediction error. We study Farthest Point Sampling (FPS), an existing selection approach that aims to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error of Lipschitz continuous regression models that depends linearly on the training set fill distance. Empirically, we demonstrate that minimizing the training set fill distance by sampling with FPS, and thereby minimizing our derived bound, significantly reduces the maximum prediction error and improves prediction stability in Gaussian kernel regression, outperforming alternative sampling methods.
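To make the fill-distance idea concrete, here is a minimal sketch (not code from the dissertation) of farthest point sampling over a finite pool of feature vectors: each step greedily adds the point farthest from the current selection, and the fill distance of the selection is the largest remaining nearest-selected-point distance. The random pool, the Euclidean metric, and the choice of start index are assumptions of this example.

    import numpy as np

    def farthest_point_sampling(pool, k, start=0):
        # Greedily pick k points: each new point is the pool point farthest
        # (in Euclidean distance) from the points selected so far.
        selected = [start]
        dist = np.linalg.norm(pool - pool[start], axis=1)  # distance to nearest selected point
        for _ in range(k - 1):
            nxt = int(np.argmax(dist))                     # farthest remaining point
            selected.append(nxt)
            dist = np.minimum(dist, np.linalg.norm(pool - pool[nxt], axis=1))
        return selected, float(dist.max())                 # indices and fill distance w.r.t. the pool

    # toy usage: 1000 random 8-dimensional feature vectors, select 50 training points
    pool = np.random.default_rng(0).normal(size=(1000, 8))
    indices, fill = farthest_point_sampling(pool, k=50)
    print(len(indices), fill)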
Next, we focus on improving average prediction performance. We derive an upper bound for the expected prediction error of Lipschitz continuous models that depends linearly on a weighted fill distance of the training set. We propose Density-Aware FPS (DA-FPS), a novel data selection approach. We prove that DA-FPS provides suboptimal minimizers for a data-driven estimation of the weighted fill distance, thereby attempting to minimize our derived bound. We empirically show that using DA-FPS decreases the average absolute prediction error compared to other sampling strategies.
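As an illustration of the weighted quantity only, the sketch below evaluates a weighted fill distance of a given training subset over a finite pool. The crude Gaussian-kernel density estimate used as the weight function and the placeholder training subset are assumptions of this sketch; they are not the data-driven weighting or the DA-FPS selection procedure developed in the dissertation.

    import numpy as np

    def weighted_fill_distance(pool, train_idx, weights):
        # max over the pool of weight(x) * distance from x to its nearest training point
        diffs = pool[:, None, :] - pool[train_idx][None, :, :]
        nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
        return float((weights * nearest).max())

    rng = np.random.default_rng(1)
    pool = rng.normal(size=(500, 4))
    # hypothetical weights: a crude Gaussian-kernel density estimate over the features
    pairwise = np.linalg.norm(pool[:, None, :] - pool[None, :, :], axis=2)
    weights = np.exp(-pairwise**2).mean(axis=1)
    train_idx = list(range(25))                            # placeholder training subset
    print(weighted_fill_distance(pool, train_idx, weights))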
Our experiments focus on molecular property prediction, a crucial application for drug discovery and material design, which originally motivated our research effort. Traditional methods for computing molecular properties are slow. Using ML regression allows for quick predictions, accelerating the exploration of the chemical space and the discovery of new drugs and materials. We empirically validate our findings with four distinct regression models and various datasets.
dc.language.iso: eng
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject.ddc: 510 Mathematics
dc.title: Passive and model-agnostic sampling for training data development in machine learning regression
dc.type: Dissertation or habilitation
dc.publisher.name: Universitäts- und Landesbibliothek Bonn
dc.publisher.location: Bonn
dc.rights.accessRights: openAccess
dc.identifier.urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-82121
ulbbn.pubtype: First publication (Erstveröffentlichung)
ulbbnediss.affiliation.name: Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location: Bonn
ulbbnediss.thesis.level: Dissertation
ulbbnediss.dissID: 8212
ulbbnediss.date.accepted: 24.03.2025
ulbbnediss.institute: Mathematisch-Naturwissenschaftliche Fakultät : Fachgruppe Mathematik / Institut für Numerische Simulation (INS)
ulbbnediss.fakultaet: Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee: Rumpf, Martin
ulbbnediss.contributor.orcid: https://orcid.org/0000-0003-1280-4930

