Climaco, Paolo: Passive and model-agnostic sampling for training data development in machine learning regression. - Bonn, 2025. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-82121
@phdthesis{handle:20.500.11811/13027,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-82121},
author = {{Paolo Climaco}},
title = {Passive and model-agnostic sampling for training data development in machine learning regression},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2025,
month = apr,

note = {Machine learning (ML) regression has a tremendous impact on advancing scientific progress. ML regression models predict continuous label values based on input features by leveraging large labeled datasets to learn the underlying governing mechanisms. However, using large datasets may not always be feasible due to computational limitations and high data labeling costs. Such issues often arise in scientific applications, where we can typically label only a limited number of points for training due to expensive numerical simulations or laboratory experiments. The prediction quality of regression models is highly dependent on the training data. Consequently, selecting appropriate training sets is essential to ensure accurate predictions.
This work shows that we can improve the prediction performance of ML regression models by selecting suitable training sets. We focus on passive and model-agnostic sampling, that is, selection approaches that solely rely on the data feature representations, do not consider any active learning procedure, and do not assume any specific structure for the regression model. This approach promotes the reusability of the labeled samples, ensuring that labeling efforts are not wasted on subsets useful only for specific models or tasks. First, we aim to improve the robustness of the models by minimizing their maximum prediction error. We study Farthest Point Sampling (FPS), an existing selection approach that aims to minimize the fill distance of the selected set. We derive an upper bound for the maximum expected prediction error of Lipschitz continuous regression models that linearly depends on the training set fill distance. Empirically, we demonstrate that minimizing the training set fill distance by sampling with FPS, thereby minimizing our derived bound, significantly reduces the maximum prediction error and improves prediction stability in Gaussian kernel regression, outperforming alternative sampling methods.
Next, we focus on improving average prediction performance. We derive an upper bound for the expected prediction error of Lipschitz continuous models that depends linearly on a weighted fill distance of the training set. We propose Density-Aware FPS (DA-FPS), a novel data selection approach. We prove that DA-FPS provides suboptimal minimizers for a data-driven estimation of the weighted fill distance, thereby attempting to minimize our derived bound. We empirically show that using DA-FPS decreases the average absolute prediction error compared to other sampling strategies.
Our experiments focus on molecular property prediction, a crucial application for drug discovery and material design, which originally motivated our research effort. Traditional methods for computing molecular properties are slow. Using ML regression allows for quick predictions, accelerating the exploration of the chemical space and the discovery of new drugs and materials. We empirically validate our findings with four distinct regression models and various datasets.},

url = {https://hdl.handle.net/20.500.11811/13027}
}
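As an illustrative aside (not part of the record above), the greedy Farthest Point Sampling procedure discussed in the abstract can be sketched in a few lines. This is a minimal sketch under standard assumptions about FPS; the function name and the Euclidean-distance setup are choices made here for illustration, not taken from the dissertation:

```python
import numpy as np

def farthest_point_sampling(X, k, start=0):
    """Greedy FPS: repeatedly add the point farthest from the current
    selection. Each added point can only shrink (never grow) the fill
    distance, i.e. the largest distance from any point in X to its
    nearest selected point."""
    selected = [start]
    # distance of every point to its nearest selected point so far
    d = np.linalg.norm(X - X[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))  # farthest remaining point
        selected.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return selected, float(d.max())  # indices and resulting fill distance

# Example: sampling 20 points yields a fill distance no larger than
# sampling 5 points from the same cloud.
X = np.random.default_rng(0).random((200, 2))
idx5, fd5 = farthest_point_sampling(X, 5)
idx20, fd20 = farthest_point_sampling(X, 20)
```

The returned fill distance is the quantity the abstract's error bound depends on linearly; the greedy rule makes it non-increasing in the number of selected points.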

The following terms of use are associated with this resource:

InCopyright