Exploring and Addressing General Limitations of Compound Potency Predictions using Machine Learning

Borges Janela, Tiago

Author

Borges Janela, Tiago

ORCID

https://orcid.org/0000-0002-0782-3021

Type of thesis

Dissertation

Date of examination

4 February 2025

Date of publication

24 March 2025

First reviewer

Bajorath, Jürgen

Second reviewer

Fröhlich, Holger

Degree-granting institution

Rheinische Friedrich-Wilhelms-Universität Bonn

Citable links

Handle: https://hdl.handle.net/20.500.11811/12941
URN: https://nbn-resolving.org/urn:nbn:de:hbz:5-81116

Abstract

Compound potency prediction is a major task in computational drug discovery. Regression models based on machine learning (ML) have become popular for predicting the potency of small molecules. Recently, deep learning (DL) methods have introduced novel architectures and data representations that have also been applied to potency prediction. When a new computational approach is introduced, its performance is initially assessed in benchmark studies. Conventional benchmark calculations take potency data for compounds active against a specific target and divide them into training sets for model generation and test sets for performance assessment over several rounds of cross-validation. Under these conditions, performance differences between prediction models are often negligible and do not translate into successful prospective applications. The reasons for these small performance differences have yet to be determined. This dissertation investigates the intrinsic limitations of current benchmark settings for compound potency predictions using ML models. The first study compares the performance of traditional ML, DL, and control models under different test conditions for several compound activity classes. Next, potency predictions are extended to a wide range of activity classes using ML and control models. The impact of data composition and potency ranges on prediction accuracy is determined using different data set generation strategies. At this stage, limitations of potency prediction benchmarks are uncovered, such as the small differences between predictive ML/DL models and control models. Furthermore, ML/DL and control models are derived from original and modified training sets of increasing size. Prediction performance is then determined over several potency sub-ranges to rationalize the observed benchmark limitations.
Moreover, the impact of structural analogs on prediction models is determined using a newly designed compound pair-based evaluation scheme that monitors performance over increasing potency differences between compounds. Additionally, a novel DL method for compound potency prediction is introduced and compared to state-of-the-art ML models for the prediction of potent compounds. Finally, alternative evaluation schemes are explored, and possible future steps toward better benchmark systems for ML potency predictions are discussed. Taken together, this thesis uncovers current limitations of benchmark systems for comparing ML models and offers alternative approaches for assessing compound potency prediction performance more reliably.
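
The evaluation idea described in the abstract (comparing models against simple controls and reporting errors per potency sub-range) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not name the control models, so a median predictor and a k-nearest-neighbor predictor are used as stand-ins; the descriptor vectors, the synthetic pKi-like values, the bin edges, and all function names are hypothetical and are not taken from the thesis.

```python
import random
import statistics


def median_control(train_y):
    """Assumed control model: always predict the median training potency."""
    med = statistics.median(train_y)
    return lambda x: med


def knn_control(train_X, train_y, k=1):
    """Assumed control model: mean potency of the k nearest training compounds."""
    def predict(x):
        # Rank training compounds by squared Euclidean distance on the
        # (hypothetical) descriptor vectors.
        ranked = sorted(
            range(len(train_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
        )
        return statistics.mean([train_y[i] for i in ranked[:k]])
    return predict


def mae_by_potency_range(test_X, test_y, predict, edges):
    """Mean absolute error per potency sub-range [edges[j], edges[j+1])."""
    result = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        errors = [
            abs(predict(x) - y)
            for x, y in zip(test_X, test_y)
            if lo <= y < hi
        ]
        # None marks a sub-range without any test compounds.
        result[(lo, hi)] = statistics.mean(errors) if errors else None
    return result


def make_synthetic_set(n, seed=0):
    """Synthetic stand-in for one activity class (not real potency data)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n)]
    y = [7.0 + 0.8 * x[0] + rng.gauss(0, 0.3) for x in X]
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic_set(200)
    train_X, train_y = X[:150], y[:150]
    test_X, test_y = X[150:], y[150:]
    edges = [4.0, 6.0, 8.0, 11.0]  # illustrative potency sub-ranges
    for name, model in [
        ("median", median_control(train_y)),
        ("1-NN", knn_control(train_X, train_y, k=1)),
    ]:
        print(name, mae_by_potency_range(test_X, test_y, model, edges))
```

Reporting the error separately for each sub-range, rather than as one global value, shows where a model's accuracy comes from; any ML or DL regressor exposing a comparable `predict` callable could be passed to `mae_by_potency_range` in the same way.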

Keywords

compound potency predictions, machine learning, performance evaluation

Classification (DDC)

004 Computer science

Related publication(s)

https://doi.org/10.1038/s42256-022-00581-6
https://doi.org/10.3390/ph16040530
https://doi.org/10.1038/s41598-023-45086-3
https://doi.org/10.1021/acs.jcim.3c01530
https://doi.org/10.3390/biom13020393
https://doi.org/10.1016/j.xcrp.2024.101988

Suggested citation
BibTeX

Borges Janela, Tiago: Exploring and Addressing General Limitations of Compound Potency Predictions using Machine Learning. - Bonn, 2025. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online edition in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-81116

@phdthesis{handle:20.500.11811/12941,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-81116},
author = {Borges Janela, Tiago},
title = {Exploring and Addressing General Limitations of Compound Potency Predictions using Machine Learning},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2025,
month = mar,
url = {https://hdl.handle.net/20.500.11811/12941}
}
