Chen, Hengwei: Chemical Language Models for Molecular Design. - Bonn, 2025. - Dissertation, Rheinische Friedrich-Wilhelms-Universität Bonn.
Online-Ausgabe in bonndoc: https://nbn-resolving.org/urn:nbn:de:hbz:5-84329
@phdthesis{handle:20.500.11811/13310,
urn = {https://nbn-resolving.org/urn:nbn:de:hbz:5-84329},
author = {{Hengwei Chen}},
title = {Chemical Language Models for Molecular Design},
school = {Rheinische Friedrich-Wilhelms-Universität Bonn},
year = 2025,
month = aug,
note = {In drug discovery, chemical language models (CLMs) inspired by natural language processing (NLP) provide innovative solutions for molecular design. CLMs learn the vocabulary, syntax, and conditional probabilities of molecular representations, enabling various sequence-to-sequence mappings. Leveraging neural language architectures, particularly transformers with multi-head self-attention and parallel processing, CLMs effectively handle diverse sequence types, enabling efficient training and molecular translation tasks. Their versatility in machine translation and property conditioning opens new opportunities for generative molecular design. This dissertation investigates the development and application of chemical and biochemical language models (LMs) for various medicinal chemistry and drug design challenges, including activity cliff (AC) prediction, highly potent compound design, analogue series extension, and active compound generation from protein sequences. The first project focused on conditional transformers (DeepAC) for predicting ACs and designing new AC compounds. During pre-training, the models learned source-to-target compound mappings from diverse activity classes, conditioned on potency differences caused by structural modifications. Fine-tuning enabled accurate generation of target compounds satisfying potency constraints, bridging predictive modeling and compound design. The subsequent study generalized predictions beyond ACs to design highly potent compounds from weakly potent templates across unseen activity classes. A further study incorporated meta-learning, enabling effective generative design even in low-data regimes. Building on these predictive capabilities, the second project developed the DeepAS models for iterative analogue series (AS) extension in lead optimization. The initial DeepAS model predicted substituents for AS arranged by ascending potency, successfully reproducing, across various targets, AS from which the terminal (most potent) analogue had been removed. DeepAS 2.0 expanded this approach to multi-site AS extension using a BERT-based architecture, while DeepAS 3.0 integrated the structure–activity relationship matrix (SARM) formalism, enabling core modifications in AS with multiple substitution sites. The final project extended CLMs into the biochemical domain by developing a dual-component LM combining a pre-trained protein language model (PLM) with a conditional CLM. This model learned potency-conditioned mappings from protein sequence embeddings to active compounds; it consistently reproduced known compounds with varying potency across activity classes not encountered during training. Additionally, the biochemical LM generated structurally diverse candidate compounds departing from both fine-tuning and test compounds. Taken together, this thesis highlights the promising capability of CLMs to address previously challenging or infeasible prediction scenarios in molecular design, providing new opportunities for advances in medicinal chemistry and drug discovery.},
url = {https://hdl.handle.net/20.500.11811/13310}
}
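
For orientation, the potency-difference conditioning described in the abstract can be illustrated with a minimal sequence-to-sequence sketch. The following PyTorch code is a hypothetical illustration, not the dissertation's implementation: the class name ConditionalCLM, the convention of prepending a potency-bin condition token to the source SMILES, and all hyperparameters are assumptions made for this example.

# Minimal sketch (assumed, not the thesis code) of a conditional transformer
# that maps a source SMILES to a target SMILES, with the desired potency
# difference injected as a special token prepended to the source sequence.
import torch
import torch.nn as nn

class ConditionalCLM(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # learned positional embeddings, illustrative maximum length of 512
        self.pos = nn.Parameter(torch.zeros(512, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # src_tokens: (batch, src_len); by assumption, src_tokens[:, 0] is a
        # potency-bin condition token (e.g. "<dpKi_2-3>") added in preprocessing
        src = self.embed(src_tokens) + self.pos[: src_tokens.size(1)]
        tgt = self.embed(tgt_tokens) + self.pos[: tgt_tokens.size(1)]
        # causal mask so the decoder only attends to previous target tokens
        causal = self.transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(hidden)  # (batch, tgt_len, vocab_size) logits

# Usage: one teacher-forced training step on random toy tokens (shapes only;
# real training would use tokenized source/target compound pairs).
vocab_size = 64
model = ConditionalCLM(vocab_size)
src = torch.randint(0, vocab_size, (8, 40))   # condition token at position 0
tgt = torch.randint(0, vocab_size, (8, 38))
logits = model(src, tgt[:, :-1])              # predict the next target token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tgt[:, 1:].reshape(-1)
)
loss.backward()

During training, teacher forcing shifts the target by one position so the decoder learns next-token prediction; at generation time, the same condition token steers sampling toward compounds satisfying the requested potency constraint.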