
Chemical Language Models for Molecular Design

dc.contributor.advisor: Bajorath, Jürgen
dc.contributor.author: Chen, Hengwei
dc.date.accessioned: 2025-08-06T07:20:50Z
dc.date.available: 2025-08-06T07:20:50Z
dc.date.issued: 06.08.2025
dc.identifier.uri: https://hdl.handle.net/20.500.11811/13310
dc.description.abstract: In drug discovery, chemical language models (CLMs) inspired by natural language processing (NLP) provide innovative solutions for molecular design. CLMs learn the vocabulary, syntax, and conditional probabilities of molecular representations, enabling various sequence-to-sequence mappings. Leveraging neural language architectures, particularly transformers with multi-head self-attention and parallel processing, CLMs effectively handle diverse sequence types, enabling efficient training and molecular translation tasks. Their versatility in machine translation and property conditioning opens new opportunities for generative molecular design. This dissertation investigates the development and application of chemical and biochemical language models (LMs) for various medicinal chemistry and drug design challenges, including activity cliff (AC) prediction, highly potent compound design, analogue series extension, and active compound generation from protein sequences. The first project focused on conditional transformers (DeepAC) for predicting ACs and designing new AC compounds. During pre-training, the models learned source-to-target compound mappings from diverse activity classes, conditioned on potency differences caused by structural modifications. Fine-tuning enabled accurate generation of target compounds satisfying potency constraints, bridging predictive modeling and compound design. A subsequent study generalized predictions beyond ACs to design highly potent compounds from weakly potent templates across unseen activity classes. A further study incorporated meta-learning, enabling effective generative design even in low-data regimes. Building on these predictive capabilities, the second project developed the DeepAS models for iterative analogue series (AS) extension in lead optimization. The initial DeepAS model predicted substituents for AS arranged by ascending potency, successfully reproducing, across various targets, AS from which the terminal (most potent) analogue had been removed. DeepAS 2.0 expanded this approach to multi-site AS extension using a BERT-based architecture, while DeepAS 3.0 integrated the structure–activity relationship matrix (SARM) formalism, enabling core modifications in AS with multiple substitution sites. The final project extended CLMs into the biochemical domain by developing a dual-component LM combining a pre-trained protein language model (PLM) with a conditional CLM. This model learned potency-conditioned mappings from protein sequence embeddings to active compounds; it consistently reproduced known compounds with varying potency across activity classes not encountered during training. Additionally, the biochemical LM generated structurally diverse candidate compounds departing from both fine-tuning and test compounds. Taken together, this thesis highlights the promising capability of CLMs to address previously challenging or infeasible prediction scenarios in molecular design, providing new opportunities for advances in medicinal chemistry and drug discovery.
dc.language.iso: eng
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: chemical language models
dc.subject: molecular generative design
dc.subject: transformer
dc.subject.ddc: 004 Computer science
dc.title: Chemical Language Models for Molecular Design
dc.type: Dissertation or Habilitation
dc.publisher.name: Universitäts- und Landesbibliothek Bonn
dc.publisher.location: Bonn
dc.rights.accessRights: openAccess
dc.identifier.urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-84329
dc.relation.doi: https://doi.org/10.1039/D2DD00077F
dc.relation.doi: https://doi.org/10.1038/s41598-023-34683-x
dc.relation.doi: https://doi.org/10.1038/s41598-023-43046-5
dc.relation.doi: https://doi.org/10.1039/D4MD00423J
dc.relation.doi: https://doi.org/10.1021/acs.jcim.4c01781
dc.relation.doi: https://doi.org/10.1186/s13321-024-00852-x
ulbbn.pubtype: Erstveröffentlichung (first publication)
ulbbnediss.affiliation.name: Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location: Bonn
ulbbnediss.thesis.level: Dissertation
ulbbnediss.dissID: 8432
ulbbnediss.date.accepted: 31.07.2025
ulbbnediss.institute: Zentrale wissenschaftliche Einrichtungen : Bonn-Aachen International Center for Information Technology (b-it)
ulbbnediss.fakultaet: Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee: Vogt, Martin
ulbbnediss.contributor.orcid: https://orcid.org/0009-0008-5678-0874
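
Illustrative note: the abstract describes conditional transformers that translate a source (template) compound into a target compound while conditioning generation on a desired potency difference. The following is a minimal sketch of that general idea in PyTorch, assuming SMILES token indices and a discretized potency-difference bin whose embedding is prepended to the encoder input; the tokenization, condition scheme, and hyperparameters are illustrative assumptions and not the DeepAC implementation described in the dissertation.

# Minimal sketch (illustrative only): a potency-conditioned SMILES-to-SMILES
# transformer. Positional encodings, tokenization, and the training loop are omitted.
import torch
import torch.nn as nn

class ConditionalSeq2Seq(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_cond_bins=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)    # SMILES token embeddings
        self.cond_emb = nn.Embedding(n_cond_bins, d_model)  # potency-difference bins (assumed)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, cond_bins, tgt_tokens):
        # Prepend the condition embedding to the encoded source compound, so the
        # decoder sees the desired potency change as part of the source context.
        src = self.tok_emb(src_tokens)                       # (B, S, D)
        cond = self.cond_emb(cond_bins).unsqueeze(1)         # (B, 1, D)
        src = torch.cat([cond, src], dim=1)                  # (B, S+1, D)
        tgt = self.tok_emb(tgt_tokens)                       # (B, T, D)
        # Causal mask: each target position attends only to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(hidden)                              # (B, T, vocab_size)

# Toy usage with random token indices (vocabulary size 64, batch of 2).
model = ConditionalSeq2Seq(vocab_size=64)
src = torch.randint(0, 64, (2, 30))   # source (template) compound tokens
cond = torch.randint(0, 8, (2,))      # desired potency-difference bin per pair
tgt = torch.randint(0, 64, (2, 28))   # target compound tokens (teacher forcing)
print(model(src, cond, tgt).shape)    # torch.Size([2, 28, 64])

Prepending a condition token is only one common way to implement such conditioning; other schemes, such as adding the condition embedding to every encoder state, are equally plausible.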

