
Chemical Language Models for Molecular Design

dc.contributor.advisor: Bajorath, Jürgen
dc.contributor.author: Chen, Hengwei
dc.date.accessioned: 2025-08-06T07:20:50Z
dc.date.available: 2025-08-06T07:20:50Z
dc.date.issued: 06.08.2025
dc.identifier.uri: https://hdl.handle.net/20.500.11811/13310
dc.description.abstract: In drug discovery, chemical language models (CLMs) inspired by natural language processing (NLP) provide innovative solutions for molecular design. CLMs learn the vocabulary, syntax, and conditional probabilities of molecular representations, enabling various sequence-to-sequence mappings. Leveraging neural language architectures, particularly transformers with multi-head self-attention and parallel processing, CLMs effectively handle diverse sequence types, enabling efficient training and molecular translation tasks. Their versatility in machine translation and property conditioning opens new opportunities for generative molecular design. This dissertation investigates the development and application of chemical and biochemical language models (LMs) for various medicinal chemistry and drug design challenges, including activity cliff (AC) prediction, highly potent compound design, analogue series extension, and active compound generation from protein sequences. The first project focused on conditional transformers (DeepAC) for predicting ACs and designing new AC compounds. During pre-training, the models learned source-to-target compound mappings from diverse activity classes, conditioned on potency differences caused by structural modifications. Fine-tuning enabled accurate generation of target compounds satisfying potency constraints, bridging predictive modeling and compound design. A subsequent study generalized predictions beyond ACs to design highly potent compounds from weakly potent templates across unseen activity classes. A further study incorporated meta-learning, enabling effective generative design even in low-data regimes. Building on these predictive capabilities, the second project developed the DeepAS models for iterative analogue series (AS) extension in lead optimization. The initial DeepAS model predicted substituents for AS arranged by ascending potency, successfully reproducing, across various targets, AS from which the terminal (most potent) analogue had been removed. DeepAS 2.0 expanded this approach to multi-site AS extension using a BERT-based architecture, while DeepAS 3.0 integrated the structure–activity relationship matrix (SARM) formalism, enabling core modifications in AS with multiple substitution sites. The final project extended CLMs into the biochemical domain by developing a dual-component LM combining a pre-trained protein language model (PLM) with a conditional CLM. This model learned potency-conditioned mappings from protein sequence embeddings to active compounds; it consistently reproduced known compounds with varying potency across activity classes not encountered during training. Additionally, the biochemical LM generated structurally diverse candidate compounds departing from both fine-tuning and test compounds. Taken together, this thesis highlights the promising capability of CLMs to address previously challenging or infeasible prediction scenarios in molecular design, providing new opportunities for advances in medicinal chemistry and drug discovery.
dc.language.iso: eng
dc.rights: In Copyright
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/
dc.subject: chemical language models
dc.subject: molecular generative design
dc.subject: transformer
dc.subject.ddc: 004 Computer science
dc.title: Chemical Language Models for Molecular Design
dc.type: Dissertation or Habilitation
dc.publisher.name: Universitäts- und Landesbibliothek Bonn
dc.publisher.location: Bonn
dc.rights.accessRights: openAccess
dc.identifier.urn: https://nbn-resolving.org/urn:nbn:de:hbz:5-84329
dc.relation.doi: https://doi.org/10.1039/D2DD00077F
dc.relation.doi: https://doi.org/10.1038/s41598-023-34683-x
dc.relation.doi: https://doi.org/10.1038/s41598-023-43046-5
dc.relation.doi: https://doi.org/10.1039/D4MD00423J
dc.relation.doi: https://doi.org/10.1021/acs.jcim.4c01781
dc.relation.doi: https://doi.org/10.1186/s13321-024-00852-x
ulbbn.pubtype: Erstveröffentlichung (first publication)
ulbbnediss.affiliation.name: Rheinische Friedrich-Wilhelms-Universität Bonn
ulbbnediss.affiliation.location: Bonn
ulbbnediss.thesis.level: Dissertation
ulbbnediss.dissID: 8432
ulbbnediss.date.accepted: 31.07.2025
ulbbnediss.institute: Zentrale wissenschaftliche Einrichtungen : Bonn-Aachen International Center for Information Technology (b-it)
ulbbnediss.fakultaet: Mathematisch-Naturwissenschaftliche Fakultät
dc.contributor.coReferee: Vogt, Martin
ulbbnediss.contributor.orcid: https://orcid.org/0009-0008-5678-0874
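
Illustrative note: the abstract describes conditional transformers that translate a source (template) compound into a target compound while conditioning generation on a desired potency difference. The following is a minimal sketch of that general idea in PyTorch, assuming SMILES token indices and a discretized potency-difference bin whose embedding is prepended to the encoder input; the tokenization, condition scheme, and hyperparameters are illustrative assumptions and not the DeepAC implementation described in the dissertation.

# Minimal sketch (illustrative only): a potency-conditioned SMILES-to-SMILES
# transformer. Positional encodings, tokenization, and the training loop are omitted.
import torch
import torch.nn as nn

class ConditionalSeq2Seq(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_cond_bins=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)    # SMILES token embeddings
        self.cond_emb = nn.Embedding(n_cond_bins, d_model)  # potency-difference bins (assumed)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=4, num_decoder_layers=4,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_tokens, cond_bins, tgt_tokens):
        # Prepend the condition embedding to the encoded source compound, so the
        # decoder sees the desired potency change as part of the source context.
        src = self.tok_emb(src_tokens)                       # (B, S, D)
        cond = self.cond_emb(cond_bins).unsqueeze(1)         # (B, 1, D)
        src = torch.cat([cond, src], dim=1)                  # (B, S+1, D)
        tgt = self.tok_emb(tgt_tokens)                       # (B, T, D)
        # Causal mask: each target position attends only to earlier positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask)
        return self.out(hidden)                              # (B, T, vocab_size)

# Toy usage with random token indices (vocabulary size 64, batch of 2).
model = ConditionalSeq2Seq(vocab_size=64)
src = torch.randint(0, 64, (2, 30))   # source (template) compound tokens
cond = torch.randint(0, 8, (2,))      # desired potency-difference bin per pair
tgt = torch.randint(0, 64, (2, 28))   # target compound tokens (teacher forcing)
print(model(src, cond, tgt).shape)    # torch.Size([2, 28, 64])

Prepending a condition token is only one common way to implement such conditioning; other schemes, such as adding the condition embedding to every encoder state, are equally plausible.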

