FROM SIMULATIONS TO LANGUAGE MODELS: COMPUTATIONAL INNOVATIONS FOR CHEMICAL BIOLOGY AND DRUG DISCOVERY

Degree type

Doctor of Philosophy (PhD)

Graduate group

Chemistry

Discipline

Chemistry
Biochemistry, Biophysics, and Structural Biology

Subject

AI
Biophysics
Chemical biology
Machine learning
Proteins

Copyright date

2023

Abstract

The exponential growth of computational power, coupled with rapid advances in artificial intelligence and machine learning, has transformed many disciplines. This thesis addresses the challenges of applying these techniques in chemical biology and drug discovery, where scientific datasets differ from the vast, standardized datasets prevalent at major technology companies. These datasets are small, lack standardization, and carry heteroskedastic errors because they are compiled from diverse sources and experiments. Given the time and resource demands inherent in scientific research, there is a pressing need for computational strategies that make effective use of such datasets to expedite scientific discovery. This thesis contributes computational methods for several data types, including tables, graphs, and sequences. First, we pioneered a simulation-based machine learning strategy and applied it successfully in three chemical biology projects to predict the ΔΔG of mutations at protein-protein interfaces, the proteolytic resistance of thioamide-containing peptides, and the solubility of unnatural amino acid-containing proteins. Moving in a different but related direction, this thesis introduces a novel deep learning strategy for large language modeling called "hint token learning". This approach highlights the relevant information in mutant protein sequences, which can differ from the wild-type sequence by as little as a single token. Furthermore, we introduce a novel chemically informed graph downsampling technique for deep learning on chemical datasets, which improves the analysis and prediction of protein-ligand binding interactions by capturing intricate relationships within molecular structures. Collectively, these advances contribute computational strategies tailored to the unique characteristics of chemical biology and drug discovery datasets, with the ultimate aim of expediting experimental research in these areas.

Date of degree

2023
