Machine learning and mixture clustering methods for molecular drug discovery: prediction and characterisation of drugs and druggable targets

Shafi, Shanjeeda

Title: Machine learning and mixture clustering methods for molecular drug discovery: prediction and characterisation of drugs and druggable targets
Creator: Shafi, Shanjeeda
Relation: University of Newcastle Research Higher Degree Thesis
Resource Type: thesis
Date: 2021
Description: Research Doctorate - Doctor of Philosophy (PhD)
Description: In the drug discovery process, approximately five to ten thousand compounds are initially screened, but only 1% of these enter the preclinical testing stage that determines whether a compound is safe, efficacious, and feasible to use for a disease state. Owing to regulatory, toxicity, resistance and human health concerns, demand is increasing for the refinement and intensive use of molecular physicochemical properties via effective and robust mathematical methods for drug discovery. Chemoinformatics is now a well-recognised discipline focused on searching, identifying and extracting meaningful information from the chemical sequences and structures of compounds. A candidate drug is usually a small molecule (~50 atoms) that can act on protein targets through many different mechanisms. Every year, several drugs are withdrawn from the market owing to poor pharmacodynamic and pharmacokinetic properties, which motivates this study's attempt to clarify the factors that make compounds drug-like. The drug-likeness of a molecule is characterised in part by its satisfying Lipinski’s rule-of-five (Ro5) regarding its molecular properties, such as mass and hydrophobicity, which play an important role in oral absorption, distribution, metabolism and excretion. A debate over what constitutes a good ‘hit’ has existed for some time and has now intensified in the industry. Increasing evidence suggests that relying completely on Lipinski’s Ro5 for potential drug synthesis may increase the likelihood of future drug failures. Retrospective analysis of failed drug discovery projects and incorporation of beyond-Ro5 rules may provide useful information for innovating drugs for difficult targets. There is an urgent need to develop reliable computational methods for predicting the drug-likeness of candidate molecules, in order to identify those unlikely to survive the later stages of discovery and development.
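As a rough illustration of the Ro5 check described above, the following Python sketch counts violations using the commonly quoted Lipinski thresholds (molecular mass at most 500 Da, logP at most 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors). The `Molecule` class, the function names, the "at most one violation" tolerance and the example values are assumptions made for illustration; they are not the cut points or the scoring function developed in the thesis.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    """Hypothetical container for the four classic Ro5 descriptors."""
    mol_weight: float        # molecular mass in daltons
    logp: float              # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def ro5_violations(m: Molecule) -> int:
    """Count how many of the four commonly quoted Ro5 limits are exceeded."""
    checks = [
        m.mol_weight > 500,
        m.logp > 5,
        m.h_bond_donors > 5,
        m.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_ro5(m: Molecule, max_violations: int = 1) -> bool:
    """Assumed convention: tolerate at most one violation."""
    return ro5_violations(m) <= max_violations


# Illustrative descriptor values only; not taken from the thesis dataset.
small = Molecule(mol_weight=310.0, logp=2.4, h_bond_donors=2, h_bond_acceptors=5)
large = Molecule(mol_weight=720.0, logp=6.1, h_bond_donors=3, h_bond_acceptors=9)

print(ro5_violations(small), passes_ro5(small))
print(ro5_violations(large), passes_ro5(large))
```

A molecule such as `large` exceeds both the mass and logP limits, so it would fall outside the Ro5 under this simple count even though it might still be of interest in the beyond-Ro5 space.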
Visualisation and machine learning methods are two common approaches to uncovering underlying patterns in the pharmacological property space, the so-called chemo-space, for drug design. Thus far, drug-likeness has been studied from several viewpoints, and in this thesis we use proposed druggability rules (Hudson et al. 2012, 2014, 2017) to determine cut points for each molecular predictor based on non-Bayesian mixture model-based clustering with discriminant analysis, MC/DA (MclustDA R package). We also used decision trees to choose cut-off ranges for the molecular descriptors. To date, the results of Hudson et al. (2014, 2017) have established an improved scoring function that goes beyond the cut points of the Ro5. In this thesis, mixture-based modelling tools (one Bayesian and two non-Bayesian) are applied via different R packages (Rmixmod, depmixS4 and mixAK) to identify good and poor drug candidates using a combination of 9 and 10 molecular physicochemical and structural properties and scoring functions of violations (Hudson et al. 2014, 2017). The non-Bayesian Gaussian mixture method (GMM) is shown to be optimal at correctly classifying true good and poor molecules in terms of the Ro5, oral_Ro5 drug-like (divided into two parts: oral_Ro5 drug-like status1 and oral_Ro5 drug-like status2), eRo5 (extended rule of five) and bRo5 (beyond rule of five) drug classifications, as suggested recently by Lipinski (2014, 2016) and Doak et al. (2014, 2016). In the thesis, the GMM approach with the optimal 10-descriptor model (continuous and categorical descriptors based on the following molecular parameters: MW, logP, logD, hydrogen bond donors and acceptors, polar surface area, number of atoms and rings, halogen) shows good predictive performance, with Matthews correlation coefficient (C) values in the range of 0.41–0.58. It compares favourably with other descriptor-set models using Bayesian (mixAK) and non-Bayesian (HMM) methods in terms of computational time, sensitivity, specificity and C values.
The GMM classification identified 1013 drug-like molecules, of which 4% were in bRo5 space, and 266 non-drug-like molecules, of which 38% were in bRo5 space, supporting recent trends to move outside the Ro5 region. These mixture models form the basis for identifying molecules and disease targets in the chemo-space using visualisation methods such as Principal component analysis (PCA), Factor analysis for mixed data (FAMD) and Correspondence analysis (CA). These three visualisation and data-reduction methods successfully identify a group of molecules and specific disease targets with a prescribed range of ADME properties in different quadrants of the chemo-space. This work also demonstrates that the PCA, MCA and FAMD methods can be powerful techniques for exploring complex datasets in drug discovery studies to identify outliers. It is shown that both lipophilicity descriptors, logP and logD, have a significant influence on the segregation of compounds and DCs. Two non-Bayesian mixture clustering approaches, the Gaussian mixture method (GMM via Rmixmod) and the Hidden Markov model (HMM via depmixS4), as applied in this thesis, permit the capture of the global properties of molecules with related targets. Based on these mixture approaches, this study identifies disease targets using the score function and the molecular physicochemical properties of the drugs directed at each target. All mixture clustering models identify 9 poor/non-druggable and 26 good/druggable targets, with the anti-bacterial and adrenergic targets identified as the topmost poor and good druggable targets, respectively. Furthermore, three popular machine learning (ML) methods, namely (1) recursive partitioning, (2) naïve Bayesian and (3) support vector machine (SVM) techniques, were also used to discriminate between drug-like and non-drug-like molecules based on molecular descriptors.
Among these ML techniques, the SVM model is superior across the different rule-based drug classifications, achieving a sensitivity range of 94% to 99% and a specificity range of 84% to 100%, and likewise exhibiting higher C values of 0.68 to 0.99. The results of the three mixture-based clustering and classification analyses, which use both logD and logP, offer an excellent opportunity to consider these lipophilicity descriptors in conjunction with other descriptors to help predict the permeability and solubility of active compounds in drug discovery. This study has the potential to significantly reduce the false classification of drugs and to suggest an appropriate predictor set to help identify new drug innovations.
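The sensitivity, specificity and Matthews correlation coefficient (C) values quoted in the abstract can all be derived from a two-class confusion matrix. The sketch below shows the standard formulas in Python; the function name and the example counts are hypothetical and are not results from the thesis.

```python
import math


def confusion_metrics(tp: int, fp: int, tn: int, fn: int):
    """Return (sensitivity, specificity, mcc) for a binary classifier.

    tp/fn count drug-like molecules classified correctly/incorrectly;
    tn/fp count non-drug-like molecules classified correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Illustrative counts only; not taken from the thesis.
sens, spec, mcc = confusion_metrics(tp=90, fp=20, tn=80, fn=10)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} C={mcc:.2f}")
```

Unlike sensitivity or specificity alone, C uses all four cells of the confusion matrix, which is why it is often preferred when the two classes are unbalanced, as with the 1013 drug-like and 266 non-drug-like molecules reported above.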
Subject: preclinical testing; physicochemical properties; machine learning; cut point; Bayesian and non-Bayesian; Matthews correlation coefficient; computational approach; diseases target; score function; lipophilicity; permeability; solubility; pharmacodynamic; predictor set; pharmacokinetic; ADME; druggability; Ro5; difficult target; drug-likeness; visualization
Identifier: http://hdl.handle.net/1959.13/1431097
Identifier: uon:38918
Language: eng