
Machine Learning Algorithms for QSPR/QSAR Predictive Model Development Involving High-Dimensional Data

Chemical Engineering

With advances in computational chemistry, computer-aided molecular design, and chemoinformatics, the scientific community now has access to a very large set of molecular descriptors. The advantage of this large set is that computational modelers can capture different characteristics of molecules of varying sizes in different solvent or reaction media. The drawback is that, during model development, the number of descriptors can exceed the number of instances in a dataset. Such datasets are known as high-dimensional data matrices. This is especially the case when data generation is complex, time-consuming, or resource-intensive. It can also happen when a product must be developed for a very specific use (e.g., drugs for a specific physical condition, polymers with a specific property, or reactions in a specific environment). These cases tend to be very condition-specific, e.g., in the type of chemical species, the activities or responses in a specific environment, temperature, or pressure. The challenges of modeling such cases include, but are not limited to, the difficulty of generating a generalizable model, large model uncertainty, and overfitting of the generated models. To address these drawbacks and challenges, in this work we developed hybrid algorithms that are efficient and can generate generalizable models. These algorithms overcome the disadvantage of traditional modeling techniques, which break down when the number of descriptors exceeds the sample size. The developed algorithms can be incorporated into software platforms for the automated design of product-centric industrial processes. Such software should be capable of analyzing experimental data and generating the best possible molecular structure for the given constraints and objectives, and it must be both fast and accurate.
In the past, such situations were tackled with ab initio calculations, later replaced by calculations based on density functional theory (DFT). Apart from being computationally expensive, such methods require manual handling of data for molecular design operations. To address these limitations, molecular descriptors (0D-7D) became attractive alternatives. However, the complexity of calculating descriptors increases with the complexity of the molecular structure. Two-dimensional (2D) descriptors, such as connectivity index descriptors, have proven efficient in generating models with significant accuracy, and the design calculation steps are not computationally expensive. For these reasons, the models generated in this work are based on 2D molecular descriptors. Two condition-specific situations are discussed. Case 1 relates reactant and solvent structures to the reaction rate constants of Diels-Alder reactions. Because reaction rates tend to depend on inter-atom connectivity, connectivity index descriptors were used to develop this model. A hybrid GA-DT (Genetic Algorithm-Decision Tree) algorithm was developed for feature selection and model development. This case is unique in that it involves three different chemical species in a single predictive model, which makes it a challenge for both traditional and newly developed hybrid algorithms. Further improvements to the model were proposed using the Multi-Gene Genetic Programming (MGGP) algorithm to derive non-linear models. Case 2 develops a model relating the structures of 9-anilinoacridine derivatives to their DNA-drug binding affinity values. Although this case involves only one group of chemical species, challenges emerge when two or more models with similar metrics are generated.
Although the genetic algorithm was initially used for feature selection, a novel adaptive version of the LASSO (Least Absolute Shrinkage and Selection Operator) algorithm was developed. This adaptive correlation-based LASSO (CorrLASSO) was used to perform regression and shrinkage calculations. To evaluate model fitness, R² and Q² values were calculated, which represent internal and external model validation, respectively. For the second case, the mean square error (MSE) was also calculated to compare the performance of the LASSO and CorrLASSO algorithms.

http://hdl.handle.net/10415/6573

Methodology Article
Open access
Published: 26 October 2019

Comprehensive ensemble in QSAR prediction for drug discovery

Sunyoung Kwon, Jeonghee Jo & Sungroh Yoon (ORCID: orcid.org/0000-0002-2367-197X)

BMC Bioinformatics, volume 20, Article number: 521 (2019)

Quantitative structure-activity relationship (QSAR) is a computational modeling method for revealing relationships between structural properties of chemical compounds and biological activities. QSAR modeling is essential for drug discovery, but it has many constraints. Ensemble-based machine learning approaches have been used to overcome these constraints and obtain reliable predictions. Ensemble learning builds a set of diversified models and combines them. However, the most prevalent approach, random forest, and other ensemble approaches in QSAR prediction limit their model diversity to a single subject.

The proposed ensemble method consistently outperformed thirteen individual models on 19 bioassay datasets and demonstrated superiority over other ensemble approaches that are limited to a single subject. The comprehensive ensemble method is publicly available at http://data.snu.ac.kr/QSAR/ .

Conclusions

We propose a comprehensive ensemble method that builds multi-subject diversified models and combines them through second-level meta-learning. In addition, we propose an end-to-end neural network-based individual classifier that can automatically extract sequential features from a simplified molecular-input line-entry system (SMILES). The proposed individual model did not show impressive results as a single model, but it was considered the most important predictor when combined, according to the interpretation of the meta-learning.

Quantitative structure-activity relationship (QSAR) is a computational or mathematical modeling method to reveal relationships between biological activities and the structural properties of chemical compounds. The underlying principle is that variations in structural properties cause different biological activities [ 1 ]. Structural properties refer to physico-chemical properties, and biological activities correspond to pharmacokinetic properties such as absorption, distribution, metabolism, excretion, and toxicity.

QSAR modeling helps prioritize a large number of chemicals in terms of their desired biological activities as an in silico methodology and, as a result, significantly reduces the number of candidate chemicals to be tested with in vivo experiments. QSAR modeling has served as an inevitable process in the pharmaceutical industry, but many constraints are involved [ 2 , 3 ]. QSAR data may involve a very large number of chemicals (more than hundreds of thousands); each chemical can be represented by a variety of descriptors; commonly used fingerprints are very sparse (most of the values are zero), and some features are highly correlated; it is assumed that the dataset contains some errors because relationships are assessed through in situ experiments.

Due to these constraints, it has become difficult for QSAR-based model prediction to achieve a reliable prediction score. Consequently, machine learning approaches have been applied to QSAR prediction. Linear regression models [4] and Bayesian neural networks [5–7] have been used for QSAR prediction. Random forest (RF) [8, 9] is the most commonly used algorithm, with a high level of predictability, simplicity, and robustness. RF is an ensemble method based on multiple decision trees that can prevent the overfitting of a single decision tree. RF is considered to be the gold standard in this field [2]; thus, newly proposed QSAR prediction methods often have their performance compared to RF.

The Merck Kaggle competition in 2012 drew attention to neural networks. The winning team used multi-task neural networks (MTNNs) [10]. The fundamental learning structure is based on plain feed-forward neural networks; it avoids overfitting by learning multiple bioassays simultaneously. The team obtained results that consistently outperformed RF. Despite achieving high performance using a multi-task neural network, the team ultimately used an ensemble that combined different methods.

Both RF and the aforementioned technique from the Kaggle competition used ensemble learning, a technique which builds a set of learning models and combines multiple models to produce final predictions. Theoretically and empirically, it has been shown that the predictive power of ensemble learning surpasses that of a single individual learner if the individual algorithms are accurate and diverse [ 11 – 14 ]. Ensemble learning manages the strengths and weaknesses of individual learners, similar to how people consider diverse opinions when faced with critical issues.

Ensemble methods have been extensively used in drug (chemical) research. Examples include a neural network ensemble based on bootstrap sampling in QSAR (data sampling ensemble) [15]; an ensemble of different learning methods for drug-drug interaction [16]; a Bayesian ensemble model with different QSAR tools (method ensemble) [7]; ensemble learning-based qualitative and quantitative SAR models [17]; a hybrid QSAR prediction model with various learning methods [18]; ensembles with different boosting methods [19]; hybridized feature selection and feature learning in QSAR modeling [20]; and an ensemble over diverse chemicals for carcinogenicity prediction (representation ensembles) [21]. However, these ensemble approaches limit model diversity to a single subject, such as data sampling, method, or input representation (drug-specific).

To overcome this limitation, we propose a multi-subject comprehensive ensemble with a new type of individual classifier based on 1D-CNNs and RNNs. The detailed key characteristics and contributions of our proposed methods are as follows:

Instead of limiting ensemble diversity to a single subject, we combine multi-subject individual models comprehensively. This ensemble is used for combinations of bagging, methods, and chemical compound input representations.

We propose a new type of individual QSAR classifier that is an end-to-end neural network model based on one-dimensional convolutional neural networks (1D-CNNs) and recurrent neural networks (RNNs). It automatically extracts sequential features from a simplified molecular-input line-entry system (SMILES).

We combine a set of models using second-level combined learning (meta-learning) and provide an interpretation regarding the importance of individual models through their learned weights.

To validate our proposed method, we tested 19 bioassays specified in [ 10 ]. In our experiments, we confirmed the superiority of our proposed method by comparing individual models, limited ensemble approaches, and other combining techniques. Further, we identified the importance of the proposed end-to-end individual classifier through an interpretation of second-level meta-learning.

Experimental setup

A bioassay is a biochemical test to determine or estimate the potency of a chemical compound on targets and has been used for a variety of purposes, including drug development, and environmental impact analysis. In our experiment, we used 19 bioassays downloaded from the PubChem open chemistry database [ 22 ], which are listed in Table 1 . All bioassays are those specified in [ 10 ]. The purpose of the paper was to address multi-task effects; thus, a number of experimental assays are closely related, such as the 1851, 46321*, 48891*, and 6517** series.

From each bioassay, we extracted a PubChem chemical ID and activity outcome (active or inactive). We only used duplicate chemicals once, and we excluded inconsistent chemicals that had both active and inactive outcomes. A class imbalance ratio between active and inactive ranged from 1:1.1 to 1:4.2 depending on the dataset; most bioassays are imbalanced, with an average ratio of 1:2.
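The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `(chemical_id, outcome)` record layout and the function names `clean_bioassay` and `imbalance_ratio` are assumptions made for the example.

```python
from collections import defaultdict


def clean_bioassay(records):
    """Keep one entry per chemical ID and drop IDs with conflicting outcomes.

    `records` is a list of (chemical_id, outcome) pairs, where outcome is
    "active" or "inactive". This record layout is hypothetical.
    """
    outcomes = defaultdict(set)
    for cid, outcome in records:
        outcomes[cid].add(outcome)
    # An ID seen with both labels is inconsistent and is excluded entirely;
    # an ID seen several times with the same label is kept once.
    return {cid: next(iter(seen)) for cid, seen in outcomes.items() if len(seen) == 1}


def imbalance_ratio(labels):
    """Return x in the active:inactive ratio 1:x."""
    active = sum(1 for v in labels.values() if v == "active")
    inactive = sum(1 for v in labels.values() if v == "inactive")
    return inactive / active


# Toy records: ID 1 is duplicated, ID 3 is inconsistent.
records = [(1, "active"), (1, "active"), (2, "inactive"), (3, "active"),
           (3, "inactive"), (4, "inactive"), (5, "inactive")]
cleaned = clean_bioassay(records)
ratio = imbalance_ratio(cleaned)
```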

Representation of chemical compounds

In our experiment, we used three types of molecular fingerprints, PubChem [22], ECFP [23], and MACCS [24], together with string-type SMILES [25]. Because SMILES is a sequential string-type descriptor, it is not a suitable form for conventional learning methods. We therefore used an end-to-end 1D-CNN and RNN, which are capable of handling sequential forms. In contrast, a binary vector-type fingerprint consists of 1’s and 0’s in a non-sequential form. Thus, conventional machine learning approaches, such as a plain feed-forward neural network, are used for fingerprints.

The SMILES and PubChem fingerprint were retrieved from the preprocessed chemical IDs using PubChemPy [ 26 ], and ECFP and MACCS fingerprints were retrieved from SMILES using RDKit [ 27 ].

Experimental configuration and environment

We followed the same experimental settings and performance measures as described for the multi-task neural network [ 10 ]. We randomly divided the dataset into two parts: 75% of the dataset was used as a training set, and the other 25% was used as a testing set. The training dataset was also randomly partitioned into five portions: one for validation, and the remaining four for training (5-fold cross-validation). The prediction probabilities from the 5-fold validations were concatenated as P , and were then used as inputs for the second-level learning.
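The split-and-concatenate procedure described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the helper names, the fixed seed, and the placeholder learner `prior_model` (which just predicts the active fraction of its fit set) are not from the paper; any first-level model with the same call shape could be substituted.

```python
import random


def train_test_split_indices(n, test_fraction=0.25, seed=0):
    """Shuffle indices 0..n-1 and split them into train and test lists."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (fit_idx, val_idx) pairs; each index lands in exactly one val fold."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit_idx, folds[i]


def out_of_fold_predictions(fit_predict, X, y, train_idx, k=5):
    """Collect the validation-fold prediction for every training sample."""
    P = {}
    for fit_idx, val_idx in k_fold(train_idx, k):
        probs = fit_predict([X[i] for i in fit_idx], [y[i] for i in fit_idx],
                            [X[i] for i in val_idx])
        P.update(zip(val_idx, probs))
    # Concatenate in training order so P lines up with the training labels.
    return [P[i] for i in train_idx]


def prior_model(X_fit, y_fit, X_val):
    """Placeholder learner: predicts the active fraction of the fit set."""
    rate = sum(y_fit) / len(y_fit)
    return [rate] * len(X_val)


X = [[i] for i in range(40)]
y = [i % 2 for i in range(40)]
train_idx, test_idx = train_test_split_indices(len(X))
P = out_of_fold_predictions(prior_model, X, y, train_idx)
```

Because every training sample is predicted by a model that never saw it during fitting, the concatenated `P` can serve as second-level input without leaking training labels.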

We ran our experiments on Ubuntu 14.04 (3.5GHz Intel i7-5930K CPU and GTX Titan X Maxwell(12GB) GPU). We used the Keras library package (version 2.0.6) for neural network implementation, the Scikit-learn library package (version 0.18) for conventional machine learning methods, and PubChemPy (version 1.0.3) and RDKit (version 1.0.3) for input representation preparation of the chemical compounds.

Performance comparison with other approaches

Performance comparison with individual models.

We compared our comprehensive ensemble method with 13 individual models: the 12 models from the combination of three types of fingerprints (PubChem, ECFP, and MACCS) and four types of learning methods (RF, SVM, GBM, and NN), and a SMILES-NN combination.

As shown in Table 2 , the comprehensive ensemble showed the best performance across all datasets, followed by ECFP-RF and PubChem-RF. We can see that the top-3 AUCs (represented in bold) are dispersed across the chemical compound representations and learning methods, except for PubChem-SVM, ECFP-GBM, and MACCS-SVM. The individual SMILES-NN models were within the top-3 ranks of the three datasets. In terms of learning methodology, RF showed the highest number of top-3 AUC values followed by NN, GBM, and SVM. In terms of chemical compound representation, ECFP showed the highest number of top-3 AUC values followed by PubChem, SMILES (compared proportionally), and MACCS. In terms of the averaged AUC, the comprehensive ensemble showed the best performance (0.814), followed by ECFP-RF (0.798) and PubChem-RF (0.794). The MACCS-SVM combination showed the lowest AUC value (0.736). Aside from the best (proposed ensemble) and the worst (MACCS-SVM) methods, all average AUC values were less than 0.80. Predictability depends on the combination of learning method and input representation. Although SVM showed better performance than GBM in ECFP, GBM showed better performance than SVM in MACCS.
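For readers unfamiliar with the AUC values compared here, the metric can be computed as the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one. The sketch below uses that pairwise definition; the function name `roc_auc` and the toy scores are illustrative assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: two actives, three inactives.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
```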

Statistical analysis with paired t-tests was performed to evaluate differences between the means of paired outcomes. The AUC scores of the comprehensive ensemble were compared with the top-scored AUC from the individual classifiers in each dataset, using the 5-fold cross-validation results. Assuming that two output scores \(y_{1}\) and \(y_{2}\) follow normal distributions, the difference between these two scores should also follow a normal distribution. Under the null hypothesis of no difference between the means of the two output scores, the difference \(d = y_{1} - y_{2}\) has mean 0 and variance \(\sigma ^{2}_{d}\). The comprehensive ensemble achieved an AUC score exceeding the top-scored AUC from an individual classifier in 16 out of 19 PubChem bioassays, as shown in Table 3. Let \(\bar {d}\), \(s_{d}\), and \(n\) denote the mean difference, the standard deviation of the differences, and the number of samples, respectively. The results are significant at a p-value of \(8.2\times 10^{-7}\), where the t value is calculated by \(t_{d} = \frac {\bar {d}} {\frac {s_{d}}{\sqrt {n}}} \sim t_{n-1}.\)
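The paired t statistic defined above can be computed directly from two lists of matched scores. The sketch below uses only the standard library; the function name and the AUC values are made up for illustration and are not the paper's results.

```python
import math
import statistics


def paired_t_statistic(y1, y2):
    """Return (t, degrees_of_freedom) for paired samples y1 and y2."""
    d = [a - b for a, b in zip(y1, y2)]
    n = len(d)
    d_bar = statistics.mean(d)
    s_d = statistics.stdev(d)  # sample standard deviation (n - 1 denominator)
    return d_bar / (s_d / math.sqrt(n)), n - 1


# Illustrative paired AUC scores (hypothetical values).
ensemble_auc = [0.82, 0.79, 0.85, 0.80, 0.83]
best_individual_auc = [0.80, 0.78, 0.82, 0.80, 0.81]
t, dof = paired_t_statistic(ensemble_auc, best_individual_auc)
```

The p-value then follows from the t distribution with `dof` degrees of freedom, for example via a statistics table or a library routine.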

Performance comparison with other ensemble approaches

In addition to a comparison with individual models, we compared the proposed ensemble method with other ensemble approaches based on the ensemble subject and combining technique, as shown in Table 4 .

The first three columns show the method ensembles, which combine predictions from RF, SVM, GBM, and NN for a fixed chemical representation. The ensembles based on PubChem, ECFP, and MACCS showed AUC values of 0.793, 0.796, and 0.784, which are 0.016, 0.015, and 0.018 higher than the average AUC value for the four individual methods based on those representations, respectively. The next five columns show the representation ensembles, which combine the PubChem, ECFP, and MACCS molecular representations for a fixed learning method. As with the method ensembles, the representation ensembles outperformed the average results from the individual representation models based on their learning methods. In particular, the NN-based individual models showed lower AUC values than the RF-based models, but the NN-based combined representation ensemble showed a higher AUC value than the RF-based ensemble.

Bagging is an easy-to-develop and powerful technique for class imbalance problems [28]. Figure 1a shows the effectiveness of bagging by comparing a plain neural network (NN) with a bootstrap aggregated neural network (NN-bagging) and a neural network-based representation ensemble (NN-representation ensemble). As shown in Fig. 1a, bagging improved the AUC in both ensemble techniques. As shown in Fig. 1b, the AUC improvement from bagging was correlated with the imbalance ratio of the dataset (Pearson’s r = 0.69, p-value = \(1.1\times 10^{-3}\)). The results showed greater improvement with a higher imbalance ratio.

Fig. 1 Ensemble effects on class-imbalanced datasets. a Improved average AUC value produced by neural network bagging (NN-bagging) and neural network-based representation ensemble (NN-representation ensemble) over three fingerprints. b Pearson’s correlation (r = 0.69, p-value = \(1.1\times 10^{-3}\)) between the improved AUC values from NN-bagging and the class imbalance ratio. The class imbalance ratio was calculated from the number of active and inactive chemicals, as shown in Table 1

The proposed multi-subject comprehensive ensemble combines all models regardless of learning method or representation: 12 models consisting of the unique combinations of representations (PubChem, ECFP, and MACCS) and learning methods (RF, SVM, GBM, and NN) and the newly proposed SMILES-NN model. All ensembles except for the last column combined the various models by uniform averaging. The comprehensive ensemble outperformed all limited ensemble approaches based on average combining.

In terms of the combination technique, we compared simple uniform averaging with the proposed meta-learning technique, both applied to the comprehensive ensemble. The results of the comprehensive ensemble from Table 2 are presented in the second-to-last column of Table 4. The last column in Table 4 shows the performance comparison between meta-learning and the comprehensive ensemble. The multi-task neural networks [10] achieved state-of-the-art performance on 19 PubChem bioassays, with performance measured by the AUC. As shown in Table 5, our approach outperformed multi-task learning in 13 out of 19 PubChem bioassays. From the “Convolutional and recurrent neural networks” section, this result was statistically significant at a p-value of \(3.9\times 10^{-8}\) in 13 out of 19 datasets and resulted in a higher mean AUC value for the meta-learning network than for the multi-task network.

Performance comparison on another dataset

The Drug Therapeutics Program (DTP) AIDS Antiviral Screen developed an HIV dataset for over 40,000 compounds. These results are categorized into three groups: confirmed inactive (CI), confirmed active (CA), and confirmed moderately active (CM). Following previous research [29], we combined the latter two labels (CA and CM), resulting in a binary classification task that discriminates between inactive and active compounds.

We evaluated our meta-learning neural network on the HIV dataset following identical experimental settings as described in MoleculeNet [29]. The HIV dataset was divided by scaffold-based splitting into training, validation, and test sets at a ratio of 80:10:10. Scaffold-based splitting separates structurally different molecules into different subgroups [29]. For the performance metrics, we used AU-ROC, accuracy, Matthews correlation coefficient (MCC), and F1-score. Accuracy, MCC, and F1-score were defined as follows:

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

\[\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\]

\[\text{F1-score} = \frac{2\,TP}{2\,TP + FP + FN}\]

where TP, FP, FN, and TN represent the number of true positives, false positives, false negatives, and true negatives, respectively. Table 6 shows the results for the comparison between multi-task [10] and meta-learning on the various performance metrics. For meta-learning, we applied our neural networks described in Section 2.3.4 to the multi-task neural network. We repeated the experiments 100 times and calculated the mean test score. In terms of AU-ROC, both neural networks performed similarly; however, meta-learning outperformed multi-task learning in the other metrics.
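The three count-based metrics can be computed from a confusion matrix as in the sketch below, which uses their standard definitions. The function name and the example counts are illustrative assumptions; returning 0.0 when a denominator is zero is a common convention, not something the paper specifies.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, MCC, F1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report 0.0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return accuracy, mcc, f1


# Hypothetical counts for an imbalanced test set of 200 compounds.
accuracy, mcc, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=130)
```

On imbalanced data such as the bioassays here, MCC and F1 are less flattered by the majority class than accuracy is, which is one reason to report them alongside AU-ROC.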

Meta-learning and interpretation of model importance

We made a final decision through meta-learning using the predictions from independent first-level models as input. Any learning algorithm could be used as a meta-learner. We used SVM, which achieved the highest average AUC value in further experiments compared with NN, RF, GBM, and ordinary regression.

We interpreted the importance of the models through their learned weights. In the process of meta-learning, a weight is assigned to each model, and this weight can be interpreted as the model importance. As shown in Fig. 2, the degree of darkness for each method differs slightly depending on the dataset, just as the best prediction method and representation depend on the dataset (Table 2). A darker color indicates a higher weight and importance. PubChem-SVM, ECFP-GBM, and MACCS-SVM showed low importance, while SMILES-NN and ECFP-RF showed high importance across the datasets. The SMILES-NN model did not perform especially well as an individual model, but it was regarded as the most important model.

Fig. 2 Interpretation of model importance through meta-learning. Weights through meta-learning were used to interpret model importance. Darker green indicates a highly weighted and significant model, while lighter yellow indicates a less weighted and less significant model

Ensemble learning can improve predictability, but it requires a set of diversified hypotheses; bagging requires a set of randomly sampled datasets, a method ensemble needs to exploit diverse learning methods, and a representation ensemble needs to prepare diversified input representations. A comprehensive ensemble requires diversified datasets, methods, and representations across multi-subjects; thus, it has difficulties in preparation and learning efficiency for these hypotheses.

Diversity is a crucial condition for ensemble learning. RF was superior to NN among the individual models, but NN outperformed RF in the representation ensemble. This is presumably due to model variation diversities caused by random initialization and random dropout of the neural network. In addition to model variation diversity, SMILES seems to contribute to ensemble representation diversity. The SMILES-based model did not show impressive results as an individual model, but it was considered the most important predictor when combined.

The proposed comprehensive ensemble exploits diversities across multi-subjects and exhibits improved predictability compared to the individual models. In particular, the neural network and SMILES contribute to diversity and are considered important factors when combined. However, the proposed ensemble approach has difficulties associated with these diversities.

We proposed a multi-subject comprehensive ensemble due to the difficulties and importance of QSAR problems. In our experiments, the proposed ensemble method consistently outperformed all individual models, and it exhibited superiority over limited subject ensemble approaches and uniform averaging. As part of our future work, we will focus on analyzing as few hypotheses as possible or combinations of hypotheses while maintaining the ensemble effect.

Ensemble learning

Ensemble learning builds a set of diversified models and combines them. Theoretically and empirically, numerous studies have demonstrated that ensemble learning usually yields higher accuracy than individual models [ 11 , 12 , 30 – 32 ]; a collection of weak models (inducers) can be combined to produce a single strong ensemble model.

Ensemble learning can be divided into independent and dependent frameworks for building ensembles [ 33 ]. In the independent framework, also called the randomization-based approach, individual inducers can be trained independently in parallel. On the other hand, in the dependent framework (also called the boosting-based approach), base inducers are affected sequentially by previous inducers. In terms of individual learning, we used both independent and dependent frameworks, e.g. , RF and gradient boosting, respectively. In terms of combining learning, we treated the individual inducers independently.

Diversity is well known as a crucial condition for ensemble learning [ 34 , 35 ]. Diversity leads to uncorrelated inducers, which in turn improves the final prediction performance [ 36 ]. In this paper, we focus on the following three types of diversity.

Dataset diversity

The original dataset can be diversified by sampling. Random sampling with replacement (bootstrapping) from an original dataset can generate multiple datasets with different levels of variation. If the original and bootstrap datasets are the same size (\(n\)), each bootstrap dataset is expected to contain a fraction \(1-\frac {1}{e}\) (≈63.2% for large \(n\)) of the unique samples in the original data, with the remainder being duplicates. Dataset variation results in different predictions, even with the same algorithm, which produces homogeneous base inducers. Bagging (bootstrap aggregating) belongs to this category and is known to help with unstable learners or relatively large variance-error factors [37].
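The expected share of unique samples in a bootstrap dataset can be checked empirically. The short simulation below is an illustrative sketch; the function names, the dataset size, the number of trials, and the seed are arbitrary choices for the example.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n indices from 0..n-1 with replacement."""
    return [rng.randrange(n) for _ in range(n)]


def unique_fraction(n, trials=200, seed=0):
    """Average fraction of distinct original samples per bootstrap dataset."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += len(set(bootstrap_indices(n, rng))) / n
    return total / trials


# For n = 1000 the average should sit close to 1 - 1/e.
frac = unique_fraction(1000)
```

The samples left out of each bootstrap draw differ from draw to draw, which is what makes the resulting base inducers disagree even though they share one learning algorithm.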

Learning method diversity

Diverse learning algorithms that produce heterogeneous inducers yield different predictions for the same problem. Combining the predictions from heterogeneous inducers leads to improved performance that is difficult to achieve with a single inducer. Ensemble combining of diverse methods is prevalently used as a final technique in competitions, such as that presented in [10]. We attempted to combine popular learning methods, including random forest (RF) [8, 38], support vector machine (SVM) [39], gradient boosting machine (GBM) [40], and neural network (NN).

Input representation diversity

Drugs (chemical compounds) can be expressed with diverse representations. The diversified input representations produce different types of input features and lead to different predictions. Ref. [21] demonstrated improved performance by applying ensemble learning to a diverse set of molecular fingerprints. We used diverse representations from PubChem [22], ECFP [23], and MACCS [24] fingerprints and from a simplified molecular-input line-entry system (SMILES) [25].

Combining a set of models

For the final decision, ensemble learning should combine predictions from multiple inducers. There are two main combination methods: weighting (non-learning) and meta-learning. Weighting methods, such as majority voting and averaging, have been frequently used for their convenience and are useful for homogeneous inducers. Meta-learning methods, such as stacking [41], are learning-based methods (second-level learning) that use predictions from first-level inducers and are usually employed with heterogeneous inducers. For example, let \(f_{\theta}\) be an individual QSAR classifier with parameter \(\theta\), trained for a single subject (drug-specific task) \(p(X)\) with dataset \(X\), that outputs \(y\) given an input \(x\). The optimal \(\theta\) can be achieved by

Then, the second-level learning will learn to maximize output \(y\) by learning how to update the individual QSAR classifier \(f_{\theta ^{*}}\). The “First-level: individual learning” section details the first-level learning, and the “Second-level: combined learning” section details the second-level learning.
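For contrast with meta-learning, the two weighting (non-learning) combiners named in this section, averaging and majority voting, can be sketched in a few lines. The function names, the 0.5 decision threshold, and the toy probabilities are assumptions made for illustration.

```python
def uniform_average(prob_rows):
    """Average the probabilities of all models for each compound."""
    return [sum(row) / len(row) for row in prob_rows]


def majority_vote(prob_rows, threshold=0.5):
    """Label a compound active (1) if more than half of the models say so."""
    votes = [[1 if p >= threshold else 0 for p in row] for row in prob_rows]
    return [1 if 2 * sum(v) > len(v) else 0 for v in votes]


# Each row holds the predictions of three first-level models for one compound.
P = [[0.9, 0.6, 0.2],
     [0.1, 0.4, 0.45],
     [0.55, 0.3, 0.7]]
avg = uniform_average(P)
votes = majority_vote(P)
```

Both combiners treat every model equally, which is why they suit homogeneous inducers; a learned second level can instead weight strong and weak models differently.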

Chemical compound representation

Chemical compounds can be expressed with various types of chemical descriptors that represent their structural information. One representative type of chemical compound descriptor is a molecular fingerprint. Molecular fingerprints are encoded representations of a molecular structure as a bit-string; these have been studied and used in drug discovery for a long time. Depending on the transformation to a bit-string, there are several types of molecular fingerprints: structure key-based, topological or path-based, circular, and hybrid [ 42 ]. Structure key-based fingerprints, such as PubChem [ 22 ] and MACCS [ 24 ], encode molecular structures based on the presence of substructures or features. Circular fingerprints, such as ECFP [ 23 ], encode molecular structures based on hashing fragments up to a specific radius.

Another chemical compound representation is the simplified molecular-input line-entry system (SMILES) [25], which is a string-type notation expressing a chemical compound structure with characters, e.g., C, O, or N for atoms, = for double bonds, and ( , ) for branches (ring closures are marked with matching digits). A SMILES string is generated by writing out the symbol nodes encountered in a depth-first traversal of the 2D structure graph. The generated SMILES can be reconverted into a 2D or 3D representation of the chemical compound.

Examples of SMILES and molecular fingerprints of leucine, which is an essential amino acid for hemoglobin formation, are as follows:

SMILES string: CC(C)CC(C(=O)O)N

PubChem fingerprint: 1,1,0,0,0,0,0,0,0,1,1,0,0,0,1,0, ⋯

ECFP fingerprint: 0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0, ⋯

MACCS fingerprint: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, ⋯

(Most values in this molecular fingerprint are zero).
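To make the contrast with bit-string fingerprints concrete, the sketch below turns the leucine SMILES string into a padded one-hot matrix, a common input form for sequence models such as 1D-CNNs and RNNs. Character-level tokenization, the reserved padding index 0, and the maximum length of 20 are assumptions for illustration; the source does not state the exact encoding used, and a character-level scheme would split two-letter atoms such as Cl or Br.

```python
def build_vocabulary(smiles_list):
    """Map each distinct character to an index; 0 is reserved for padding."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def one_hot_encode(smiles, vocab, max_len):
    """Return a max_len x (len(vocab) + 1) matrix with one 1 per row."""
    width = len(vocab) + 1
    rows = []
    for pos in range(max_len):
        row = [0] * width
        # Positions past the end of the string are marked as padding.
        idx = vocab[smiles[pos]] if pos < len(smiles) else 0
        row[idx] = 1
        rows.append(row)
    return rows


leucine = "CC(C)CC(C(=O)O)N"
vocab = build_vocabulary([leucine])
encoded = one_hot_encode(leucine, vocab, max_len=20)
```

Unlike a fingerprint, this matrix preserves the order of the symbols, which is the information a convolutional or recurrent layer can exploit.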

Figure 3 shows the two-levels of learning procedure. First-level learning is an individual learning level from diversified learning algorithms and chemical compound representations. The prediction probabilities produced from first-level learning models are used as inputs for second-level learning. Second-level learning makes the final decision by learning the importance of individual models produced from the first-level predictions.

Fig. 3 Learning procedure of the proposed comprehensive ensemble. The i-th individual learning algorithm \(\mathcal{L}_{i}\) outputs its prediction probability \(P_{i}\) for the training dataset through 5-fold cross-validation. The n diverse learning algorithms produce n prediction probabilities \((P_{1}, P_{2}, \cdots, P_{n})\). The probabilities are concatenated and then used as input to the second-level learning algorithm \(\boldsymbol{\mathcal{L}}\), which makes the final decision \(\hat{y}\). a First-level learning. b Second-level learning
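The 5-fold cross-validation step in the caption can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the helper names (`k_fold_indices`, `out_of_fold_probabilities`, `fit_centroid`), the toy bit vectors, and the centroid-based stand-in learner are all assumptions introduced here. The point it shows is that each training sample receives a probability from a model that never saw that sample, so the second level is not trained on leaked predictions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::k] for f in range(k)]


def out_of_fold_probabilities(fit, X, y, k=5, seed=0):
    """Return one held-out probability per training sample.

    `fit(X_train, y_train)` is a hypothetical callable that returns a
    function mapping a single sample to a probability in [0, 1].
    """
    P = [None] * len(X)
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            P[i] = predict(X[i])  # prediction from a model that never saw sample i
    return P


def fit_centroid(X_train, y_train):
    """Toy stand-in learner (an assumption, not one of the paper's models)."""
    def centroid(label):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_pos, c_neg = centroid(1), centroid(0)

    def predict(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total

    return predict


# Toy fingerprints: ten "active" and ten "inactive" bit vectors.
X = [[1, 1, 0, 0]] * 10 + [[0, 0, 1, 1]] * 10
y = [1] * 10 + [0] * 10
P_1 = out_of_fold_probabilities(fit_centroid, X, y, k=5)
```

Repeating this for each of the n first-level algorithms yields the columns \(P_{1}, \cdots, P_{n}\) that are concatenated for the second level.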

The notation used in our paper is as follows:

x: preprocessed chemical compound representation input, where x can be a particular type of molecular fingerprint or SMILES

h: hidden representation

\(\mathcal{L}\): first-level individual learning algorithm (\(\mathcal{L}_{i}\): i-th algorithm, \(i \in \{1, \cdots, n\}\))

\(\boldsymbol{\mathcal{L}}\): second-level learning algorithm

P: predicted probability from the individual model (\(P_{i}\): predicted probability from \(\mathcal{L}_{i}\))

\(\hat{y}\): final predicted decision from the second-level learning

\(\sigma\): activation function (\(\sigma_{s}\): sigmoid, \(\sigma_{r}\): rectified linear unit (ReLU), and \(\sigma_{t}\): hyperbolic tangent)

n: total number of individual algorithms

First-level: individual learning

By combining learning algorithms with chemical compound input representations, we generated thirteen individual learning models: nine models from conventional machine learning methods, three models from a plain feed-forward neural network, and one model from the newly proposed neural network based on a 1D-CNN and an RNN.

Conventional machine learning methods

Among the conventional machine learning methods, we used SVM, RF, and GBM with three types of molecular fingerprints, resulting in nine combination models consisting of all unique pairs of learning algorithms (SVM, RF, and GBM) and fingerprints (PubChem, ECFP, and MACCS). We set the penalty parameter to 0.05 for the linear SVM, and the number of estimators was set to 100 for RF and GBM based on a grid search and experimental efficiency. The prediction probabilities from these learning methods are used as inputs for second-level learning. However, SVM outputs a signed distance to the hyperplane rather than a probability. Thus, we applied a probability calibration method to convert the SVM results into probabilistic outputs.
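The text does not name the calibration method that was applied to the SVM outputs. One common choice is Platt scaling, which fits a sigmoid to the signed distances; the sketch below assumes that choice. The function name `fit_platt`, the learning rate, the iteration count, and the toy scores are illustrative assumptions, and the parameterisation \(p = \sigma(a s + b)\) differs in sign convention from Platt's original formulation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss.

    `scores` are signed distances to the SVM hyperplane and `labels` are 0/1.
    Platt scaling is assumed here; the source only says that a probability
    calibration method was used.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, t in zip(scores, labels):
            err = sigmoid(a * s + b) - t  # derivative of the log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy decision values (not from the paper); the classes overlap slightly.
scores = [-2.1, -1.3, -0.4, 0.2, -0.1, 0.5, 1.2, 2.4]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
a, b = fit_platt(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
```

In practice the sigmoid should be fitted on held-out decision values rather than on the samples used to train the SVM. Libraries such as scikit-learn provide this through `CalibratedClassifierCV`.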

Plain feed-forward neural network

We used a plain feed-forward neural network (NN) for the vector-type fingerprints: PubChem-NN, ECFP-NN, and MACCS-NN. The neural network structure consists of three fully connected layers (Fcl) with 512, 64, and 1 units, using the ReLU, tanh, and sigmoid activation functions, respectively:

\(P = \sigma_{s}(\mathrm{Fcl}(\sigma_{t}(\mathrm{Fcl}(\sigma_{r}(\mathrm{Fcl}(x))))))\)

The sigmoid activation function outputs a probability for binary classification. We used the Adam optimizer [ 43 ] with binary cross-entropy loss (learning rate: 0.001, epoch: 30, and mini-batch size: 256).
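The forward pass of this 512-64-1 network can be sketched in plain Python. This shows only the layer and activation order described above. The weights are random and untrained, the input length `n_bits` is an arbitrary illustrative value, and training with Adam and binary cross-entropy is omitted. The helper names are assumptions.

```python
import math
import random


def dense(x, W, b):
    """Fully connected layer: one output per row of the weight matrix W."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def init_layer(n_in, n_out, rng):
    """Random untrained weights, scaled so that activations stay small."""
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def forward(x, layers):
    (W1, b1), (W2, b2), (W3, b3) = layers
    h1 = [max(0.0, z) for z in dense(x, W1, b1)]    # ReLU, 512 units
    h2 = [math.tanh(z) for z in dense(h1, W2, b2)]  # tanh, 64 units
    z = dense(h2, W3, b3)[0]                        # single output unit
    return 1.0 / (1.0 + math.exp(-z))               # sigmoid gives a probability


rng = random.Random(0)
n_bits = 166  # assumed fingerprint length, for illustration only
layers = [
    init_layer(n_bits, 512, rng),
    init_layer(512, 64, rng),
    init_layer(64, 1, rng),
]
x = [rng.randint(0, 1) for _ in range(n_bits)]
P = forward(x, layers)
```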

Convolutional and recurrent neural networks

To learn key features automatically through end-to-end neural network learning, we used a SMILES string as input and exploited the neural network structures of 1D-CNNs and RNNs. A CNN is used to recognize short-term dependencies, and an RNN is used as the next layer to learn long-term dependencies from the recognized local patterns.

As illustrated in the preprocessing step of Fig. 4, the input SMILES strings were preprocessed with one-hot encoding [ 44 – 46 ], which sets only the corresponding symbol to 1 and all others to 0. The input is truncated or padded to a fixed length of 100. We consider only the nine most frequent characters in SMILES and treat the remaining symbols as OTHERS, which reduces the encoding dimension to 10.
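A minimal sketch of this preprocessing is given below. Several details are assumptions, because the text does not specify them: the vocabulary is taken from character counts over a small arbitrary corpus, tokenization is per character (so a two-letter atom such as Br becomes two symbols), and padding appends all-zero rows at the end. The function names are hypothetical.

```python
from collections import Counter


def build_vocab(smiles_list, n_symbols=9):
    """Return the n_symbols most frequent characters in the corpus."""
    counts = Counter(ch for s in smiles_list for ch in s)
    return [ch for ch, _ in counts.most_common(n_symbols)]


def one_hot_smiles(smiles, vocab, max_len=100):
    """Encode a SMILES string as a max_len x (len(vocab) + 1) one-hot matrix."""
    dim = len(vocab) + 1           # the last column is the OTHERS symbol
    other = len(vocab)
    index = {ch: i for i, ch in enumerate(vocab)}
    rows = []
    for ch in smiles[:max_len]:    # truncate to max_len characters
        row = [0] * dim
        row[index.get(ch, other)] = 1
        rows.append(row)
    rows += [[0] * dim for _ in range(max_len - len(rows))]  # pad with zero rows
    return rows


# Arbitrary example molecules, used only to obtain more than nine distinct symbols.
corpus = [
    "CC(C)CC(C(=O)O)N",   # leucine, as in the example above
    "c1ccccc1O",
    "CCN(CC)CC",
    "C1CCCCC1",
    "OC(=O)c1ccccc1Br",
    "CS(=O)C",
    "FC(F)F",
]
vocab = build_vocab(corpus)
encoded = one_hot_smiles("CC(C)CC(C(=O)O)N", vocab)
```

The result has 100 rows of dimension 10, which is the input shape expected by the CNN layer described next.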

Fig. 4 Proposed CNN + RNN model. The input SMILES strings are converted with one-hot encoding and truncated to a maximum length of 100. The preprocessed input is subsequently fed to the CNN layer without pooling, and the outputs are directly fed into the GRU layer

As illustrated in the neural network step of Fig. 4, the preprocessed input x was fed into the CNN layer without pooling (CNN filter length: 17, number of filters: 384). Then, the outputs from the CNN were fed into the GRU layer (dimension: 9, structure: many-to-many):

\(h = \sigma_{t}(\mathrm{GRU}(\sigma_{r}(\mathrm{Conv}(x))))\),

where h is the output of the GRU layer, \(\sigma_{r}\) is the ReLU, and \(\sigma_{t}\) is the hyperbolic tangent. The output h was flattened and then fed into a fully connected neural network, which produces P, the output probability from the sigmoid activation function for binary classification. The output P is subsequently used for second-level learning, as in the last step in Fig. 4.
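To make the tensor sizes concrete, the following sketch traces the shapes through the layers using the hyperparameters quoted above. The padding mode of the convolution is not stated in the text, so both common options are supported and the choice is an assumption. The function names are illustrative.

```python
def conv1d_output_length(n_steps, kernel_size, padding="valid"):
    """Sequence length after a stride-1 1D convolution without pooling."""
    if padding == "same":
        return n_steps                  # zero-padded so the length is preserved
    return n_steps - kernel_size + 1    # "valid": no padding


def cnn_gru_shapes(max_len=100, n_symbols=10, kernel_size=17,
                   n_filters=384, gru_units=9, padding="valid"):
    """Shapes of one sample as it passes through the CNN + GRU model."""
    steps = conv1d_output_length(max_len, kernel_size, padding)
    return {
        "input": (max_len, n_symbols),    # one-hot encoded SMILES
        "conv": (steps, n_filters),       # ReLU feature maps, no pooling
        "gru": (steps, gru_units),        # many-to-many: one state per step
        "flatten": (steps * gru_units,),  # input to the fully connected part
    }


shapes_valid = cnn_gru_shapes(padding="valid")
shapes_same = cnn_gru_shapes(padding="same")
```

With "valid" padding the flattened vector has 84 × 9 = 756 entries, and with "same" padding it has 100 × 9 = 900. Because the GRU is many-to-many, every time step contributes to the flattened representation rather than only the final state.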

We used dropout for each layer (CNN: 0.9, RNN: 0.6, first Fcl: 0.6) and an Adam optimizer (learning rate: 0.001, epoch: 120, mini-batch size: 256) with binary cross-entropy. Most of these hyperparameters were empirically determined.

Second-level: combined learning

We combined the first-level predictions generated from the set of individual models to obtain the final decision.

We have n individual learning algorithms \(\mathcal{L}_{i}\), where \(i \in \{1, \cdots, n\}\), and the i-th model outputs the prediction probability \(P_{i}\) for a given x. We can determine the final prediction \(\hat{y}\) by weighting each \(P_{i}\) with a weight \(w_{i}\):

\(\hat{y} = \sum_{i=1}^{n} w_{i} P_{i}\),

where setting \(w_{i} = 1/n\) for all i corresponds to uniform averaging.
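As a small worked example of the weighted combination, the sketch below uses three made-up probabilities and hand-picked weights, neither of which comes from the paper. When no weights are given, it falls back to uniform averaging.

```python
def weighted_average(probabilities, weights=None):
    """Combine first-level probabilities P_i with weights w_i.

    With weights=None every model gets w_i = 1/n (uniform averaging).
    """
    n = len(probabilities)
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("need exactly one weight per model")
    return sum(w * p for w, p in zip(weights, probabilities))


P = [0.9, 0.6, 0.3]                               # illustrative values only
y_uniform = weighted_average(P)                   # (0.9 + 0.6 + 0.3) / 3
y_weighted = weighted_average(P, [0.5, 0.3, 0.2])  # 0.45 + 0.18 + 0.06
```

Uniform averaging gives 0.6 here, whereas putting more weight on the first model raises the combined score to 0.69.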

As another technique, we can combine the first-level output predictions through meta-learning. The performance of individual methods varies from dataset to dataset, as shown in the “Performance comparison with individual models” section; no single method is universally best. The weights learned for the individual models are therefore specific to each dataset. Thus, we use learning-based combining methods (meta-learning) rather than simple averaging or voting.

\(\hat{y} = \boldsymbol{\mathcal{L}}(P_{1}, P_{2}, \cdots, P_{n})\),

where \(\boldsymbol{\mathcal{L}}\) is a second-level learning algorithm, and any machine learning method can be applied at this level. All \(P_{i}\), where \(i \in \{1, 2, \cdots, n\}\), are concatenated and used as inputs. The model importance imposes a weight \(w_{i}\) on \(P_{i}\) and is determined through meta-learning.
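Since any learner can serve at the second level, the sketch below assumes logistic regression purely for illustration. It is fitted by plain gradient descent on concatenated first-level probabilities. The data are made up, and the function names, learning rate, and iteration count are assumptions.

```python
import math


def predict_meta(row, w, b):
    """Second-level probability for one row of concatenated P_i values."""
    z = b + sum(wi * p for wi, p in zip(w, row))
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(P_rows, y, w, b):
    total = 0.0
    for row, t in zip(P_rows, y):
        p = predict_meta(row, w, b)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y)


def fit_meta_learner(P_rows, y, lr=0.5, n_iter=3000):
    """Logistic regression on first-level probabilities (an assumed choice)."""
    n_models = len(P_rows[0])
    w, b = [0.0] * n_models, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * n_models, 0.0
        for row, t in zip(P_rows, y):
            err = predict_meta(row, w, b) - t
            for j, p in enumerate(row):
                grad_w[j] += err * p
            grad_b += err
        w = [wi - lr * g / len(y) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(y)
    return w, b


# Each row holds (P_1, P_2, P_3) for one compound; the values are illustrative only.
P_rows = [
    [0.9, 0.6, 0.5], [0.8, 0.4, 0.4], [0.7, 0.7, 0.6], [0.4, 0.6, 0.5],
    [0.2, 0.5, 0.5], [0.3, 0.3, 0.6], [0.1, 0.6, 0.4], [0.6, 0.4, 0.5],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = fit_meta_learner(P_rows, y)
y_hat = [predict_meta(row, w, b) for row in P_rows]
```

The learned coefficients play the role of the model importances \(w_{i}\). The rows fed to the meta-learner should be out-of-fold predictions, as in the 5-fold procedure of Fig. 3, so that the second level does not reward first-level models that merely memorized the training set.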

Availability of data and materials

The datasets generated and/or analyzed during the current study are available at http://data.snu.ac.kr/QSAR/ .

Abbreviations

One-dimensional convolutional neural networks

Area under the curve of the receiver operating characteristic curve

Area under the curve

Gradient boosting machine

Gated recurrent units

High throughput screening

Multi-task neural networks

Neural network

Quantitative structure-activity relationship

Random forest

Recurrent neural network

Simplified molecular-input line-entry system

Support vector machine

References

1. Verma J, Khedkar VM, Coutinho EC. 3D-QSAR in drug design-a review. Curr Top Med Chem. 2010; 10(1):95–115.
2. Ma J, Sheridan RP, Liaw A, Dahl GE, Svetnik V. Deep neural nets as a method for quantitative structure–activity relationships. J Chem Inf Model. 2015; 55(2):263–74.
3. Golbraikh A, Wang XS, Zhu H, Tropsha A. Predictive QSAR modeling: methods and applications in drug discovery and chemical risk assessment. Handb Comput Chem. 2016:1–48. https://doi.org/10.1007/978-94-007-6169-8_37-3 .
4. Luco JM, Ferretti FH. QSAR based on multiple linear regression and PLS methods for the anti-HIV activity of a large group of HEPT derivatives. J Chem Inf Comput Sci. 1997; 37(2):392–401.
5. Burden FR, Winkler DA. Robust QSAR models using Bayesian regularized neural networks. J Med Chem. 1999; 42(16):3183–7.
6. Burden FR, Ford MG, Whitley DC, Winkler DA. Use of automatic relevance determination in QSAR studies using Bayesian neural networks. J Chem Inf Comput Sci. 2000; 40(6):1423–30.
7. Pradeep P, Povinelli RJ, White S, Merrill SJ. An ensemble model of QSAR tools for regulatory risk assessment. J Cheminformatics. 2016; 8(1):48.
8. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP. Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci. 2003; 43(6):1947–58.
9. Zakharov AV, Varlamova EV, Lagunin AA, Dmitriev AV, Muratov EN, Fourches D, Kuz’min VE, Poroikov VV, Tropsha A, Nicklaus MC. QSAR modeling and prediction of drug–drug interactions. Mol Pharm. 2016; 13(2):545–56.
10. Dahl GE, Jaitly N, Salakhutdinov R. Multi-task neural networks for QSAR predictions. arXiv preprint. 2014. arXiv:1406.1231.
11. Dietterich TG. Ensemble methods in machine learning. In: Goos G, Hartmanis J, Van Leeuwen JP, editors. International Workshop on Multiple Classifier Systems. Springer: 2000. p. 1–15.
12. Hansen LK, Salamon P. Neural network ensembles. IEEE Trans Pattern Anal Mach Intell. 1990; 12(10):993–1001.
13. Ju C, Bibaut A, van der Laan M. The relative performance of ensemble methods with deep convolutional neural networks for image classification. J Appl Stat. 2018; 45(15):2800–18.
14. Ezzat A, Wu M, Li X, Kwoh C-K. Computational prediction of drug-target interactions via ensemble learning. In: Computational Methods for Drug Repurposing. Springer: 2019. p. 239–54. https://doi.org/10.1007/978-1-4939-8955-3_14 .
15. Agrafiotis DK, Cedeno W, Lobanov VS. On the use of neural network ensembles in QSAR and QSPR. J Chem Inf Comput Sci. 2002; 42(4):903–11.
16. Thomas P, Neves M, Solt I, Tikk D, Leser U. Relation extraction for drug-drug interactions using ensemble learning. Training. 2011; 4(2,402):21–425.
17. Basant N, Gupta S, Singh KP. Predicting human intestinal absorption of diverse chemicals using ensemble learning based QSAR modeling approaches. Comput Biol Chem. 2016; 61:178–96.
18. Wang W, Kim MT, Sedykh A, Zhu H. Developing enhanced blood–brain barrier permeability models: integrating external bio-assay data in QSAR modeling. Pharm Res. 2015; 32(9):3055–65.
19. Afolabi LT, Saeed F, Hashim H, Petinrin OO. Ensemble learning method for the prediction of new bioactive molecules. PLoS ONE. 2018; 13(1):e0189538.
20. Ponzoni I, Sebastián-Pérez V, Requena-Triguero C, Roca C, Martínez MJ, Cravero F, Díaz MF, Páez JA, Arrayás RG, Adrio J, et al. Hybridizing feature selection and feature learning approaches in QSAR modeling for drug discovery. Sci Rep. 2017; 7(1):2403.
21. Zhang L, Ai H, Chen W, Yin Z, Hu H, Zhu J, Zhao J, Zhao Q, Liu H. CarcinoPred-EL: Novel models for predicting the carcinogenicity of chemicals using molecular fingerprints and ensemble learning methods. Sci Rep. 2017; 7(1):2118.
22. Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009; 37(suppl 2):623–33.
23. Morgan H. The generation of a unique machine description for chemical structures-a technique developed at Chemical Abstracts Service. J Chem Doc. 1965; 5(2):107–13.
24. Durant JL, Leland BA, Henry DR, Nourse JG. Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci. 2002; 42(6):1273–80.
25. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988; 28(1):31–6. https://doi.org/10.1021/ci00057a005 .
26. Swain M. PubChemPy: a way to interact with PubChem in Python. 2014.
27. Landrum G. RDKit: Open-source cheminformatics. 2006. https://pubchempy.readthedocs.io/en/latest/ . Accessed 4 Mar 2012.
28. Galar M, Fernandez A, Barrenechea E, Bustince H, Herrera F. A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans Syst Man Cybern Part C (Appl Rev). 2012; 42(4):463–84.
29. Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V. MoleculeNet: a benchmark for molecular machine learning. Chem Sci. 2018; 9(2):513–30.
30. Wei L, Wan S, Guo J, Wong KK. A novel hierarchical selective ensemble classifier with bioinformatics application. Artif Intell Med. 2017; 83:82–90.
31. Huang M-W, Chen C-W, Lin W-C, Ke S-W, Tsai C-F. SVM and SVM ensembles in breast cancer prediction. PLoS ONE. 2017; 12(1):e0161501.
32. Xiao Y, Wu J, Lin Z, Zhao X. A deep learning-based multi-model ensemble method for cancer prediction. Comput Methods Prog Biomed. 2018; 153:1–9.
33. Rokach L. Ensemble-based classifiers. Artif Intell Rev. 2010; 33(1-2):1–39.
34. Tumer K, Ghosh J. Error correlation and error reduction in ensemble classifiers. Connect Sci. 1996; 8(3-4):385–404.
35. Krogh A, Vedelsby J. Neural network ensembles, cross validation, and active learning. In: NIPS: 1995. p. 231–8.
36. Hu X. Using rough sets theory and database operations to construct a good ensemble of classifiers for data mining applications. In: Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference On. IEEE: 2001. p. 233–40. https://doi.org/10.1109/icdm.2001.989524 .
37. Breiman L. Bagging predictors. Mach Learn. 1996; 24(2):123–40.
38. Breiman L. Random forests. Mach Learn. 2001; 45(1):5–32.
39. Vapnik V. The nature of statistical learning theory. 2013. https://doi.org/10.1007/978-1-4757-3264-1 .
40. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001; 29:1189–232.
41. Wolpert DH. Stacked generalization. Neural Netw. 1992; 5(2):241–59.
42. Cereto-Massagué A, Ojeda MJ, Valls C, Mulero M, Garcia-Vallvé S, Pujadas G. Molecular fingerprint similarity search in virtual screening. Methods. 2015; 71:58–63.
43. Kingma D, Ba J. Adam: A method for stochastic optimization. arXiv preprint. 2014. arXiv:1412.6980.
44. Winter R, Montanari F, Noé F, Clevert D-A. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci. 2019; 10(6):1692–701.
45. Peric B, Sierra J, Martí E, Cruañas R, Garau MA. Quantitative structure–activity relationship (QSAR) prediction of (eco) toxicity of short aliphatic protic ionic liquids. Ecotoxicol Environ Saf. 2015; 115:257–62.
46. Choi J-S, Ha MK, Trinh TX, Yoon TH, Byun H-G. Towards a generalized toxicity prediction model for oxide nanomaterials using integrated data from different sources. Sci Rep. 2018; 8(1):6110.

Acknowledgments

The authors would like to thank the anonymous reviewers of this manuscript for their helpful comments and suggestions.

Publication costs were funded by Seoul National University. This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) [2014M3C9A3063541, 2018R1A2B3001628], the Brain Korea 21 Plus Project in 2018, and the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea [HI15C3224]. The funding bodies did not play any roles in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Sunyoung Kwon and Ho Bae contributed equally to this work.

Authors and Affiliations

Department of Electrical and Computer Engineering, Seoul National University, Seoul, 08826, South Korea

Sunyoung Kwon & Sungroh Yoon

Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, 08826, South Korea

Ho Bae, Jeonghee Jo & Sungroh Yoon

Clova AI Research, NAVER Corp., Seongnam, 13561, South Korea

Sunyoung Kwon

Biological Sciences, Seoul National University, Seoul, 08826, South Korea

Sungroh Yoon

ASRI and INMC, Seoul National University, Seoul, 08826, South Korea

Institute of Engineering Research, Seoul National University, Seoul, 08826, South Korea

Contributions

SK and HB designed and carried out experiments, performed analysis, and wrote the manuscript. JJ participated in experiments and editing the manuscript. SY conceived and supervised the research and edited the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Sungroh Yoon.

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article.

Kwon, S., Bae, H., Jo, J. et al. Comprehensive ensemble in QSAR prediction for drug discovery. BMC Bioinformatics 20, 521 (2019). https://doi.org/10.1186/s12859-019-3135-4

Received: 02 May 2019

Accepted: 09 October 2019

Published: 26 October 2019

DOI: https://doi.org/10.1186/s12859-019-3135-4

Ensemble-learning
Meta-learning
Drug-prediction

BMC Bioinformatics

ISSN: 1471-2105

Milano Chemometrics and QSAR Research Group

Department of Earth and Environmental Sciences – University of Milano-Bicocca

Giacomo Baccolo: Chemometrics approaches for the automatic analysis of metabolomics GC-MS data (2022)

Cecile Valsecchi: Advancing the prediction of Nuclear Receptor modulators through machine learning methods (2022)

Francesca Grisoni: In silico assessment of aquatic bioaccumulation: advances from chemometrics and QSAR modelling (2016)

Matteo Cassotti: QSAR study of aquatic toxicity by chemometrics methods in the framework of REACH regulation (2015)

Kamel Mansouri: New molecular descriptors for estimating degradation and fate of organic pollutants by QSAR/QSPR models within REACH (2013)

Faizan Sahigara: Tools for prediction of environmental properties of chemicals by QSAR/QSPR within REACH. An applicability domain perspective (2013)

Andrea Mauri: Protein and peptide multivariate characterisation using a molecular descriptor based approach (2007)

Davide Ballabio: Chemometric characterisation of physical-chemical fingerprints of food products (2006)

Manuela Pavan: Total and partial ranking methods in chemical sciences (2003)

Recent Advances in Fragment-Based QSAR and Multi-Dimensional QSAR Methods

1. Introduction
2. Fragment-Based 2D-QSAR Methods
2.1. Hologram-QSAR (HQSAR)
2.2. Fragment-Based QSAR (FB-QSAR)
2.3. Fragment-Similarity Based QSAR (FS-QSAR)
2.4. Top Priority Fragment QSAR
2.5. Other Fragment-Related QSAR Studies
3.1. Comparative Molecular Field Analysis (CoMFA) and Comparative Molecular Similarity Indices Analysis (CoMSIA)
3.2. Topomer CoMFA
3.3. Self-Organizing Molecular Field Analysis (SOMFA)
3.4. Alignment-Free 3D-QSAR Methods
3.4.1. Autocorrelation of Molecular Surfaces Properties (AMSP)
3.4.2. Comparative Molecular Moment Analysis (CoMMA)
3.4.3. Weighted Holistic Invariant Molecular (WHIM) Descriptor-Based QSAR
3.4.4. Grid-Independent Descriptors (GRIND)-Based QSAR
3.5. Multi-Dimensional (nD) QSAR Methods
4. Comparison of 2D or Fragment-Based QSAR versus 3D or nD-QSAR Methods
5. Conclusion
Acknowledgement

Li, Q; J⊘rgensen, FS; Oprea, T; Brunak, S; Taboureau, O. hERG classification model based on a combination of support vector machine method and GRIND descriptors. Mol. Pharm 2008 , 5 , 117–127. [ Google Scholar ]
Romeiro, N; Albuquerque, M; Alencastro, R; Ravi, M; Hopfinger, A. Construction of 4D-QSAR models for use in the design of novel p38-MAPK inhibitors. J. Comput.-Aided Mol. Design 2005 , 19 , 385–400. [ Google Scholar ]

Summary of different QSAR methods and source information.
Method	nD	Dataset	Statistical model	Performance	Reference/Website
HQSAR	2D	21 Steroids	PLS	q = 0.71; r = 0.85 [ ]	[ ]
FB-QSAR	2D	48 NA analogs	IDLS	r = 0.95 (r = 0.91) [ ]	[ ]
FS-QSAR	2D	85 bis-sulfone analogs; 83 COX2 analogs	MLR	r = 0.68; r = 0.62 [ ]	[ ]
TPF-QSAR	2D	282 pesticides	PM-based prediction	r = 0.75 [ ]	[ ]
CoMFA	3D	21 Steroids	PLS	q = 0.75; r = 0.96 [ ]	[ ]
CoMFA	3D	54 HIV-1PR inhibitors	PLS	q = 0.68; r = 0.69 [ ]	[ ]
CoMSIA	3D	Thermolysin inhibitors	PLS	q = [0.59, 0.64] [ ]	[ , ]
CoMSIA	3D	54 HIV-1PR inhibitors	PLS	q = 0.65; r = 0.73 [ ]	[ ]
Topomer CoMFA	3D	15 datasets from literature	PLS	average q = 0.636 [ ]	[ ]
SOMFA	3D	31 steroids; 35 sulfonamides	MLR	r = 0.58; r = 0.53 [ ]	[ ]
AMSP	3D	31 steroids	MNN	q = 0.63; r = 0.67 [ ]	[ ]
CoMMA	3D	31 steroids	PLS	q = [0.41, 0.82] [ ]	[ ]
WHIM	3D	31 steroids	PCA	SDEP = 1.750 [ ]	[ ]
MS-WHIM	3D	31 steroids	PCA	SDEP = 0.742 [ ]	[ ]
GRIND	3D	31 steroids	PLS; PCA	q = 0.64; SDEP = 0.26 [ ]	[ ]
GRIND	3D	175 hERG inhibitors	PLS; SVM	q = 0.41; r = 0.57; SDEP = 0.72 [63]	[63]
4D-QSAR	4D	20 DHFR inhibitors; 42 PGF a analogs; 40 2-substituted dipyridodiazepione inhibitors	PLS	r = [0.90, 0.95]; r = [0.73, 0.86]; r = [0.67, 0.76] [ ]	[ ]
4D-QSAR	4D	33 p38-MAPK inhibitors	GL-PLS	q = [0.67, 0.85] [64]	[64]
5D-QSAR	5D	65 NK-1 antagonists; 131 Ah ligands	MLR	r = 0.84; r = 0.83 [ ]	[ ]
6D-QSAR	6D	106 estrogen receptor ligands	MLR	q = 0.90; r = 0.89 [ ]	[ ]
Methods: HQSAR = Hologram QSAR; FB-QSAR = Fragment-based QSAR; FS-QSAR = Fragment-similarity-based QSAR; TPF-QSAR = Top priority fragment QSAR; CoMFA = Comparative molecular field analysis; CoMSIA = Comparative molecular similarity indices analysis; SOMFA = Self-organizing molecular field analysis; AMSP = Autocorrelation of molecular surface properties; CoMMA = Comparative molecular moment analysis; WHIM = Weighted holistic invariant molecular QSAR; MS-WHIM = Molecular surface WHIM; GRIND = Grid independent descriptor.
Statistical models: PLS = Partial least square; IDLS = Iterative double least square; PM = Priority matrix; MNN = Multilayer neural networks; MLR = Multiple linear regression; PCA = Principal component analysis.
Performance: q = cross-validated r; SDEP = standard deviation of errors of prediction.

© 2010 by the authors; licensee Molecular Diversity Preservation International, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Myint, K.Z.; Xie, X.-Q. Recent Advances in Fragment-Based QSAR and Multi-Dimensional QSAR Methods. Int. J. Mol. Sci. 2010 , 11 , 3846-3866. https://doi.org/10.3390/ijms11103846

Quantitative structure–activity relationship-based computational approaches

Virupaksha Bastikar

1 Amity Institute of Biotechnology, Amity University, Mumbai, Maharashtra, India

Alpana Bastikar

2 Navin Saxena Research and Technology Pvt. Ltd, Gandhidham, Gujarat, India

Pramodkumar Gupta

3 School of Biotechnology and Bioinformatics, D Y Patil Deemed to be University, Navi Mumbai, Maharashtra, India

The World Health Organization (WHO) has categorized the novel coronavirus disease (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), as a global pandemic. The infection has spread alarmingly, causing enormous social and economic disruption. To respond rapidly, inhibitors previously designed against different targets are a good starting point for anti-SARS-CoV-2 inhibitors. This chapter deals with the quantitative structure–activity relationship (QSAR) techniques currently used in computational drug design, and with their applications and advantages in the overall drug design process. It reviews current QSAR studies carried out against SARS-CoV-2. The QSAR study design has three major facets: (1) classification QSAR-based data mining of various inhibitors; (2) QSAR-based virtual screening to identify molecules that could be effective against putative COVID-19 protein targets; and (3) validation of hits through receptor–ligand interaction analysis. This overall approach supports COVID-19 drug discovery. The chapter presents key concepts and sets the stage for QSAR-based screening of active molecules against SARS-CoV-2. Moreover, the reported QSAR models can be used to screen large databases. The chapter gives a first-hand review of the QSAR parameters currently used to generate a good QSAR model against SARS-CoV-2 and subsequently to design a drug against COVID-19.

10.1. Introduction

Quantitative structure–activity relationship (QSAR) is a methodology for relating the chemical structure of a molecule to its biochemical, physical, pharmaceutical, or biological effect. QSAR-based strategies are widely used in chemoinformatics and drug discovery to calculate the biological activity of chemical compounds, and also in pharmacological and ecotoxicological assessments of individual chemicals as part of risk management. QSAR models are developed for computational drug design, activity prediction, and toxicology prediction. QSAR is defined as the quantitative correlation of biological activities with physicochemical properties (Puzyn & Leszczynski, 2012).

Biological activity = f (physicochemical parameter)
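
As one concrete instance of the general relation above, a classical Hansch-type linear model is often written in the form below. This is an illustrative form, not an equation given in this chapter: $C$ is assumed to be the molar concentration producing a defined biological response, $\log P$ the partition coefficient, $\sigma$ the Hammett electronic constant, and $a$, $b$, $c$ coefficients fitted by regression.

```latex
\log\left(\frac{1}{C}\right) = a \,\log P + b\,\sigma + c
```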

QSAR studies have important applications in modern chemistry and biochemistry. QSAR helps in finding compounds with desired properties by using chemical information and its association with biological activity. Physicochemical properties such as the partition coefficient, and the presence or absence of certain chemical features, are taken into consideration. QSAR attempts to correlate structural, chemical, statistical, and physical properties with biological potency using various mathematical methods. The resulting QSAR models are used to predict and classify the biological activities of new chemical compounds. QSAR guides lead optimization and is also used as a screening and enrichment tool to remove compounds that lack drug-like properties or are predicted to be toxic (Gajewicz et al., 2012) (Fig. 10.1).

Fig. 10.1. History of quantitative structure–activity relationship.

10.2. The importance of quantitative structure–activity relationship

The motivations for developing in silico QSAR models include the following:

1. To predict the biological activity of compounds and understand their physicochemical properties by mathematical methods. The biological activity of compounds can be studied and predicted by developing QSAR models for many drug classes.
2. To understand and rationalize the mechanisms of action within a series of chemicals. A QSAR model built on a series of molecules that share a mechanism of action can predict the activity of untested molecules. Similar molecules generally exhibit a similar type of activity within a particular range, so the activity of a new molecule of the same class can be predicted, and the QSAR model helps to improve its activity and to design new molecules.
3. To save on the cost of product development (e.g., drugs, pesticides), both in the synthesis and manufacturing of the molecule and in its in vitro and in vivo testing. Once a set of newly designed molecules is predicted mathematically to have better activity, only those need be taken forward for synthesis, and the others predicted to have poor activity can be rejected. The cost and time of the whole study are thus reduced compared with the traditional method of drug design.
4. To reduce the need for extensive and costly animal tests, thereby avoiding ethical concerns. Sacrificing an animal every time simply to check whether a novel molecule shows activity is not feasible in terms of cost, time, or ethics. QSAR helps avoid unnecessary animal testing of novel molecules.
5. To advance green chemistry by increasing efficiency and eliminating waste, since leads unlikely to succeed are not pursued. Based on QSAR results, the synthesis of molecules that would be harmful to the environment can be avoided (Aptula & Roberts, 2006).

10.3. Requirements to generate a good quantitative structure–activity relationship model

Based on Fig. 10.2, generating a QSAR model requires the following:

1. A set of molecules to be used for generating the QSAR model: A dataset of structurally similar molecules, for which the QSAR model is to be developed, is prepared for the study. Depending on the type of QSAR, the molecules need to be minimized or cleaned.
2. A set of molecular descriptors generated for the dataset of molecules: Once the molecules are finalized, their parameters, known as descriptors, are calculated. These can be overall structural properties of the molecules, two-dimensional properties, three-dimensional properties of the molecule in space, or properties of its different conformations.
3. Biological activity (IC50, EC50, etc.) of the set of molecules: The molecules for which the QSAR model is to be developed should have a definite, known biological activity value that can be correlated with the generated molecular descriptors to yield a good and reliable QSAR model.
4. Statistical methods to develop a QSAR model: Various statistical methods such as clustering, partial least squares, regression, and principal component analysis (PCA) can be used to develop a mathematical correlation between the biological activity and the calculated descriptors.
5. Validation: The QSAR model thus generated is validated and, if found to be robust, is used to predict the activity of unknown compounds belonging to the same class of molecules as the dataset in terms of the same disease, the same type of biological activity, the same scaffold, the same pharmacophore, etc.
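
Requirement 3 above usually involves a preprocessing step before modeling: measured IC50 values are commonly converted to a logarithmic scale. The following is a minimal sketch of that conversion, assuming activities are reported in nanomolar; the compound names and values are made up for illustration and do not come from any study cited here.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_nm * 1e-9)


# Hypothetical activities (nM) for three made-up compounds
activities_nm = {"cpd_A": 10.0, "cpd_B": 250.0, "cpd_C": 1000.0}

# Response vector that a QSAR model would be fitted against
pic50 = {name: ic50_nm_to_pic50(value) for name, value in activities_nm.items()}
```

On this scale a more potent compound has a larger value, which tends to make the response better suited to linear modeling than raw concentrations spanning several orders of magnitude.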

Fig. 10.2. Quantitative structure–activity relationship.

10.4. Applications of quantitative structure–activity relationship in various fields

The ability to predict biological activity is valuable in many industries. While some QSARs appear to be little more than academic studies, there are many applications of these models in industry, academia, and governmental (regulatory) agencies (OECD, 2007). A few potential uses are listed below:

1. Chemical: One of the first historical applications was the prediction of boiling points. It is well known, for example, that within a particular family of chemical compounds, especially in organic chemistry, there are strong correlations between the structure of a molecule and its observed properties. A simple example is the relationship between the number of carbons in alkanes and their boiling points. The boiling point clearly increases with the number of carbons, and this trend serves as a means of predicting the boiling points of higher alkanes. This chemical property can thus be exploited by building a QSAR model of it and predicting the boiling points of alkanes from their structure.
2. Biological: The biological activity of a molecule is usually measured in order to establish the degree of inhibition of a specific signal transduction or metabolic pathway. Drug discovery often involves the use of QSAR to identify chemical structures that could have a good inhibitory effect on a given protein target. A set of organic molecules can be tested against a particular protein or enzyme target to study their effect on the metabolic pathway involved, and the resulting QSAR model is useful for studying the mechanism of action of the drugs on that pathway.
3. Rational identification of new leads with pharmacological, biocidal, or pesticidal activity.
4. Optimization of pharmacological, biocidal, or pesticidal activity.
5. Identification of toxic compounds at the early stages of ligand development, or during the screening of databases of existing compounds.
6. Prediction of toxicity to environmental species, and selection of compounds with optimal pharmacokinetic properties, whether they are to be synthesized or are available in biological systems.
7. Prediction of a variety of physicochemical properties of molecules (whether drugs, pesticides, personal products, or fine chemicals).
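
The alkane example in item 1 can be sketched as a one-descriptor model. The snippet below is an illustration only: it fits a straight line to rounded literature boiling points of butane through octane, using carbon count as the single descriptor, and then extrapolates to nonane. The choice of a linear model and of this training range is an assumption made for simplicity.

```python
import numpy as np

# Normal boiling points (deg C) of n-alkanes C4 to C8, rounded literature values
carbons = np.array([4, 5, 6, 7, 8])
boiling_c = np.array([-0.5, 36.1, 68.7, 98.4, 125.7])

# Ordinary least-squares line: boiling point = slope * carbons + intercept
slope, intercept = np.polyfit(carbons, boiling_c, deg=1)


def predict_boiling_point(n_carbons: int) -> float:
    """Predict the boiling point (deg C) of an n-alkane from its carbon count."""
    return float(slope * n_carbons + intercept)


# Extrapolation to nonane; the experimental value is about 151 deg C
nonane_pred = predict_boiling_point(9)
```

The extrapolated value overshoots the experimental one somewhat, because the increase per added carbon shrinks along the series. This is a small reminder that a model is most trustworthy inside the range of its training data.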

Characteristic features of a good QSAR model ( Todeschini & Consonni, 2000 ):

1. A defined endpoint: Every QSAR model should be developed for a specific endpoint, for example, biological activity, toxicity, skin sensitization, or mutagenicity, which should be specified at the outset of model development.
2. An unambiguous algorithm: An algorithm or mathematical model that predicts the defined endpoint and does not give any vague result.
3. A defined domain of applicability: The physicochemical, structural, or biological space, data, or information on which the training set of the model was established, and within which the model can be applied to make predictions for new compounds.
4. An appropriate measure of goodness of fit: The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between the observed values and the values expected under the model.
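
Two common measures of goodness of fit, the coefficient of determination (R²) and the root-mean-square error (RMSE), can be computed as in the sketch below. The chapter does not prescribe particular measures, so these two are an assumption, and the observed and predicted values are made up.

```python
import numpy as np


def r_squared(y_obs, y_pred) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_obs, y_pred) -> float:
    """Root-mean-square error between observed and predicted values."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_obs - y_pred) ** 2)))


# Hypothetical observed and model-predicted activities
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]

fit_r2 = r_squared(observed, predicted)   # 0.8
fit_rmse = rmse(observed, predicted)      # 0.5
```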

10.5. The different stages of advancement of quantitative structure–activity relationship

1. One-dimensional QSAR: This was the first type of QSAR model to be developed. It correlates activity with pKa (dissociation constant) and log P (partition coefficient), taking into account the overall structure of the molecule.
2. Two-dimensional QSAR: The biological activity is correlated with the overall structural pattern of the drug molecules, considering the entire structure of the molecule in two-dimensional space. Various structural parameters are calculated and correlated with the biological activity, for example, the number of hydrogen bonds, molecular refractivity, topological indices, and dipole moment.
3. Three-dimensional QSAR: The biological activity is correlated with the three-dimensional structure of the molecule and its properties, considering the molecule in three-dimensional space. Parameters such as steric hindrance, H-bond acceptors, H-bond donors, and hydrophobic interactions are part of three-dimensional QSAR.
4. Four-dimensional QSAR: This is the same as three-dimensional QSAR with the addition of multiple representations of ligand conformations. It considers the different conformations the ligand can adopt in space and how the three-dimensional parameters change with those conformations. Different QSAR models are developed from the resulting changes in parameter values.
5. Five-dimensional QSAR: This is the same as four-dimensional QSAR with the addition of multiple representations of ligands in docked complexes. It considers ligand–receptor binding and the different conformations of the ligand in the three-dimensional space of the docked complex, so the receptor binding interactions of the ligand are now included. The different conformations are based on changes in the docked ligand–receptor complexes.
6. Six-dimensional QSAR: This is the same as five-dimensional QSAR with the addition of multiple representations from molecular dynamics studies of the receptor–ligand complexes. Along with the different ligand conformations in the receptor–ligand complexes, it also considers changes in the stability of the complex during molecular dynamics simulations. The energies calculated for different ligand conformations at different time intervals form the basis of this QSAR.

10.5.1. Steps and strategies for quantitative structure–activity relationship

The QSAR modeling process consists of five main steps (Ekins, 2007):

1. Selection of the molecules to be used: The dataset is prepared; it consists of the set of molecules on which the QSAR model is to be built.
2. Selection of descriptors, that is, numerical representations of molecular features (e.g., number of carbons): Various parameters of the dataset molecules are generated that can be correlated with their biological activity.
3. Reduction of the original descriptor pool: The generated descriptors are screened to keep only those that are relevant to, and directly linked with, the biological activity.
4. Model building: Using statistical methods, a mathematical model is built that correlates the screened descriptors with the biological activity.
5. Testing the reliability of the model: The predictive capacity of the model is checked on a set of test compounds.
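
The five steps above can be sketched end to end on synthetic data. Everything in this example is hypothetical: the descriptor matrix is random, the "activity" is generated from two of the descriptors, the reduction rule (drop near-constant and highly inter-correlated descriptors) is one simple choice among many, and multiple linear regression stands in for whichever statistical method is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2: hypothetical dataset of 40 molecules with 5 descriptors
n = 40
X = rng.normal(size=(n, 3))
X = np.column_stack([X, np.ones(n), X[:, 0]])  # add a constant and a duplicate column
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)  # synthetic activity


# Step 3: reduce the descriptor pool
def reduce_descriptors(X, var_tol=1e-8, corr_max=0.95):
    """Keep descriptors that vary and are not highly correlated with one already kept."""
    keep = []
    for j in range(X.shape[1]):
        if X[:, j].var() < var_tol:
            continue  # constant descriptor carries no information
        if any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > corr_max for k in keep):
            continue  # redundant with a descriptor that was already kept
        keep.append(j)
    return keep


kept = reduce_descriptors(X)

# Step 4: build a multiple linear regression model on the training molecules
train, test = np.arange(30), np.arange(30, 40)
A_train = np.column_stack([np.ones(len(train)), X[train][:, kept]])
coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

# Step 5: check predictions on the held-out test molecules
A_test = np.column_stack([np.ones(len(test)), X[test][:, kept]])
y_pred = A_test @ coef
r2_test = 1.0 - np.sum((y[test] - y_pred) ** 2) / np.sum((y[test] - y[test].mean()) ** 2)
```

Holding out the test molecules before any fitting keeps step 5 an honest check of predictive ability, not a restatement of the fit.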

10.6. Molecular descriptors

Molecular descriptors are a numerical representation of the chemical information contained within a molecule. This numerical representation must be invariant to the molecule's size and number of atoms so that a model can be built with statistical methods (Tropsha et al., 2003). The three major types of parameters and their associated descriptors are given in Fig. 10.3. The information carried by molecular descriptors depends on two main factors:

1. The molecular representation of compounds.
2. The algorithm used for the calculation of the descriptor.

Fig. 10.3. Molecular descriptors for quantitative structure–activity relationship.
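
To illustrate how a descriptor depends on both the molecular representation and the algorithm applied to it, the sketch below derives a few simple constitutional descriptors from the plainest representation, a molecular formula string. The helper names and the small atomic-mass table are assumptions for this example; real studies would normally use a dedicated cheminformatics toolkit and richer representations.

```python
import re

# Standard atomic masses (g/mol) for the few elements handled in this sketch
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def atom_counts(formula: str) -> dict:
    """Parse a simple formula such as 'C2H6O' (no brackets) into element counts."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def constitutional_descriptors(formula: str) -> dict:
    """Compute carbon count, heteroatom count, and molecular weight."""
    counts = atom_counts(formula)
    return {
        "n_carbon": counts.get("C", 0),
        "n_heteroatoms": sum(n for s, n in counts.items() if s not in ("C", "H")),
        "mol_weight": sum(ATOMIC_MASS[s] * n for s, n in counts.items()),
    }


ethanol = constitutional_descriptors("C2H6O")
```

A formula cannot distinguish isomers (ethanol and dimethyl ether share C2H6O), which is why two- and three-dimensional representations are needed for most descriptor types.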

10.7. Methods of quantitative structure–activity relationship

A wide range of approaches to QSAR has been developed since Hansch's seminal work. QSAR methods can be analyzed from two viewpoints (Gramatica, 2007):

1. The types of structural parameters used to characterize molecular identities, which derive from the various representations of molecules, from simple chemical formulas to 3D conformations.
2. The mathematical procedure used to obtain the quantitative relationship between these structural parameters and biological activity. The figure explains the procedure used for any general type of QSAR. Structures are broken down to establish their relevant descriptor properties. With the help of various mathematical analysis tools, the data are processed to establish a mathematical QSAR model that correlates with the biological activity. The model is validated by various validation methods and tested for external predictions. Finally, a robust QSAR model is established that accounts for the parameters relevant to the biological activity of the given series of compounds.

An ideal drug candidate is required to have distinct properties, namely chemical properties, solubility, enzymatic stability, permeation across biological membranes, low clearance by the liver or kidney, potency, and safety. Out of the many available descriptors, selecting the key molecular descriptors is the main challenge in QSAR. Hence, to make the QSAR model understandable, reduce overfitting, speed up training, and improve overall model predictivity, the choice of suitable and interpretable descriptors is a crucial step.

10.8. Data analysis methods

10.8.1. Free-Wilson analysis

Free-Wilson analysis is a structure–activity analysis technique that considers the contribution of different structural fragments to the overall biological activity. Indicator variables denote the presence or absence of a specific structural feature in a molecule. The mathematical model applies a symmetry equation to limit linear dependency between variables (Fig. 10.4) (Puzyn & Leszczynski, 2012).

Fig. 10.4. Data analysis methods.
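
The indicator-variable idea behind Free-Wilson analysis can be sketched as an ordinary least-squares fit. Note the swap: instead of the symmetry equation mentioned above, this example avoids linear dependency by using an unsubstituted parent compound as the reference (the Fujita-Ban style of the method). The substituents, contribution values, and activities are all made up.

```python
import numpy as np

substituents = ["Cl", "CH3", "OH"]

# Rows are molecules, columns are indicator variables (1 = substituent present)
X = np.array([
    [0, 0, 0],  # unsubstituted parent compound (reference)
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
], dtype=float)

# Made-up "true" values used only to generate additive example activities
true_parent, true_contrib = 5.0, np.array([0.8, 0.3, -0.4])
activity = true_parent + X @ true_contrib

# Fit: activity = parent activity + sum of substituent contributions
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, activity, rcond=None)

parent_activity = float(coef[0])
contributions = dict(zip(substituents, coef[1:]))
```

Because the example activities are exactly additive, the fit recovers the generating values; with measured data the residuals indicate how well the additivity assumption holds.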

10.8.2. Statistical methods

Statistical methods provide the basis for the development of QSAR analysis. Multivariate analysis, data description, classification, and regression analysis are applied for the interpretation and theoretical prediction of the biological properties of new compounds (Puzyn & Leszczynski, 2012).

10.8.3. Discriminant analysis

Discriminant analysis is used to separate molecules into their constituent classes. It finds a linear combination of factors that best discriminates between the different classes. This method is used in place of multiple linear regression when the biological activity data are not on a continuous scale but are instead classified as active and inactive (Puzyn & Leszczynski, 2012). It is used to characterize a quantitative relationship between molecular descriptors and the biological property.
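
A minimal sketch of the idea, using Fisher's two-class linear discriminant on synthetic data: two hypothetical descriptors are drawn for "inactive" and "active" compounds, a discriminating direction is computed from the class means and the pooled within-class scatter, and compounds are classified by which side of the midpoint they fall on. The data and function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-descriptor data for inactive and active compounds
inactive = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))
active = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(30, 2))


def fit_fisher_lda(x0, x1):
    """Return the discriminant direction w and a midpoint decision threshold."""
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    # Pooled within-class scatter matrix
    s_within = (np.cov(x0, rowvar=False) * (len(x0) - 1)
                + np.cov(x1, rowvar=False) * (len(x1) - 1))
    w = np.linalg.solve(s_within, m1 - m0)
    threshold = w @ (m0 + m1) / 2.0
    return w, threshold


def predict_active(x, w, threshold):
    """Classify compounds as active when their projection exceeds the threshold."""
    return x @ w > threshold


w, threshold = fit_fisher_lda(inactive, active)
X_all = np.vstack([inactive, active])
labels = np.array([False] * len(inactive) + [True] * len(active))
accuracy = float(np.mean(predict_active(X_all, w, threshold) == labels))
```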

10.8.4. Cluster analysis

Clustering is the process of dividing a set of objects into groups so that each cluster contains highly similar objects, and objects in one cluster are dissimilar to objects in other clusters. When cluster analysis is applied to a compound dataset, the number of clusters provides information about the number of structural types present in the set. A diverse subset of compounds can be assembled by taking one or more compounds from each cluster (Puzyn & Leszczynski, 2012), so clustering is used to sample diverse subsets of compounds from a larger compound dataset. Hierarchical clustering, k-means clustering, and nonhierarchical clustering are the methods used for compound clustering.
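
The diverse-subset idea can be sketched with a small k-means implementation: compounds are clustered in descriptor space and the compound nearest each cluster center is taken as that cluster's representative. The data are synthetic, and the farthest-first initialization is an assumption chosen only to keep the example deterministic.

```python
import numpy as np


def kmeans(X, k, n_iter=50):
    """Plain k-means (Lloyd's algorithm) with farthest-first initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old center if a cluster happens to be empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


rng = np.random.default_rng(2)
# Thirty hypothetical compounds in a 2-descriptor space, forming three groups
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.3, size=(10, 2)) for c in blob_centers])

labels, centers = kmeans(X, k=3)

# One representative per cluster: the member closest to the cluster center
representatives = [
    int(np.argmin(np.where(labels == j, np.linalg.norm(X - centers[j], axis=1), np.inf)))
    for j in range(3)
]
```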

10.8.5. Principal component analysis

The number of variables used to describe an object is called its dimensionality. PCA is used to reduce the dimensionality of a dataset when a significant correlation exists among some or all of the variables (descriptors). PCA provides information about the most significant principal components, which represent most of the information in the independent variables.
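
A minimal sketch of PCA via singular value decomposition is given below. The three descriptors are synthetic and deliberately built from one shared latent variable, so that a single principal component captures nearly all of the variance. Descriptors are only mean-centered here; scaling them to unit variance first is common practice but is left out for brevity.

```python
import numpy as np


def pca(X, n_components):
    """Return scores, loadings, and explained-variance ratios of the leading components."""
    Xc = X - X.mean(axis=0)                      # mean-center each descriptor
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)          # fraction of variance per component
    scores = Xc @ Vt[:n_components].T            # coordinates in the reduced space
    return scores, Vt[:n_components], explained[:n_components]


rng = np.random.default_rng(3)
latent = rng.normal(size=50)

# Three hypothetical descriptors that are strongly correlated through `latent`
X = np.column_stack([latent, 2.0 * latent, -latent]) + rng.normal(scale=0.05, size=(50, 3))

scores, loadings, explained = pca(X, n_components=1)
```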

10.8.6. Quantum mechanical methods

Quantum mechanical methods are used to obtain accurate molecular properties such as electrostatic potentials or polarizabilities, ionization potentials or electron affinities, etc. This approach is applied to QSAR by directly deriving electronic descriptors from the molecular wave function.

10.9. Quantitative structure–activity relationship model validation

After a QSAR model is developed, it must be validated for its accuracy and predictivity as well as its precision (Fig. 10.5) (Veerasamy et al., 2011).

Fig. 10.5. Model validation.
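
One widely used internal validation statistic is the leave-one-out cross-validated Q². The sketch below computes it for a multiple linear regression model on synthetic data. It is an illustration of the general technique and is not taken from the figure; the Q² definition used here measures prediction error against the overall mean of the activities, which is one of several conventions.

```python
import numpy as np


def fit_mlr(X, y):
    """Least-squares fit of y = b0 + X @ b."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def q2_loo(X, y):
    """Leave-one-out Q2 = 1 - PRESS / total sum of squares."""
    press = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i             # leave molecule i out
        coef = fit_mlr(X[mask], y[mask])
        press += (y[i] - predict_mlr(coef, X[i:i + 1])[0]) ** 2
    return float(1.0 - press / np.sum((y - y.mean()) ** 2))


rng = np.random.default_rng(4)
X = rng.normal(size=(25, 2))                      # hypothetical descriptors
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.2, size=25)

y_fit = predict_mlr(fit_mlr(X, y), X)
r2 = float(1.0 - np.sum((y - y_fit) ** 2) / np.sum((y - y.mean()) ** 2))
q2 = q2_loo(X, y)
```

A Q² far below the fitted R² is a common warning sign of overfitting; external validation on compounds never used in model building remains the stronger test.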

After model validation, the applicability domain of the model needs to be checked, and the outliers identified during model building are removed (Gramatica, 2007) (Fig. 10.6).

Fig. 10.6. Model applicability domain.
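
One common way to check an applicability domain is the leverage approach, sketched below under stated assumptions: the leverage of a compound is computed from the training descriptor matrix (with an intercept column), and compounds whose leverage exceeds the warning value h* = 3(p + 1)/n are treated as outside the domain, where p is the number of descriptors and n the number of training compounds. The data are synthetic, and the chapter's figure may describe a different procedure.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage h = x^T (A^T A)^-1 x of each query compound, A = training matrix."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_query = np.column_stack([np.ones(len(X_query)), X_query])
    xtx_inv = np.linalg.inv(A_train.T @ A_train)
    return np.einsum("ij,jk,ik->i", A_query, xtx_inv, A_query)


rng = np.random.default_rng(5)
X_train = rng.normal(size=(30, 2))               # hypothetical training descriptors

# Warning leverage: 3 * (number of descriptors + 1) / number of training compounds
h_star = 3 * (X_train.shape[1] + 1) / len(X_train)

# Two hypothetical new compounds: one typical, one far from the training space
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
h_new = leverages(X_train, X_new)
inside_domain = h_new <= h_star
```

Predictions for compounds flagged as outside the domain are extrapolations and should be treated with caution.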

10.10. Quantitative structure–activity relationship and Coronavirus disease-2019

The outbreak of COVID-19 has had a negative impact on people's daily lives. It has threatened their health physically, mentally, and psychologically, and has hampered social and economic development. People in isolation experience many depressive symptoms for numerous reasons, of which the lack of physical activity and fear are the most common. Scientists and researchers are racing to find a way forward and to discover vaccines or drugs against COVID-19. Nevertheless, no specific drug has been reported, because developing an effective and reliable drug requires many years of research and clinical trials. Consequently, drug repositioning has been adopted by a majority of researchers worldwide to seek an effective treatment in a short time (Tandon et al., 2019) (Fig. 10.7).

Fig. 10.7. Quantitative structure–activity relationship and drug design.

Various studies, including docking analysis, molecular modeling, and simulations, have been carried out to develop new drugs against COVID-19. Many researchers are focusing on the repurposing of drugs as a potential treatment. To that end, various computational techniques have been used to assist the development of molecules, and several QSAR studies aimed at developing leads and hits for COVID-19 have been reported. Some of these studies are described below.

Sulfonamides are biologically active compounds of fundamental importance. Numerous sulfonamide drugs are on the market for treating diseases of various kinds. Sulfonamide derivatives such as methazolamide, dichlorphenamide, ethoxzolamide, acetazolamide, and dorzolamide have been used clinically for many years as inhibitors of the zinc enzyme carbonic anhydrase. Because of their affordability and low cost, they are heavily used as veterinary antimicrobials in many parts of the world, particularly in Asia, some regions of Europe, and many developing countries. Sulfonamide derivatives are an important moiety in a wide range of bioactive molecules and drugs, such as antibacterial, anticancer, antitumor, and antimalarial agents. Owing to the global health crisis and the lack of an effective cure, many countries turned to the antimalarial drug chloroquine for the treatment of COVID-19. It has therefore become important to try to find new drugs that are more reliable and effective than chloroquine and free of harmful side effects. To that end, a set of 18 carboxamide sulfonamide analogs with antimalarial activity was studied using both CoMFA and CoMSIA, which are types of three-dimensional QSAR modeling. In addition, molecular docking simulation was carried out to investigate the binding between the SARS-CoV-2 main protease and the carboxamide sulfonamide compounds. In this study, the antimalarial activities and chemical structures of the 18 carboxamide sulfonamide derivatives were taken from the literature. The three-dimensional QSAR study was conducted by splitting the database into two datasets: a training set of 14 molecules to develop the quantitative model and a test set of four compounds to confirm the predictive ability of that model (Khaldan et al., 2021).
The following figure shows the SAR established with the help of the developed QSAR model (Fig. 10.8).

Fig. 10.8. Structure–activity relationship (SAR) derived from the quantitative structure–activity relationship model.

With the aim of finding new potent drugs against COVID-19, three-dimensional QSAR and molecular docking studies were applied to a series of 18 carboxamide sulfonamide derivatives. The optimal CoMFA and CoMSIA models showed good statistical results in terms of several rigorous statistical criteria, such as Q2, R2, and R2test; these models can therefore be used to predict new molecules with significant activity. The contour maps produced by the CoMFA and CoMSIA models reveal the key sites where steric, electrostatic, and hydrophobic interactions may significantly affect (increase or decrease) the activity of the molecules. These contour maps guided the proposal of eight molecules with significant inhibitory activity (Ivanov et al., 2020).

In another study, researchers curated more than 1000 inhibitors with structure–bioactivity data as training molecules for the 3CLpro and RdRp protein targets. They gathered these data from the most recent SARS-CoV-2 bioassay studies as well as from existing studies of SARS-CoV-1, MERS-CoV, and other related viruses in the CAS data collection. Using these data, they applied a variety of machine learning algorithms to build several dozen QSAR models and selected from among them the best-performing models, one targeting 3CLpro and one targeting RdRp ( Amin et al., 2020 ).

The resulting models were used to screen 1087 FDA-approved drugs, almost 50,000 substances from the CAS COVID-19 Antiviral Candidate Compounds Dataset, and a list of 113,000 substances with a CAS-assigned pharmacological activity or therapeutic role indexed in SARS-, MERS-, and COVID-19-related documents published since 2003. Some of the molecules predicted by these models were validated by published bioassay studies and clinical trials, a positive indication of the models' predictive power. The model was then also applied to the CAS COVID-19 Antiviral Candidate Compounds Dataset, which contains 49,437 compounds with potential antiviral activity identified by CAS scientists. The model predicted that 970 of these compounds are likely to be active against the 3CLpro of the coronavirus. From each of these applications, a few molecules with the highest inhibition probability were selected. As expected, the model identified some well-known HIV-1 protease inhibitors (ritonavir and lopinavir) and identified substances (RNs 2243743-58-8, 1934276-50-2, and 2229818-46-4) that target 3C protease/3CLpro and were shown to inhibit Enterovirus, MERS-CoV, and SARS-CoV-1 when tested in bioassays. These could represent new lead candidates as therapeutic agents for COVID-19 or other viral diseases. The model also identified substances acting against host proteins involved in cellular processes, including diltiazem hydrochloride and leflunomide. Leflunomide is a dihydroorotate dehydrogenase inhibitor and is associated with nucleotide synthesis ( Rafi et al., 2020 ).

The study design comprised two major aspects: (1) ligand-based approaches: (A) classification QSAR-based data mining of diverse SARS-CoV papain-like protease (PLpro) inhibitors and (B) QSAR-based virtual screening (VS) to identify in-house molecules that could be effective against the putative target SARS-CoV PLpro; and (2) structure-based approaches: final validation of the hits through receptor–ligand interaction analysis. This study thus introduced key concepts and set the stage for molecule identification and QSAR-based screening of in-house molecules active against the putative SARS-CoV-2 PLpro enzyme. A classification-based QSAR model was developed that could be used as a tool for predicting new molecules and/or for VS. The model, built by Monte Carlo optimization-based QSAR, was followed by VS of some in-house chemical compounds. ADME data-driven screening was then performed with SwissADME, which identified compounds with good drug-likeness. Finally, molecular docking analysis of the QSAR-derived virtual hits was performed to increase confidence in the final predictions. The molecular docking study performed against the putative target SARS-CoV-2 PLpro supported the potential of the investigated in-house molecules. It can therefore be concluded that the in-house molecules could be used as seeds for drug design and optimization against SARS-CoV-2 PLpro. After extensive in vitro and in vivo studies, these in-house VS hits might emerge as therapeutic options for COVID-19. This study may also motivate medicinal chemists to design similar types of compounds in the hope of achieving biological potency and efficacy without accompanying toxicities ( Płonka et al., 2020 , Tejera et al., 2020 ).

10.11. Conclusion

COVID-19 has caused havoc throughout the world. Scientists and researchers are immersed in developing vaccines and medicines against the virus. Various techniques, such as drug repurposing and high-throughput screening, are used to develop medicines for the immediate treatment of SARS-CoV-2 infection. QSAR is a computational methodology that has long been used for screening molecules by developing mathematical models that predict the activity of untested lead compounds. The same technique has been used to develop mathematical models that identify hits for the treatment of patients suffering from COVID-19. This gives hope that more molecules can be developed against the pandemic using computational techniques.

Amin S.A., Ghosh K., Gayen S., Jha T. Chemical-informatics approach to COVID-19 drug discovery: Monte Carlo based QSAR, virtual screening and molecular docking study of some in-house molecules as papain-like protease (PLpro) inhibitors. Journal of Biomolecular Structure and Dynamics. 2020; 39:4764–4773. doi: 10.1080/07391102.2020.1780946.
Aptula A.O., Roberts D.W. Mechanistic applicability domains for nonanimal-based prediction of toxicological end points: General principles and application to reactive toxicity. Chemical Research in Toxicology. 2006; 19(8):1097–1105. doi: 10.1021/tx0601004.
Ekins S. Computational toxicology: Risk assessment for pharmaceutical and environmental chemicals. Wiley-Interscience; New Jersey: 2007.
Gajewicz A., Rasulev B., Dinadayalane T.C., Urbaszek P., Puzyn T., Leszczynska D., Leszczynski J. Advancing risk assessment of engineered nanomaterials: Application of computational approaches. Advanced Drug Delivery Reviews. 2012; 64(15):1663–1693. doi: 10.1016/j.addr.2012.05.014.
Gramatica P. Principles of QSAR models validation: Internal and external. QSAR and Combinatorial Science. 2007; 26(5):694–701. doi: 10.1002/qsar.200610151.
Ivanov J., Polshakov D., Kato-Weinstein J., Zhou Q., Li Y., Granet R., Garner L., Deng Y., Liu C., Albaiu D., Wilson J., Aultman C. Quantitative structure–activity relationship machine learning models and their applications for identifying viral 3CLpro- and RdRp-targeting compounds as potential therapeutics for COVID-19 and related viral infections. ACS Omega. 2020; 5(42):27344–27358. doi: 10.1021/acsomega.0c03682.
Khaldan A., Bouamrane S., En-Nahli F., El-mernissi R., El khatabi K., Hmamouchi R., Maghat H., Ajana M.A., Sbai A., Bouachrine M., Lakhlifi T. Prediction of potential inhibitors of SARS-CoV-2 using 3D-QSAR, molecular docking modeling and ADMET properties. Heliyon. 2021; 7(3):e06603. doi: 10.1016/j.heliyon.2021.e06603.
OECD. Guidance document on the validation of (quantitative) structure–activity relationships models. Organisation for Economic Co-operation and Development; 2007.
Płonka W., Paneth A., Paneth P. Docking and QSAR of aminothioureas at the SARS-CoV-2 S-protein-human ACE2 receptor interface. Molecules. 2020; 25(20):4645. doi: 10.3390/molecules25204645.
Puzyn T., Leszczynski J. Towards efficient designing of safe nanomaterials: Innovative merge of computational approaches and experimental techniques. The Royal Society of Chemistry; 2012.
Rafi M.O., Bhattacharje G., Al-Khafaji K., Taskin-Tok T., Alfasane M.A., Das A.K., Parvez M.A.K., Rahman M.S. Combination of QSAR, molecular docking, molecular dynamic simulation and MM-PBSA: analogues of lopinavir and favipiravir as potential drug candidates against COVID-19. Journal of Biomolecular Structure and Dynamics. 2020:1–20. doi: 10.1080/07391102.2020.1850355.
Tandon H., Chakraborty T., Suhag V. A concise review on the significance of QSAR in drug design. Chemical and Biomolecular Engineering. 2019; 45. doi: 10.11648/j.cbe.20190404.11.
Tejera E., Munteanu C.R., López-Cortés A., Cabrera-Andrade A., Pérez-Castillo Y. Drugs repurposing using QSAR, docking and molecular dynamics for possible inhibitors of the SARS-CoV-2 Mpro protease. Molecules. 2020; 25(21):5172. doi: 10.3390/molecules25215172.
Todeschini R., Consonni V. Handbook of molecular descriptors. Wiley-VCH; Weinheim: 2000.
Tropsha A., Gramatica P., Gombar V.K. The importance of being earnest: Validation is the absolute essential for successful application and interpretation of QSPR models. QSAR and Combinatorial Science. 2003; 22(1):69–77. doi: 10.1002/qsar.200390007.
Veerasamy R., Harish J., Avijeet S., Shalini, Christapher P., Agrawal Validation of QSAR models - Strategies and importance. International Journal of Drug Design and Discovery. 2011; 2:511–519.

Megavariate analysis of environmental QSAR data. Part I – A basic framework founded on principal component analysis (PCA), partial least squares (PLS), and statistical molecular design (SMD)

Full-length paper
Published: 13 June 2006
Volume 10 , pages 169–186, ( 2006 )

Lennart Eriksson 1 ,
Patrik L. Andersson 2 ,
Erik Johansson 1 &
Mats Tysklind 2

This paper introduces principal component analysis (PCA), partial least squares projections to latent structures (PLS), and statistical molecular design (SMD) as useful tools in deriving multi- and megavariate quantitative structure-activity relationship (QSAR) models. Two QSAR data sets from the fields of environmental toxicology and environmental chemistry are worked out in detail, showing the benefits of PCA, PLS and SMD. PCA is useful when overviewing a data set and exploring relationships among compounds and relationships among variables. PLS is the regression extension of PCA and is used for establishing QSARs. SMD is essential for selecting informative training and test sets of compounds for QSAR calibration and validation.
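As a rough illustration of the overview role PCA plays here, the sketch below extracts the first principal component of a small descriptor table by power iteration on the covariance matrix. It is a simplified stand-in, not the procedure of the paper: the data are hypothetical, only mean-centering is applied (chemometric practice often also scales each variable to unit variance), and the function names are illustrative.

```python
import math


def center_columns(rows):
    """Subtract the column mean from every entry."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) of the first PC via power iteration."""
    xc = center_columns(rows)
    n, p = len(xc), len(xc[0])
    # Sample covariance matrix of the centered descriptors.
    cov = [[sum(xc[i][a] * xc[i][b] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    # Start vector; it must not be orthogonal to the dominant eigenvector.
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            raise ValueError("no variance along the start direction")
        v = [c / norm for c in w]
    # Scores are the projections of the centered compounds onto the loadings.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in xc]
    return v, scores


if __name__ == "__main__":
    # Hypothetical table: three compounds (rows) by two descriptors (columns).
    table = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    loadings, scores = first_principal_component(table)
    print("loadings:", loadings)
    print("scores:", scores)
```

Plotting scores against each other gives the overview of relationships among compounds, while the loadings describe relationships among variables.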

Abbreviations

canonical correlation

D-optimal onion design

factor analysis

factorial design

fractional factorial design

highest occupied molecular orbital

linear discriminant analysis

lowest unoccupied molecular orbital

multiple linear regression

neural networks

principal component analysis

polychlorinated biphenyls

principal component regression

partial least squares projections to latent structures

PLS discriminant analysis

quantitative structure-activity relationships

root mean square error of estimation

root mean square error of prediction

ridge regression

soft independent modelling of class analogy

square root

statistical molecular design

support vector machines

Dunn, III, W.J., Quantitative Structure-Activity Relationships (QSAR) , Chemometrics and Intelligent Laboratory Systems, 6 (1989) 181–190.

Eriksson, L. and Johansson, E., Multivariate design and modelling in QSAR , Chemom. Intell. Lab. Syst., 34 (1996) 1–19.

Article CAS Google Scholar

Eriksson, L., Jaworska, J., Worth, A.P., Cronin, M.T.D., McDowell, R.M. and Gramatica, P., Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs , Environmental Health Perspectives, 111 (2003) 1361–1375.

Einax, J., Chemometrics in Environmental Chemistry. Springer-Verlag, Berlin, 1995, ISBN 3-540-58943-0.

Eriksson, L., Andersson, P.L., Johansson, E. and Tysklind, M., Megavariate analysis of environmental QSAR data. Part II - Investigating very complex problem formulations using hierarchical, non-linear and batch-wise extensions of PCA and PLS. This issue.

Jackson, J.E., A Userś Guide to Principal Components. John Wiley & Sons, Inc., New York, 1991.

Wold, S., Albano, C., Dunn, III, W.J., Edlund, U., Esbensen, K., Geladi, P., Hellberg, S., Johansson, E., Lindberg, W. and Sjöström, M., Multivariate Data Analysis in Chemistry , In Kowalski, B. (Ed.), Chemometrics: Mathematics and Statistics in Chemistry, NATO ISI Series C 138, Reidel, Dordrecht, pp. 2–78, 1984.

Flåten, G.R., Botnen, H., Grung, B. and Kvalheim, O.M., Assigning environmental variables to observed biological changes , Analytical and Bioanalytical Chemistry, 380 (2004) 453–466.

Sjöström, M., Wold, S., Söderström, M., PLS Discriminant Plots , In Gelsema, E.S. and Kanal, L.N. (Eds.), Pattern Recognition in Practice II, Elsevier Science Publishers, North-Holland, pp. 461–470, 1986.

Nouwen, J., Lindgren, F., Hansen, B., Karcher, W., Verhaar, H.J.M and Hermens, J.L.M., Classification of environmentally occurring chemicals using structural fragments and PLS discriminant analysis , Environmental Science and Technology, 31 (1997) 2313–2318.

Frank, I.E. and Friedman, J.H., A statistical view of some chemometrics regression tools , Technometrics, 35 (1993) 109–148.

Article Google Scholar

Eriksson, L., Hermens, J.L.M., Johansson, E., Verhaar, H.J.M. and Wold, S., Multivariate analysis of aquatic toxicity data with PLS , Aquatic Sciences, 57 (1995) 217–241.

Höskuldsson, A., Prediction Methods in Science and Technology - Volume 1 Basic Theory, Thor Publishing, Copenhagen, 1996.

Google Scholar

Eriksson, L., Johansson, E., Kettaneh-Wold, N., Wold, S., Multi- and Megavariate Data Analysis – Principles and Applications , Umetrics Academy, 2001. ISBN: 91–973730–1-X.

Andersson, P.L., Physico-chemical characterization and quantitative structure-activity relationships of PCBs , Ph.D. Thesis, Umeå University, Umeå, Sweden, 2000.

Tysklind, M., Andersson, P.L., Haglund, P., van Bavel, B. and Rappe, C., Selection of polychlorinated biphenyls for use in quantitative structure-activity modelling , SAR and QSAR in Env. Res., 4 (1995) 11–19.

Andersson, P.L., Haglund, P. and Tysklind, M., The internal barriers of rotation for the 209 polychlorinated biphenyls , Environ. Sci. & Pollut. Res., 4 (1997) 75–81.

Andersson, P.L., Haglund, P. and Tysklind, M., Ultraviolet Absorption Spectra of all 209 Polychlorinated Biphenyls Evaluated by Principal Component Analysis , Fresenius J. Anal. Chem., 357 (1997) 1088–1092.

Andersson, P.L., van der Burght, A.S.A.M., van den Berg, M. and Tysklind, M., Multivariate modelling of polychlorinated biphenyl-induced CYP1A Activity in hepatocytes from three different species: Ranking scales and species difference , Environmental Toxicology and Chemistry, 19 (2000) 1454–1463.

Andersson, P.L., Berg, A.H., Bjerselius, R., Norrgren, L., Olsén, H., Olsson, P.E., Örn, S. and Tysklind, M., Bioaccumulation of selected PCBs in zebra fish, three-spined stickleback and Arctic char after three different routes of exposure , Arch. Environ. Contam. Toxicol, 40 (2001) 519–530.

Eriksson, L., Andersson, P.L., Johansson, E. and Tysklind, M., Multivariate biological profiling and principal toxicity regions of compounds: The PCB case study , Journal of Chemometrics, 16 (2002) 497–509.

Eriksson, L., Johansson, E., Lindgren, F., Sjöström, M. and Wold, S., Megavariate analysis of hierarchical QSAR data , Journal of Computer-Aided Molecular Design, 16 (2002) 711–726.

Pirselova, K., Balaz, S., Ujhelyova, R., Sturdik, E., Veverka, M., Uher, M. and Brtko, J., Quantitative structure-time-activity relationships (QSTAR): Part I - growth inhibition of escherichia coli by nonionizable kojic acid derivatives , Quantitative Structure-Activity Relationships, 15 (1996) 87–93.

Pirselova, K., Balaz, S., Sturdik, E., Ujhelyova, R., Veverka, M., Uher, M. and Brtko, J., Quantitative structure-time-activity relationships (QSTAR): Part II - growth inhibition of escherichia coli by ionizable and nonionizable kojic acid derivatives , Quantitative Structure-Activity Relationships, 16 (1997) 283–289.

Oprea, T.I. and Gottfries, J., Toward minimalistic modelling of oral drug absorption , J. Mol. Graph. Mod., 17 (1999) 261–274.

Oprea, T.I. and Gottfries, J., Chemography: The art of navigating in chemicals space , J. Comb. Chem., 3 (2001) 157–166.

Oprea, T.I., Gottfries, J., Sherbukhin, V., Svensson, P. and Kühler, T.C., Chemical information management in drug discovery: Optimizing the computational and combinatorial chemistry interfaces , Journal of Molecular Graphics and Modelling, 18 (2000) 512–524.

Raevsky, O.A. and Skvortsov, V.S., 3D Hydrogen bond thermodynamics (HYBOT) potentials in molecular modelling , Journal of Computer-Aided Molecular Design, 16 (2002) 1–10.

Eriksson, L., Gottfries, J., Johansson, E. and Wold, S., Time-resolved QSAR: an approach to PLS modelling of three-way biological data , Chemometrics and Intelligent Laboratory Systems, 73 (2004) 73–84.

Wold, S., Cross validatory estimation of the number of components in factor and principal component models , Technometrics, 20 (1978) 397–405.

Hellberg, S., A Multivariate Approach to QSAR , PhD Thesis, Umeå University, Umeå, Sweden, 1986.

Lundstedt, T., A QSAR strategy for screening of drugs and predicting their clinical activity , Drug News Persp., 4 (1991) 468–475.

Wu, J., Hammarström, L.G., Claesson, O. and Fängmark, I.E., Modelling the influence of physico-chemical properties of volatile organic compounds on activated carbon adsorption capacity , Carbon, 41 (2003) 1309–1328.

Carlson, R. and Carlson, J.E., Design and Optimization in Organic Synthesis. Second revised and enlarged edition , Elsevier, 2005.

Winiwarter, S., Bonham, N.M., Ax, F., Hallberg, A., Lennernäs, H. and Karlén, A., Correlation of human jejunal permeability (in vivo) of drugs with experimentally and theoretically derived parameters – A multivariate data analysis approach , J. Med. Chem., 41 (1998) 4939–4949.

Linusson, A., Gottfries, J., Lindgren, F. and Wold, S., Statistical molecular design of building blocks for combinatorial chemistry , Journal of Medicinal Chemistry, 43 (2000) 1320–1328.

Giraud, E., Luttmann, C., Lavelle, F., Riou, J.F., Mailliet, P. and Laoui, A., Multivariate data analysis using D-optimal designs, partial least squares, and response surface modelling, A directional approach for the analysis of farnesyltransferase inhibitors , Journal of Medicinal Chemistry, 43 (2000) 1807–1816.

Eriksson, L., Arnhold, T., Beck, B., Fox, T., Johansson, E. and Kriegl, J.M., Onion design and its application to a pharmaceutical QSAR problem , Journal of Chemometrics, 18 (2004) 188–202.

Tysklind, M., Tillitt, D., Eriksson, L., Lundgren, K. and Rappe, C., A toxic equivalency factor scale for polychlorinated dibenzofurans , Fundam.Appl. Toxicol., 22 (1994) 277–285.

Ramos, E.U., Vaes, W.H.J., Verhaar, H.J.M. and Hermens, J.L.M., Polar narcosis: Designing a suitable training set for QSAR studies , Environ. Sci. & Pollut. Res., 4 (1997) 83–90.

CAS Google Scholar

Eriksson, L. and Hermens J.L.M, A Multivariate Approach to Quantitative Structure-Activity and Structure-Property Relationships , In: J. Einax (Ed.), The Handbook of Environmental Chemistry, Vol 2H, Chemometrics in Environmental Chemistry, Springer-Verlag, Berlin, 1995, pp. 135–168.

Todeschini, R. and Consonni, V., Handbook of Molecular Descriptors , Wiley, 2000, ISBN: 3–527–29913–0.

Box, G.E.P, Hunter, W.G. and Hunter J.S., Statistics for Experimenters , John Wiley & Sons, New York, 1978.

De Aguiar, P.F., Bourguignon, B., Khots, M.S., Massart, D.L. and Phan-Than-Luu, R., D-optimal Designs , Chemom. Intell. Lab. Syst., 30 (1995) 199–210.

Olsson, I.M., Gottfries, J. and Wold, S., D-optimal onion design in statistical molecular design , Chemometrics and Intelligent Laboratory Systems, 73 (2004) 37–46.

Olsson, I.M., Gottfries, J. and Wold, S., Controlling coverage of D-optimal onion designs and selections , Journal of Chemometrics, 18 (2004) 548–557.

Baroni, M., Clementi, S., Cruciani, G., Kettaneh-Wold, N. and Wold, S., D-optimal designs in QSAR , Quant. Struct.-Act. Relat., 12 (1993) 225–231.

Wold, S. and Dunn, III, W.J., Multivariate quantitative structure-activity relationships: Conditions for their applicability , J. Chem. Inf. Comp. Sci., 23 (1983) 6–13.

Eriksson, L., Johansson E. and Wold, S., QSAR Model Validation , Proceedings of the 7th International Workshop on QSAR in Environmental Sciences, SETAC Press, Pensacola, FL, 1997.

Tropsha, A., Gramatica, P. and Gombar, V.J., The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSAR models , QSAR and combinatorial science, 22 (2003) 69–77.

Lindgren, F., Third Generation PLS – Some Elements and Applications , PhD Thesis, Umeå University, Umeå, Sweden, 1994.

Blanco, M., Coello, J., Iturriaga, H., Maspoch, S. and Pagès, J., NIR calibration in non-linear systems: Different pls approaches and artificial neural networks , Chemom. Intell. Lab. Systs., 50 (2000) 75–82.

Norinder, U., Support vector machine models in drug design: Applications to drug transport processes and QSAR using simplex optimisations and variable selection , Neurocomputing, 55 (2003) 337–346.

Wold, S., Sjöström, M. and Eriksson, L., PLS-regression: A basic tool of chemometrics , Chemometrics and Intelligent Laboratory Systems, 58 (2001) 109–130.

Kettaneh, N., Berglund, S. and Wold, S., PCA and PLS with very large data sets , Computational Statistics & Data Analysis, 48 (2005) 69–85.

Download references

Author information

Authors and affiliations.

Umetrics AB, POB 7960, S-907 19, Umeå, Sweden

Lennart Eriksson & Erik Johansson

Institute of Environmental Chemistry, Department of Chemistry, Umeå University, S-901 87, Umeå, Sweden

Patrik L. Andersson & Mats Tysklind

Corresponding author

Correspondence to Lennart Eriksson .

About this article

Eriksson, L., Andersson, P.L., Johansson, E. et al. Megavariate analysis of environmental QSAR data. Part I – A basic framework founded on principal component analysis (PCA), partial least squares (PLS), and statistical molecular design (SMD). Mol Divers 10 , 169–186 (2006). https://doi.org/10.1007/s11030-006-9024-6

Received : 06 October 2005

Accepted : 02 February 2006

Published : 13 June 2006

Issue Date : May 2006

DOI : https://doi.org/10.1007/s11030-006-9024-6

megavariate data analysis

Open access
Published: 07 June 2024

QSPR/QSAR study of antiviral drugs modeled as multigraphs by using TI’s and MLR method to treat COVID-19 disease

Ugasini Preetha P 1 ,
M. Suresh 1 ,
Fikadu Tesgera Tolasa 2 &
Ebenezer Bonyah 3

Scientific Reports volume 14 , Article number: 13150 ( 2024 ) Cite this article

Biotechnology
Mathematics and computing
Nanoscience and technology

The ongoing COVID-19 pandemic continues to pose significant challenges worldwide, despite widespread vaccination. Researchers are actively exploring antiviral treatments to assess their efficacy against emerging virus variants. The aim of this study is to employ the M-polynomial and neighborhood M-polynomial approaches together with QSPR/QSAR analysis to evaluate specific antiviral drugs, including Lopinavir, Ritonavir, Arbidol, Thalidomide, Chloroquine, Hydroxychloroquine, Theaflavin and Remdesivir. Degree-based and neighborhood degree sum-based topological indices computed on molecular multigraphs give insight into the physicochemical properties of these drugs, such as polar surface area, polarizability, surface tension, boiling point, enthalpy of vaporization, flash point, molar refraction and molar volume, which are crucial in predicting their efficacy against viruses. These properties influence the solubility, permeability, and bioavailability of the drugs, which in turn affect their ability to interact with viral targets and inhibit viral replication. In QSPR analysis, molecular multigraphs yield correlation coefficients exceeding those from simple graphs: molar refraction (MR) (0.9860), polarizability (P) (0.9861), surface tension (ST) (0.6086), and molar volume (MV) (0.9353) using degree-based indices, and flash point (FP) (0.9781) and surface tension (ST) (0.7841) using neighborhood degree sum-based indices. QSAR models, constructed through multiple linear regression (MLR) with a backward elimination approach at a significance level of 0.05, exhibit promising predictive capability for the biological activity \(IC_{50}\) (half maximal inhibitory concentration). Notably, the agreement between the observed and predicted values for Remdesivir (observed \(pIC_{50} = 6.01\), predicted \(pIC_{50} = 6.01\), where \(pIC_{50}\) is the negative logarithm of \(IC_{50}\)) underscores the accuracy of multigraph-based QSAR analysis. 
The primary objective is to showcase the valuable contribution of multigraphs to QSPR and QSAR analyses, offering crucial insights into molecular structures and antiviral properties. The integration of physicochemical applications enhances our understanding of factors influencing antiviral drug efficacy, essential for combating emerging viral strains effectively.
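The relation between \(IC_{50}\) and \(pIC_{50}\) stated in the abstract can be written as a short conversion sketch. It assumes the usual convention that \(IC_{50}\) is expressed in mol/L and that the logarithm is base 10; the abstract does not state the units, and the function names are illustrative.

```python
import math


def pic50_from_ic50(ic50_molar):
    """pIC50 = -log10(IC50), with IC50 assumed to be in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)


def ic50_from_pic50(pic50):
    """Inverse conversion, returning IC50 in mol/L."""
    return 10.0 ** (-pic50)


if __name__ == "__main__":
    # A 1 micromolar IC50 corresponds to a pIC50 of about 6.
    print(pic50_from_ic50(1e-6))
    # Under the molar-units assumption, the pIC50 of 6.01 quoted for
    # Remdesivir corresponds to an IC50 just under 1 micromolar.
    print(ic50_from_pic50(6.01))
```

Because the scale is logarithmic, a difference of one \(pIC_{50}\) unit corresponds to a tenfold change in \(IC_{50}\), and a higher \(pIC_{50}\) means a more potent compound.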

Introduction

Graph theory has seen a surge in its application to pharmacology and medicine, with chemical graph theoreticians focusing on computing topological indices of drug structures to gain insights into molecular properties and aid in drug development. SARS-CoV-2, a single-stranded RNA virus, causes COVID-19, the first major pandemic of the twenty-first century. In 2003, SARS, caused by a new coronavirus strain, led to 916 deaths globally. Similarly, COVID-19 emerged in December 2019 in Wuhan, China, and was declared a global public health emergency by the WHO in January 2020 1 . Well past the middle of 2023, the coronavirus pandemic had still not ended. As of May 12, 2024, 10:39 am CEST, the World Health Organization (WHO) had reported a global total of 775,379,864 confirmed COVID-19 cases, with 7 million recorded fatalities. For the latest statistics, refer to https://covid19.who.int/ .

Our research extends prior studies suggesting that accounting for double bonds can improve correlation results in molecular modeling. It is inspired by the observation of Kier et al. 2 in “Medicinal Chemistry: A Series of Monographs” that double-edge counts provide a more accurate representation of double bonds. Recent work by Simon et al. also indicated improved correlations for molecules with weighted Wiener indices compared to traditional Wiener indices for simple graphs, while Zakharov et al. proposed a novel approach using multigraphs for enhanced statistical QSAR model building 3 , 4 . Using these insights, we conducted a comparative analysis between simple and complex models to investigate the impact of double bonds on property estimation accuracy. Topological indices (TI’s) capture the structure-property relationships in chemical compounds, providing numerical parameters for QSPR and QSAR studies. Research on TI’s has led to the development of over 3000 indices, reflecting the structural properties of the graphs used for their calculation. Most recently, research by Sakander Hayat et al. has explored the use of temperature-based topological indices, valency-based descriptors, distance-based graphical indices, and eigenvalue-based indices to predict physicochemical and thermodynamic properties of polycyclic aromatic hydrocarbons and benzenoid hydrocarbons 5 , 6 , 7 , 8 , 9 , 10 . Recently, QSPR/QSAR analysis of antiviral, COVID-19 and anticancer drugs has been carried out using degree, reverse degree, distance and neighborhood based topological descriptors 11 , 12 , 13 , 14 , 15 , 16 . Research by Zaman et al. 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 delves into diverse applications of analytical and theoretical studies in chemistry and related fields, focusing on structural analysis, topological characterization, and mathematical modeling of various nanostructures, biochemical networks, and metal-organic models. The authors’ work explores the relationships between molecular topology, irregular molecular descriptors, and novel topological indices, offering insights into the structural properties of complex materials and nanostructures.

This article represents chemical structures using hydrogen-suppressed molecular multigraphs that include double bonds. A multigraph is a graph that may contain multiple edges, that is, more than one edge between the same pair of vertices; a loop is an edge that connects a vertex to itself 27 . Marrero Ponce 28 discusses the application of QSPR/QSAR analysis to pseudographs (graphs with loops and parallel edges), with considerations for heteroatoms using the valence delta concept 29 . This study compares multigraph and simple graph modeling approaches using topological structure descriptors to estimate physicochemical properties and biological activity through QSPR/QSAR analysis. Multiple linear regression techniques validate the correlation values, aiding in understanding the estimators and identifying potential drugs. Notably, no previous literature directly compares multigraph and simple graph efficacy in this context, making this study’s contribution novel and original.

In this study, multigraphs are employed to establish correlations between the physicochemical properties and biological activity of the antiviral drugs. Our QSAR model, utilizing multigraphs, demonstrates a stronger association between the studied biological activity \((pIC_{50})\) and the topological indices than the QSAR model proposed by Kirmani et al. 11 . The scientific literature has introduced several graph polynomials to aid in the calculation of various graph indices. Distance-based polynomials such as the Hosoya polynomial, PI polynomial, Schultz polynomial, and modified Schultz polynomial have been suggested in previous studies (see 30 , 31 , 32 ). In addition, Deutsch and Klavžar (2015) 33 developed the M-polynomial as a means to compute different degree-based TI’s.

The M-polynomial of graph \(\mathscr {G}\) is defined in the following manner:

\(M(\mathscr {G};x,y) = \sum _{j \le k} m_{jk}\, x^{j}y^{k} \qquad (1)\)

In this context, \(m_{jk}\) is the number of edges \(uv \in E(\mathscr {G})\) with \((d_u, d_v) = (j, k)\) , where \(d_u\) and \(d_v\) are the degrees of vertices u and v, respectively. The NM-polynomial, akin to the M-polynomial, is a polynomial designed specifically for neighborhood degree sum-based indices 34 . It serves a similar purpose and is defined as follows:

\(NM^{*}(\mathscr {G};x,y) = \sum _{j \le k} nm^{*}_{jk}\, x^{j}y^{k} \qquad (2)\)

Here \(nm^{*}_{jk}\) is the number of edges \(uv \in E(\mathscr {G})\) with \((nd^{*}_u, nd^{*}_v) = (j,k)\) , where \(nd^{*}_u\) and \(nd^{*}_v\) denote the neighborhood degree sums of the vertices u and v in the graph, respectively. The objective of this research is to create reliable QSPR/QSAR models that can effectively forecast the physical/chemical and biological properties of drugs targeting COVID-19. Throughout the article, the abbreviations ‘NBD’ (neighborhood degree sum-based indices) and ‘D’ (degree-based indices) are used in specific sections for convenience.

Material and method

In our study, we used algebraic polynomials to determine the topological indices of the structures of several antiviral drugs. Table 1 presents the relationships between the TI’s and the M-polynomial and NM-polynomial from which they are derived; the evaluation at x = 1 and y = 1 specified in Table 1 is established by Sandi Klavžar in 33 . Neighborhood degree sum-based topological indices, as discussed in references 35 , 36 , demonstrate a remarkable capability to predict various physicochemical properties with high accuracy. Furthermore, a parallel effort has led to the construction of several other neighborhood degree sum-based topological indices, along with their corresponding classical degree-based topological indices, as detailed in references 37 , 38 , 39 . Mondal et al. conducted a study 28 to assess the efficacy of four antiviral drugs in the treatment of COVID-19 patients, using the M-polynomial and NM-polynomial methods for the evaluation. Additionally, Kirmani et al. 11 recently developed QSPR/QSAR models that use linear and multiple linear regression to relate the TI’s of potential antiviral drugs to their physicochemical and biological properties in the context of COVID-19 treatment.

To model the antiviral activity of drugs investigated for COVID-19 treatment, a combination of ten ‘D’ and ten ‘NBD’ based TI’s, alongside eight physicochemical properties (polar surface area, polarizability, surface tension, boiling point, enthalpy of vaporization, flash point, molar refraction and molar volume), was employed. The study focused on the drugs Hydroxychloroquine, Theaflavin, Lopinavir, Ritonavir, Arbidol, Chloroquine and Remdesivir. Thalidomide was excluded from the QSAR study due to insufficient available data on its antiviral activity. Fig. 1 displays the chemical structures of these drugs, which were drawn with ChemSketch. The QSAR model uses the biological activity \(IC_{50}\) (half maximal inhibitory concentration) to predict the antiviral activity of the mentioned drugs, with multiple linear regression (MLR) as the statistical technique. \(IC_{50}\) is a widely used measure in drug development to assess the potency of potential drug candidates and compare their efficacy. It is also used in biochemical studies to understand the properties of proteins and enzymes. \(pIC_{50}\) represents the negative logarithm of \(IC_{50}\) . The physicochemical properties and biological activity data of the antiviral drugs are presented in Table 2 . The physicochemical values were sourced from ChemSpider, and the half-maximal inhibitory concentrations ( \(IC_{50}\) ) of antiviral activity were collected from the scientific literature 11 , 40 , 41 , 42 , 43 and converted to the negative logarithmic scale ( \(pIC_{50}\) ) to facilitate data analysis and interpretation.
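The conversion from \(IC_{50}\) to \(pIC_{50}\) described above can be sketched as follows. The source does not state the concentration unit of the tabulated values, so the sketch assumes the common convention that \(pIC_{50}\) is the negative base-10 logarithm of \(IC_{50}\) expressed in mol/L; the function name and the sample values are hypothetical and are not taken from Table 2.

```python
import math

# Conversion factors to mol/L (assumed convention, not stated in the source).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}

def pic50_from_ic50(ic50, unit="uM"):
    """Return pIC50 = -log10(IC50 expressed in mol/L)."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])

# Illustrative values only: a lower IC50 (more potent drug) gives a higher pIC50.
for value, unit in [(1.0, "uM"), (10.0, "nM")]:
    print(value, unit, "->", round(pic50_from_ic50(value, unit), 3))
```

Working on the logarithmic scale compresses concentrations that span several orders of magnitude into a range that is easier to model with a linear regression.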

Fig. 1. Chemical structures of ( a ) Lopinavir, ( b ) Ritonavir, ( c ) Arbidol, ( d ) Thalidomide, ( e ) Chloroquine, ( f ) Hydroxychloroquine, ( g ) Theaflavin, ( h ) Remdesivir.

Results and discussions

Computation of M-polynomial and NM-polynomial of Lopinavir

In this section, we present the main computational findings of our study. We analyzed the molecular multigraph of Lopinavir and derived its M-polynomial and NM-polynomial, as described in the theorem below. We then extended the analysis to seven additional molecular drug structures, computing the M-polynomial and NM-polynomial of each; the corresponding values can be found in Table 3 . Only the computation for Lopinavir is shown in detail. Fig. 2 shows the molecular multigraph of Lopinavir, and Fig. 3 shows the 3D plots of its M-polynomial and NM-polynomial. The differences in the surface patterns imply that the degree-based and neighborhood degree-based topological indices derived from these polynomials will also differ in their numerical values and interpretations. To determine the superiority of one index over another, further analysis is required, such as comparing their performance in QSPR/QSAR models, evaluating their correlation coefficients with experimental data, and assessing their ability to discriminate between different molecular structures.

Fig. 2. Molecular multigraph of Lopinavir.

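Let \(\mathscr {L}\) be the molecular multigraph of Lopinavir. Then we have

\(M(\mathscr {L};x,y) = 3xy^{3}+2xy^{4}+4x^{2}y^{2}+7x^{2}y^{3}+13x^{2}y^{4}+18x^{3}y^{3}+11x^{3}y^{4}+3x^{4}y^{4}\) ,

\(NM^{*}(\mathscr {L};x,y) = 2x^{3}y^{5}+x^{3}y^{6}+x^{4}y^{4}+x^{4}y^{5}+3x^{4}y^{6}+4x^{4}y^{7}+2x^{4}y^{8}+x^{5}y^{9}+x^{5}y^{10}+10x^{6}y^{6}+14x^{6}y^{7}+x^{6}y^{10}+3x^{7}y^{7}+11x^{7}y^{8}+x^{7}y^{9}+x^{7}y^{10}+3x^{8}y^{10}+x^{9}y^{10}\) .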

Consider \(\mathscr {L}\) as the molecular multigraph representing Lopinavir (refer to Fig. 2 ). It comprises a total of 61 edges. Let \(\Gamma _{(j,k)}\) denote the set of edges whose endpoints have degrees j and k, respectively, i.e. \(\Gamma _{(j,k)} = \{uv \in E(\mathscr {L}): \Delta (u) = j, \Delta (v) = k \}\) , and let \(m_{(j,k)}\) be the number of edges in \(\Gamma _{(j,k)}\) . From Fig. 2 it is clear that \(m_{(1,3)} = 3, m_{(1,4)} = 2, m_{(2,2)} = 4, m_{(2,3)} = 7, m_{(2,4)} = 13, m_{(3,3)} = 18, m_{(3,4)} = 11, m_{(4,4)} = 3\) ; these counts sum to 61. To derive the M-polynomial of \(\mathscr {L}\) , we use Eq. 1 .

By using the values of \(m_{(j,k)}\) , we get \(M(\mathscr {L};x,y) = 3xy^{3}+2xy^{4}+4x^{2}y^{2}+7x^{2}y^{3}+13x^{2}y^{4}+18x^{3}y^{3}+11x^{3}y^{4}+3x^{4}y^{4}\) .
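The degree-based edge partition used in this proof can be obtained mechanically from an edge list. The following is a minimal sketch on a hypothetical toy molecule (hydrogen-suppressed propene), not on the Lopinavir graph itself, whose edge list is not given in the text. It assumes, as the counts above suggest, that a double bond is stored as two parallel edges and that each parallel edge contributes to the vertex degree.

```python
from collections import Counter

def degree_edge_partition(edges):
    """Count the edges of a multigraph by the sorted degrees (j, k) of their endpoints.

    `edges` lists a bond once per unit of multiplicity, so a double bond
    appears twice and adds 2 to the degree of each of its endpoints.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    partition = Counter()
    for u, v in edges:
        j, k = sorted((degree[u], degree[v]))
        partition[(j, k)] += 1
    return dict(partition)

# Hypothetical toy example: propene, C1=C2-C3, with the double bond listed twice.
propene = [("C1", "C2"), ("C1", "C2"), ("C2", "C3")]
m = degree_edge_partition(propene)

# Each class (j, k) with count m_jk contributes the term m_jk * x^j * y^k.
print(" + ".join(f"{c}*x^{j}*y^{k}" for (j, k), c in sorted(m.items())))
```

For the toy graph the degrees are 2, 3 and 1, so the partition has two parallel edges in class (2, 3) and one edge in class (1, 3).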

Let \(\Gamma ^{*}_{(j,k)}\) denote the set of edges whose endpoints have neighborhood degree sums j and k, respectively, i.e. \(\Gamma ^{*}_{(j,k)} = \{uv \in E(\mathscr {L}): nd^{*}_u = j, nd^{*}_v = k \}\) , and let \(nm^{*}_{(j,k)}\) be the number of edges in \(\Gamma ^{*}_{(j,k)}\) . From Fig. 2 it is clear that \(nm^{*}_{(3,5)} = 2, nm^{*}_{(3,6)} = 1, nm^{*}_{(4,4)} = 1, nm^{*}_{(4,5)} = 1, nm^{*}_{(4,6)} = 3, nm^{*}_{(4,7)} = 4, nm^{*}_{(4,8)} = 2, nm^{*}_{(5,9)} = 1, nm^{*}_{(5,10)} = 1, nm^{*}_{(6,6)} = 10, nm^{*}_{(6,7)} = 14, nm^{*}_{(6,10)} = 1, nm^{*}_{(7,7)} = 3, nm^{*}_{(7,8)} = 11, nm^{*}_{(7,9)} = 1, nm^{*}_{(7,10)} = 1, nm^{*}_{(8,10)} = 3, nm^{*}_{(9,10)} = 1\) ; these counts again sum to 61. To derive the NM-polynomial of \(\mathscr {L}\) , we use Eq. ( 2 ).

The M-polynomial and NM-polynomial are computed to derive a range of ’D’ and ’NBD’ TI’s for the molecular multigraph representing Lopinavir. These findings are summarized in the following theorem. \(\square\)

Let \(\mathscr {L}\) be the molecular multigraph of Lopinavir. Then its ‘D’ and ‘NBD’ indices take the respective values listed in Table 3 .

Fig. 3. 3D-plot generation of ( a ) M-polynomial and ( b ) NM-polynomial of Lopinavir.

Initially, we determine the degree-based indices by referring to Table 1 . Let \(M(\mathscr {L};x,y) = t(x,y) = 3xy^{3}+2xy^{4}+4x^{2}y^{2}+7x^{2}y^{3}+13x^{2}y^{4}+18x^{3}y^{3}+11x^{3}y^{4}+3x^{4}y^{4}\) . Then we have,

\(M_1(\mathscr {L}) = (D_x+D_y)t(x,y)|_{x=y=1} =12xy^{3}+10xy^{4}+16x^{2}y^{2}+35x^{2}y^{3}+78x^{2}y^{4}+108x^{3}y^{3}+77x^{3}y^{4} +24x^{4}y^{4} = 360.\)

\(M_2(\mathscr {L}) = (D_xD_y)t(x,y)|_{x=y=1} = 9xy^{3}+8xy^{4}+16x^{2}y^{2}+42x^{2}y^{3}+104x^{2}y^{4}+162x^{3}y^{3}+132x^{3}y^{4}+48x^{4}y^{4} = 521.\)

\(mM_2(\mathscr {L}) = S_xS_yt(x,y)|_{x=y=1} = xy^{3}+\frac{2}{4}xy^{4}+x^{2}y^{2}+\frac{7}{6}x^{2}y^{3}+\frac{13}{8}x^{2}y^{4}+\frac{18}{9}x^{3}y^{3}+\frac{11}{12}x^{3}y^{4}+\frac{3}{16}x^{4}y^{4} = 8.3958\)

\(ReZG_3(\mathscr {L}) = D_xD_y(D_x+D_y)t(x,y)|_{x=y=1} = 36xy^{3}+40xy^{4}+64x^{2}y^{2}+210x^{2}y^{3}+624x^{2}y^{4}+972x^{3}y^{3}+924x^{3}y^{4}+384x^{4}y^{4} = 3254\)

\(F(\mathscr {L}) = (D_x^{2}+D_y^{2})t(x,y)|_{x=y=1} = 30xy^{3}+34xy^{4}+32x^{2}y^{2}+91x^{2}y^{3}+260x^{2}y^{4}+324x^{3}y^{3}+275x^{3}y^{4}+96x^{4}y^{4} = 1142\)

\(SDD(\mathscr {L}) = (S_xD_y+S_yD_x)t(x,y)|_{x=y=1} = \frac{30}{3}xy^{3}+\frac{34}{4}xy^{4}+\frac{32}{4}x^{2}y^{2}+\frac{91}{6}x^{2}y^{3}+\frac{260}{8}x^{2}y^{4}+\frac{324}{9}x^{3}y^{3} +\frac{275}{12}x^{3}y^{4}+ \frac{96}{16}x^{4}y^{4} = 139.0833\)

\(H(\mathscr {L}) = 2S_xJt(x,y)|_{x=1} = 2\left( \frac{7}{4}x^{4}+\frac{9}{5}x^{5}+\frac{31}{6}x^{6}+\frac{11}{7}x^{7}+\frac{3}{8}x^{8}\right) = 21.3262\)

\(I(\mathscr {L}) = S_xJD_xD_yt(x,y)|_{x=1} = \frac{25}{4}x^{4}+\frac{50}{5}x^{5}+\frac{266}{6}x^{6}+\frac{132}{7}x^{7}+\frac{48}{8}x^{8} = 85.4405\)

\(A(\mathscr {L}) = S_x^{3}Q_{-2}JD_x^{3}D_y^{3}t(x,y)|_{x=1} = 42.125x^{2}+60.7407x^{3}+309.0313x^{4}+152.064x^{5}+56.8889x^{6} = 620.8499\)

\(R_{\alpha }(\mathscr {L}) = D_x^{\alpha }D_y^{\alpha }t(x,y)|_{x=y=1} = 3(3)^{\alpha }+2(4)^{\alpha }+4(4)^{\alpha }+7(6)^{\alpha }+13(8)^{\alpha }+18(9)^{\alpha }+11(12)^{\alpha }+3(16)^{\alpha } = 22.1114\) for \(\alpha = -\frac{1}{2}\) .
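The operator expressions above reduce to weighted sums over the edge partition, which gives a quick way to check the reported numbers. The sketch below evaluates each degree-based index directly from the counts \(m_{(j,k)}\) quoted in the text. The per-edge formulas are read off from the expansions shown above rather than from Table 1, which is not reproduced here, and the helper name `edge_sum` is hypothetical.

```python
# Edge partition m_(j,k) of the Lopinavir multigraph, copied from the text.
M_JK = {(1, 3): 3, (1, 4): 2, (2, 2): 4, (2, 3): 7,
        (2, 4): 13, (3, 3): 18, (3, 4): 11, (4, 4): 3}

def edge_sum(f, partition=M_JK):
    """Return the sum of m_jk * f(j, k) over all edge classes (j, k)."""
    return sum(m * f(j, k) for (j, k), m in partition.items())

indices = {
    "M1": edge_sum(lambda j, k: j + k),
    "M2": edge_sum(lambda j, k: j * k),
    "mM2": edge_sum(lambda j, k: 1 / (j * k)),
    "ReZG3": edge_sum(lambda j, k: j * k * (j + k)),
    "F": edge_sum(lambda j, k: j**2 + k**2),
    "SDD": edge_sum(lambda j, k: (j**2 + k**2) / (j * k)),
    "H": edge_sum(lambda j, k: 2 / (j + k)),
    "I": edge_sum(lambda j, k: j * k / (j + k)),
    "A": edge_sum(lambda j, k: (j * k / (j + k - 2)) ** 3),
    "R(-1/2)": edge_sum(lambda j, k: (j * k) ** -0.5),
}

for name, value in indices.items():
    print(f"{name:8s} {value:.4f}")
```

Under these assumptions the sums agree with the values stated above, including \(M_2 = 521\) and the Randić-type value 22.1114 at \(\alpha = -\frac{1}{2}\).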

Next, we compute the neighborhood degree sum-based indices by taking into account \(NM^{*}(\mathscr {L}) = t(x,y) = 2x^{3}y^{5}+x^{3}y^{6}+x^{4}y^{4}+x^{4}y^{5}+3x^{4}y^{6}+4x^{4}y^{7}+2x^{4}y^{8}+x^{5}y^{9}+x^{5}y^{10}+10x^{6}y^{6}+14x^{6}y^{7}+x^{6}y^{10}+3x^{7}y^{7}+11x^{7}y^{8}+x^{7}y^{9}+x^{7}y^{10}+3x^{8}y^{10}+x^{9}y^{10}\) . By applying the operators of Table 1 to this polynomial, which follows from the edge partition \(\Gamma ^{*}_{(j,k)}\) , the ‘NBD’ indices are obtained in the same way, thus concluding the proof. The obtained values of the ‘D’ and ‘NBD’ indices, calculated using the M-polynomial and NM-polynomial, are displayed in Tables 3 and 4 , respectively. \(\square\)

QSPR analysis of selected antiviral drugs with its target properties

Regression analyses.

To clarify the physical significance of our results, we have included concise discussions on the effectiveness of the computed topological indices. These quantitative measures reveal key structural attributes, with higher values indicating enhanced stability and lower reactivity, and lower values suggesting potential reactivity sites. Our study validates the predictive power of these indices by demonstrating strong correlations with experimental properties, supporting their use in understanding structure-property relationships and guiding drug design and development. We highlight the practical applications in drug delivery and material design while acknowledging the need to consider molecular context and explore advanced methods for improved accuracy.

The correlations between the ‘D’ and ‘NBD’ based TI’s and the physicochemical properties of the antiviral drugs (COVID-19 drugs) are given in Tables 5 and 6 . From Table 5 we observe that the inverse sum indeg index (predictor) shows a strong positive relationship with boiling point (outcome variable), which is depicted in Fig. 4 .

Fig. 4. Inverse sum indeg index versus predicted boiling point.

Fig. 5. Comparison chart of ‘r’ values for multigraph versus simple graph: ‘D’.

From Fig. 5 we observe that, for surface tension (ST), molar refractivity (MR), molar volume (MV) and polarizability (P), the highest correlation coefficients ‘r’ obtained with the multigraph representation exceed those obtained with the simple-graph representation of the selected antiviral drugs. The existence of a double bond in a molecule can greatly impact its properties, including polarity, conjugation, and reactivity. These changes, in turn, can impact the molecule’s solubility, stability, and biological activity. For example, a double bond introduces regions of different electron density, resulting in a shift in polarity; it can make the molecule more or less polar depending on the surrounding atoms and functional groups. Molecular multigraphs can therefore provide a more detailed and nuanced representation of the chemical structure. For comparison, the highest correlation coefficients reported in 11 for the simple-graph representation of the seven drugs using degree-based indices are r = 0.9709 for MR, r = 0.9710 for P, r = 0.5115 for ST and r = 0.9108 for MV. The corresponding high ‘r’ values for the molecular multigraph are marked in bold with an asterisk (*) in Table 5 . Similarly, from Table 6 we observe that the neighborhood inverse sum indeg index (NI) (predictor variable) shows a strong positive relationship with boiling point (outcome variable), which is depicted in Fig. 6 .

Fig. 6. Neighborhood inverse sum indeg index versus predicted boiling point.

Fig. 7. Comparison chart of ‘r’ values for multigraph versus simple graph: ‘NBD’.

From Fig. 7 we observe that, for flash point (FP) and surface tension (ST), the highest correlation coefficients ‘r’ obtained with the multigraph representation exceed those obtained with the simple-graph representation of the selected antiviral drugs. For comparison, the highest correlation coefficients reported in 11 for the simple-graph representation of the seven drugs using neighborhood degree sum-based indices are r = 0.9629 for FP and r = 0.6682 for ST. The corresponding high ‘r’ values for the molecular multigraph are marked in bold with an asterisk (*) in Table 6 .

Note: For some properties, the highest correlations obtained with the multigraph are nearly identical to those of the simple graph, for both ‘D’ and ‘NBD’ based indices. For example, the simple-graph values from 11 are 0.9920 for BP and 0.9887 for E, whereas the multigraph values are 0.9864 for BP and 0.9827 for E. The correlation values thus differ only slightly, and some multigraph values are higher than their simple-graph counterparts. However, when there is a low correlation between chemical structure descriptors and a target property, it suggests that additional factors may play a more significant role in determining the target property. Further analysis or experimentation might be necessary to identify and understand those factors.

QSAR analyses of biological activity \(pIC_{50}\) versus degree-based and neighborhood degree sum-based indices as predictors

In this section, we used IBM SPSS Statistics Version 27.0.1.0 (available at https://www.ibm.com/support/pages/downloading-ibm-spss-statistics-27010 ) to carry out multiple linear regression analyses. The biological activity ( \(IC_{50}\) , expressed as \(pIC_{50}\) ) was used as the dependent variable, and several ‘D’ and ‘NBD’ based indices (see Table 1 ) were used as independent variables. \(IC_{50}\) , also known as half maximal inhibitory concentration, is a parameter that measures the effectiveness of a drug or compound in inhibiting a specific biological or biochemical process. It represents the concentration at which the drug blocks the target protein’s function by 50 %. \(pIC_{50}\) is a transformed version of \(IC_{50}\) , where the “p” stands for the negative logarithm (base 10) of the \(IC_{50}\) value. \(pIC_{50}\) is preferred over \(IC_{50}\) in regression analyses because it is more nearly linearly related to drug potency. The selection of the optimal multiple linear regression model was based on these statistical criteria: Fisher ratio (F), squared multiple correlation coefficient \((R^2)\) , adjusted correlation coefficient \((R^{2}_{adj})\) , Durbin–Watson value (DW), variance inflation factor (VIF), tolerance value and significance (Sig). The main difference between QSPR and QSAR is the type of property that is being predicted. QSPR models use statistical and mathematical methods to link the molecular structure of compounds to their physicochemical properties. QSAR models, on the other hand, use statistical and machine learning techniques to correlate the molecular structure of compounds with their biological activities.

MLR model and MLR analyses

Multiple linear regression (MLR) 55 is a statistical technique that explores the linear relationship between a dependent variable, here the target Y \((pIC_{50})\) , and multiple independent (predictor) variables X (2D descriptors). Its purpose is to find the best-fitting linear equation, that is, the one that minimizes the squared differences between the predicted and actual values of the dependent variable. The regression coefficients are estimated by the least squares technique, and the squared correlation coefficient \((r^2)\) measures how well the fitted equation represents the data points. The regression equation is formulated as follows:

\(Y = c + b_1 I_1 + b_2 I_2 + \cdots + b_n I_n\)

In the regression equation, the dependent variable is represented as Y, and the regression coefficients ‘b’ correspond to the independent variables ‘I’. The intercept or regression constant is denoted as ‘c’ 56 . Kirmani et al. 11 conducted a QSAR analysis on antiviral drugs represented as simple graphs, which suggested a weak association between biological activity \((pIC_{50})\) and TI’s. Inspired by their approach, we applied a similar analysis using molecular multigraphs for our selected drugs and obtained a well-fitting QSAR model by the backward elimination method, which is elaborated in the upcoming sections.
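As an illustration of the least squares fit described above, the following sketch solves the normal equations for the intercept c and the coefficients b in plain Python. The study itself used SPSS; the function names and the synthetic data below are hypothetical and serve only to show the mechanics.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_mlr(rows, y):
    """Least-squares fit of Y = c + b1*I1 + ... + bn*In; returns [c, b1, ..., bn]."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 models the intercept c
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

# Synthetic data generated from Y = 1 + 2*I1 - 0.5*I2 (not data from the study).
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4), (7, 7)]
y = [1 + 2 * i1 - 0.5 * i2 for i1, i2 in rows]
print([round(v, 6) for v in fit_mlr(rows, y)])
```

Because the synthetic response is exactly linear in the two predictors, the fit recovers the generating coefficients up to rounding. With only seven drugs, as in this study, every additional predictor uses up one of the few available degrees of freedom.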

Multicollinearity and VIF 57

Multicollinearity refers to high correlation among independent variables, which can result in unstable and unreliable regression coefficient estimates. The variance inflation factor (VIF) is a measure used to evaluate the presence of multicollinearity in regression analysis, commonly reported by tools such as SPSS. It is defined as \(VIF = \frac{1}{1-R^2}\) , where \(R^2\) is the coefficient of determination obtained by regressing one independent variable on all the others. By construction VIF is at least 1; values between 1 and 10 are commonly taken to indicate acceptable multicollinearity, while values above 10 suggest serious multicollinearity. Our initial regression models showed signs of multicollinearity, as some independent variables had correlation coefficients near 1 and corresponding VIF values outside the range of 1 to 10. This implies that the model may struggle to accurately estimate the individual effects of these correlated variables. Hence, it is crucial to address this issue to ensure the reliability and accuracy of our regression results.
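For a model with exactly two predictors, the \(R^2\) in the VIF formula is simply the squared Pearson correlation between them, which makes the definition easy to illustrate. The sketch below covers only that special case; with more predictors, each one would have to be regressed on all the others. The function names and the sample descriptor values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """Return (VIF, tolerance) for a model with exactly two predictors."""
    r2 = pearson_r(x1, x2) ** 2
    tolerance = 1 - r2          # tolerance is the reciprocal of VIF
    return 1 / tolerance, tolerance

# Hypothetical descriptor values for seven compounds (not data from the study).
index_a = [1, 2, 3, 4, 5, 6, 7]
index_b = [2, 1, 5, 3, 8, 4, 7]
vif, tol = vif_two_predictors(index_a, index_b)
print(f"VIF = {vif:.4f}, tolerance = {tol:.4f}")
```

Uncorrelated predictors give the minimum value VIF = 1, and the value grows without bound as the correlation approaches 1.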

QSAR model for \(pIC_{50}\)

The correlation matrix is a helpful tool for detecting multicollinearity in regression models. It displays the pairwise correlations between multiple variables, indicating the strength and direction of their relationship. By examining the matrix for high correlations between independent variables, we can identify multicollinearity and take appropriate measures to address it. In the Supplementary Table S1 , we present the correlation matrix between various ‘D’ and ‘NBD’ based indices. In QSAR analysis, one of the primary goals is to identify the molecular descriptors or features that are most strongly correlated with the target property. When many molecular descriptors are available, including all of them in the model may not be practical. Variable selection techniques are therefore used to identify the most significant descriptors, which helps improve the predictive performance of the model. Stepwise regression is one such variable selection method that is commonly used in QSAR analysis. It involves iteratively adding or removing descriptors based on their statistical significance in predicting the target property. The process continues until no more significant descriptors remain, resulting in an effective model.

We began constructing simple linear regression models using topological indices that had the lowest correlation (specifically, 0.1170 between \(NDe_3\) and \(NmM_2\) ). This led to the development of two mono-parameter models. However, both models demonstrated a weak correlation with \(pIC_{50}\) .

\(n=7, r=0.3976, R^2=0.1581, R_A^{2} = -0.01026, SE=0.4512, F=0.9390, PE=0.2121\)

Here n : number of drugs used, r (R): simple (multiple) correlation coefficient, \(R_A^{2}\) : adjusted \(R^{2}\) , SE: standard error of the estimate, F: Fisher’s statistic, PE: probable error.

By employing stepwise regression analysis, various combinations of two topological indices were examined. The following bi-parametric model shows improved statistical measures in comparison to the mono-parametric model (Model 1).

\(n=7, r=0.7292, R^2=0.5317, R_A^{2}=0.2976, SE=0.3762, F=2.2711, PE= 0.1179\) .

To improve the statistical parameters of the models, trials were conducted to determine the correlation between three combined TI’s and the biological activity \(pIC_{50}\) . However, the resulting model exhibited only marginal improvements in its statistical measures.

\(n=7, r=0.8950, R^2=0.8011, R_A^{2}=0.6022, SE=0.2831, F=4.0282, PE= 0.0501\) .

By applying successive Stepwise regression, a tetra-parametric model was derived, showcasing notable enhancements in the statistical parameters.

\(n=7, r=0.9689, R^2=0.9389, R_A^{2}=0.8167, SE=0.1921, F=7.6844, PE= 0.0154\) .

After further stepwise regression, a penta-parametric model was obtained. It has a higher \(R^2\) than the tetra-parametric model, but a lower \(R_A^{2}\) and F.

\(n=7, r=0.9819, R^2=0.9642, R_A^{2}=0.7854, SE=0.2079, F=5.3922, PE= 0.0090\) .

In the aforementioned QSAR models, the F-value is the ratio between the variability accounted for by the model and the remaining variability ascribed to error. It is used as an indicator of the model’s statistical significance, with a higher F-value suggesting a greater probability of statistical significance. The probable error (PE) of the correlation coefficient is computed as \(PE = \frac{2(1-r^2)}{3\sqrt{n}}\) 56 ; it estimates the sampling uncertainty of r and should not be confused with the type I (alpha) error of hypothesis testing. The p-value is a statistical measure that evaluates the likelihood of observing the given outcomes if the null hypothesis is true. It quantifies the level of evidence against the null hypothesis, indicating the strength of the observed results. A predetermined significance level, commonly set at 0.05, is used as a threshold to determine the statistical significance of the study findings and decide whether to reject the null hypothesis. In our QSAR models above, the results were not significant, as the p-values were greater than 0.05. Although selecting the least correlated variables reduces the problem of pairwise correlation (the correlation between two variables), it does not account for the possibility of higher-order correlations among the variables (multicollinearity). Since all p-values were greater than 0.05, we discarded these models and their predictor variables. To mitigate this problem, we used the backward elimination method. The objective was to identify a subset of predictor variables that exhibited the most robust association with the response variable \((pIC_{50})\) while avoiding over-fitting due to an excessive number of predictors.
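The summary statistics quoted for the five models can be approximately reproduced from n, \(R^2\) and the number of predictors k alone. The sketch below uses the standard textbook formulas for adjusted \(R^2\) and the overall F ratio together with the PE formula given in the text, and it assumes that the mono- to penta-parametric models contain k = 1 to 5 predictors. Small differences from the reported F values are expected because the published \(R^2\) values are rounded.

```python
from math import sqrt

def model_stats(n, k, r2):
    """Return (adjusted R^2, F ratio, probable error) for an MLR model.

    n: number of observations, k: number of predictors, r2: R squared.
    """
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    f_ratio = (r2 / k) / ((1 - r2) / (n - k - 1))
    pe = 2 * (1 - r2) / (3 * sqrt(n))
    return adj_r2, f_ratio, pe

# R^2 values as reported in the text for models with k = 1..5 predictors, n = 7.
REPORTED_R2 = {1: 0.1581, 2: 0.5317, 3: 0.8011, 4: 0.9389, 5: 0.9642}

for k, r2 in REPORTED_R2.items():
    adj_r2, f_ratio, pe = model_stats(7, k, r2)
    print(f"k={k}: adj R^2={adj_r2:.4f}, F={f_ratio:.4f}, PE={pe:.4f}")
```

The calculation also shows why adjusted \(R^2\) falls for the five-predictor model: with n = 7 only one residual degree of freedom remains, so the penalty term \((n-1)/(n-k-1)\) rises to 6.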

Backward elimination method and validation

Backward elimination is a feature selection method used in statistical modeling and machine learning. It aims to identify the most relevant subset of features (independent variables) for a given predictive model. The method starts with a full model that includes all available features and iteratively eliminates features that are found to be non-significant. See 58 for a QSAR study that uses TI’s with the backward elimination method. By conducting a 2D-QSAR analysis on the biological activity \(pIC_{50}\) of antiviral drugs, we generated multiple QSAR models. During the stepwise regression process, we identified and eliminated five independent variables that exhibited insignificant associations with the \(pIC_{50}\) (biological activity) outcome. Initially, our study encompassed a total of 18 independent (predictor) variables; after removing the insignificant ones, 13 predictors remained: \(M_1\) , F , \(M_2\) , H , SDD , \(mM_2\) , A , NH , I , \(NM_1\) , \(ReZG_3\) , \(NDe_5\) and NI . Backward elimination was started from these 13 predictors with the aim of identifying the subset of predictors (independent variables) most strongly associated with \(pIC_{50}\) . The best linear model for \(pIC_{50}\) contains three topological indices, \(ReZG_3\) , \(NDe_5\) and NH . This model, model 3 in Table 7 , showed the best combination of predictors based on various statistical parameters.
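The elimination loop can be sketched as follows. SPSS removes predictors according to a p-value threshold; to keep the example free of statistical tables, this sketch substitutes a different criterion and drops a predictor only while doing so raises the adjusted \(R^2\). It is therefore an illustration of the general idea rather than the procedure used in the study, and the predictor names and data are hypothetical.

```python
def ols_r2(columns, y):
    """Fit y on the given predictor columns plus an intercept; return R^2."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(d[i] * d[j] for d in design) for j in range(p)]
         + [sum(d[i] * yi for d, yi in zip(design, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    fitted = [sum(b * d for b, d in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def backward_eliminate(predictors, y):
    """Remove one predictor at a time while adjusted R^2 keeps improving."""
    n = len(y)
    kept = dict(predictors)
    best = adjusted_r2(ols_r2(list(kept.values()), y), n, len(kept))
    while len(kept) > 1:
        trials = {}
        for name in kept:
            cols = [v for key, v in kept.items() if key != name]
            trials[name] = adjusted_r2(ols_r2(cols, y), n, len(cols))
        drop = max(trials, key=trials.get)
        if trials[drop] <= best:
            break  # no single removal improves the model
        best = trials[drop]
        del kept[drop]
    return list(kept), best

# Hypothetical data: y depends on A and B plus small noise; C is irrelevant.
predictors = {"A": [1, 2, 3, 4, 5, 6, 7, 8],
              "B": [2, 1, 5, 3, 8, 4, 7, 6],
              "C": [1, 0, 1, 0, 0, 1, 0, 1]}
noise = [0.05, -0.05, 0.02, -0.03, 0.04, -0.02, 0.03, -0.04]
y = [1 + 2 * a - b + e for a, b, e in zip(predictors["A"], predictors["B"], noise)]
print(backward_eliminate(predictors, y))
```

Whatever criterion is used, the loop has the same shape: start from the full model, evaluate every single-predictor removal, and stop when no removal is justified.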

Validation: Durbin–Watson statistics and tolerance 59

The Durbin–Watson (DW) statistic measures autocorrelation in regression residuals, that is, correlation between successive residuals, which violates the assumption of independent errors. It ranges from 0 to 4: a value of 2 indicates no autocorrelation, values below 2 indicate positive autocorrelation, and values above 2 indicate negative autocorrelation. A DW value close to 2 therefore supports the assumption that the residuals are uncorrelated. In our final QSAR model 3, the DW value is around 2, indicating that the errors are uncorrelated. Tolerance is used as an indicator of multicollinearity, measuring the correlation among independent variables in a model. It is the reciprocal of VIF and ranges from 0 to 1: a tolerance near 1 indicates a low degree of correlation among predictor variables and hence little multicollinearity, whereas a tolerance close to 0 indicates high correlation among predictors and a potential multicollinearity problem.
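The DW statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch, with a hypothetical function name and made-up residuals rather than the values in Table 8:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residuals for seven observations (illustrative only).
example = [0.10, -0.20, 0.15, -0.05, 0.10, -0.10, 0.00]
print(round(durbin_watson(example), 3))

# Limiting behaviour: identical residuals give 0, alternating signs push toward 4.
print(durbin_watson([1, 1, 1, 1]), durbin_watson([1, -1, 1, -1]))
```

Note that the statistic depends on the order of the observations, so for a set of drugs without a natural ordering it is informative only relative to the order in which the rows were entered.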

Backward elimination typically uses a significance threshold (p-value) to determine whether a predictor should be removed from the model. If a predictor already exceeds the significance threshold at the beginning, it is considered non-significant and excluded directly without further evaluation. In our analysis, 8 of the 13 predictors did not meet the required statistical criteria (p-values, VIF and tolerance values) and were excluded from further analysis. The statistical parameters indicated that these predictors did not contribute significantly to the model and may have exhibited multicollinearity. The remaining 5 predictors were carried forward to backward elimination, and the resulting models are presented in Table 7 . Among these, model 3 is the best for predicting the biological activity \(pIC_{50}\) according to the following criteria: \(VIF < 5\) , tolerance values not close to zero, DW = 1.850 and all p-values less than 0.05.

Ordinary residuals or regular residuals 59

Regular Residual \(=\) Observed Value − Predicted Value. In simpler words, a residual is the difference between the observed value of the dependent variable and the value estimated by a regression model. It represents the variability that the model was unable to explain, and it corresponds to the vertical distance between an observed data point and the regression line or curve. The comparison between the observed and predicted values of the biological activity \(pIC_{50}\) for seven antiviral drugs is presented in Table 8 . Figure 8 illustrates the linear relationship between the observed \(pIC_{50}\) values and the predicted \(pIC_{50}\) values obtained from model 3 for the aforementioned drugs.

Fig. 8. Comparison between observed and predicted values of \(pIC_{50}\) .

This study delves into the evaluation of various antiviral drugs for treating COVID-19, utilizing molecular multigraphs to analyze their chemical structures. Through edge partition techniques, M-polynomial and NM-polynomial expressions were derived, leading to the computation of ’D’ and ’NBD’ based indices. The research also involved a thorough QSPR investigation focusing on antiviral drugs as multigraphs, showcasing the predictive power of computed topological indices (TI’s) in determining physicochemical properties. Notably, the inverse sum indeg and neighborhood inverse sum indeg indices exhibited a strong positive correlation with boiling point (BP), surpassing other indices.

Further, the biological activity \(pIC_{50}\) of these antiviral drugs was estimated in a QSAR analysis using multiple linear regression in conjunction with the backward elimination approach. The validation criteria were chosen to assess the accuracy and predictive capability of the MLR model. The results indicate that the MLR model is an effective tool for estimating \(pIC_{50}\) , with specific TI’s such as NH , \(NDe_5\) , and \(ReZG_3\) showing significant predictive potential. In addition, the observed and predicted \(pIC_{50}\) values of the drugs for the best model, evaluated using cross validation techniques, show only minor variation, resulting in low residuals.

The study highlights the importance of considering multigraphs as graph models, offering a novel perspective on drug connectivity analysis. By diverging from conventional approaches focused on simple graphs, the research has provided insights into optimizing the drug selection process. In conclusion, an open challenge remains in incorporating chemometric methods (statistical and mathematical techniques for analyzing chemical data) to further refine these models. Using these techniques, researchers can advance our understanding of drug behavior and improve strategies for enhancing drug effectiveness.

Data availability

The paper includes the information used to verify the study’s findings.

Pillaiyar, T., Manickam, M., Namasivayam, V., Hayashi, Y. & Jung, S. H. An overview of severe acute respiratory syndrome-coronavirus (SARS-COV) 3cl protease inhibitors: Peptidomimetics and small molecule chemotherapy. J. Med. Chem. 59 , 6595–6628 (2016).

Hite, G. Medicinal Chemistry: A Series of Monographs: By George deStevens 1st edn. (Academic Press, 1964).

Brezovnik, S., Tratnik, N. & Žigert Pleteršek, P. Weighted Wiener indices of molecular graphs with application to alkenes and alkadienes. Mathematics 9 , 153 (2021).

Zakharov, A. B., Tsarenko, D. K. & Ivanov, V. V. Topological characteristics of iterated line graphs in the QSAR problem: A multigraph in the description of properties of unsaturated hydrocarbons. Struct. Chem. 32 , 1629–1639 (2021).

Hayat, S., Alanazi, S. J. & Liu, J. B. Two novel temperature-based topological indices with strong potential to predict physicochemical properties of polycyclic aromatic hydrocarbons with applications to silicon carbide nanotubes. Phys. Scr. 99 , 055027 (2024).

Hayat, S., Mahadi, H., Alanazi, S. J. & Wang, S. Predictive potential of eigenvalues-based graphical indices for determining thermodynamic properties of polycyclic aromatic hydrocarbons with applications to polyacenes. Comput. Mater. Sci. 238 , 112944 (2024).

Hayat, S. & Liu, J. B. Comparative analysis of temperature-based graphical indices for correlating the total \(\uppi\) -electron energy of benzenoid hydrocarbons. Int. J. Mod. Phys. B 2550047 (2024).

Hayat, S., Khan, A., Ali, K. & Liu, J. B. Structure-property modeling for thermodynamic properties of benzenoid hydrocarbons by temperature-based topological indices. Ain Shams Eng. J. 15 , 102586 (2024).

Hayat, S. Distance-based graphical indices for predicting thermodynamic properties of benzenoid hydrocarbons with applications. Comput. Mater. Sci. 230 , 112492 (2023).

Hayat, S., Suhaili, N. & Jamil, H. Statistical significance of valency-based topological descriptors for correlating thermodynamic properties of benzenoid hydrocarbons with applications. Comput. Theor. Chem. 1227 , 114259 (2023).

Kirmani, S. A. K., Ali, P. & Azam, F. Topological indices and QSPR/QSAR analysis of some antiviral drugs being investigated for the treatment of Covid-19 patients. Int. J. Quantum Chem. 121 , e26594 (2021).

Article CAS PubMed Google Scholar

Bokhary, S. A. U. H., Siddiqui, M. K. A. & Cancan, M. On topological indices and QSPR analysis of drugs used for the treatment of breast cancer. Polycycl. Arom. Compds. 42 , 6233–6253 (2022).

Shirakol, S., Kalyanshetti, M. & Hosamani, S. M. QSPR analysis of certain distance-based topological indices. Appl. Math. Nonlinear Sci. 4 , 371–386 (2019).

Article MathSciNet Google Scholar

Shanmukha, M. C., Basavarajappa, N. S., Shilpa, K. C. & Usha, A. Degree-based topological indices on anticancer drugs with QSPR analysis. Heliyon 6 (2020).

Kirmani, S. A. K., Ali, P., Azam, F. & Alvi, P. A. On ve-degree and ev-degree topological properties of hyaluronic acid-anticancer drug conjugates with QSPR. J Chem. 2021 , 1–23 (2021).

Arockiaraj, M., Greeni, A. & Kalaam, A. Linear versus cubic regression models for analyzing generalized reverse degree based topological indices of certain latest corona treatment drug molecules. Int. J. Quantum Chem. 123 , e27136 (2023).

Zaman, S., Jalani, M., Ullah, A. & Saeedi, G. Structural analysis and topological characterization of sudoku nanosheet. J. Math. (2022).

Ullah, A., Zaman, S., Hamraz, A. & Saeedi, G. Network-based modeling of the molecular topology of fuchsine acid dye with respect to some irregular molecular descriptors. J. Chem. (2022).

Ullah, A., Zaman, S. & Hamraz, A. Zagreb connection topological descriptors and structural property of the triangular chain structures. Phys. Scr. 98 , 025009 (2023).

Article ADS Google Scholar

Zaman, S., Jalani, M., Ullah, A., Ali, M. & Shahzadi, T. On the topological descriptors and structural analysis of cerium oxide nanostructures. Chem. Pap. 77 , 2917–2922 (2023).

Zaman, S., Jalani, M., Ullah, A., Ahmad, W. & Saeedi, G. Mathematical analysis and molecular descriptors of two novel metal-organic models with chemical applications. Sci. Rep. 13 , 5314 (2023).

Article ADS CAS PubMed PubMed Central Google Scholar

Ullah, A., Bano, Z. & Zaman, S. Computational aspects of two important biochemical networks with respect to some novel molecular descriptors. J. Biomol. Struct. Dyn. 42 , 791–805 (2024).

Hakeem, A., Ullah, A. & Zaman, S. Computation of some important degree-based topological indices for γ-graphyne and zigzag graphyne nanoribbon. Mol. Phys. 121 , e2211403 (2023).

Zaman, S., Salman, M., Ullah, A., Ahmad, S. & Abdelgader Abas, M. Three-dimensional structural modelling and characterization of sodalite material network concerning the irregularity topological indices. J. Math. 1–9 (2023).

Zaman, S., Ullah, A. & Shafaqat, A. Structural modeling and topological characterization of three kinds of dendrimer networks. Eur. Phys. J. E 46 , 36 (2023).

Ullah, A., Zaman, S., Hussain, A., Jabeen, A. & Belay, M. Derivation of mathematical closed form expressions for certain irregular topological indices of 2D nanotubes. Sci. Rep. 13 , 11187 (2023).

Trudeau, R. J. Introduction to Graph Theory (Courier Corporation, 2013).

Marrero-Ponce, Y. Linear indices of the “molecular pseudograph’s atom adjacency matrix’’: Definition, significance-interpretation, and application to qsar analysis of flavone derivatives as hiv-1 integrase inhibitors. J. Chem. Inf. Comput. Sci. 44 , 2010–2026 (2004).

Kier, L. & Hall, L. Molecular connectivity VII: Specific treatment of heteroatoms. J. Pharmaceut. Sci. 65 , 1806–1809 (1976).

Stevanović, D. Hosoya polynomial of composite graphs. Discrete Math. 235 (1–3), 237–244 (2001).

KHADIKAR, P. On a novel structural de-scriptor pi. Natl. Acad. Sci. Lett. 23 , 113–118 (2000).

MathSciNet CAS Google Scholar

Schultz, H. P. Topological organic chemistry. 1. Graph theory and topological indices of alkanes. J. Chem. Inf. Comput. Sci. 29 .

Deutsch, E. & Klavžar, S. M-polynomial and degree-based topological indices. arXiv preprint arXiv: 1407.1592 (2014).

Mondal, S., De, N. & Pal, A. On some general neighborhood degree based topological indices. Int. J. Appl. Math. 32 , 1037 (2019).

Shanmukha, M. C., Basavarajappa, N. S., Usha, A. & Shilpa, K. C. Novel neighbourhood redefined first and second Zagreb indices on carborundum structures. J. Appl. Math. Comput. 66 , 263–276 (2021).

Ghorbani, M. & Hosseinzadeh, M. A. Computing abc4 index of nanostar dendrimers. Optoelectron. Adv. Mater. Rapid Commun. 4 , 1419–1422 (2010).

CAS Google Scholar

Graovac, A., Ghorbani, M. & Hosseinzadeh, M. A. Computing fifth geometric-arithmetic index for nanostar dendrimers. J. Discrete Math. Appl. 1 , 33–42 (2011).

Mondal, S., De, N. & Pal, A. On some new neighbourhood degree based indices. Acta Chem. Iasi 27 , 31–46 (2019).

Mondal, S., Siddiqui, M. K., De, N. & Pal, A. Neighborhood m-polynomial of crystallographic structures. Biointerface Res. Appl. Chem. 11 .

Pizzorno, A. et al. In vitro evaluation of antiviral activity of single and combined repurposable drugs against SARS-COV-2. Antiviral Res. 181 , 104878 (2020).

Fan, S. et al. Research progress on repositioning drugs and specific therapeutic drugs for SARS-COV-2. Future Med. Chem. 12 , 1565–1578 (2020).

Jang, M. E. A. Tea polyphenols EGCG and theaflavin inhibit the activity of SARS-COV-2 3cl-protease in vitro. Evid.-Based Complem. Altern. Med. (2020).

Cicka, D. & Sukhatme, V. P. Available drugs and supplements for rapid deployment for treatment of covid-19. J. Mol. Cell Biol. 13 , 232–236 (2021).

Gutman, I. & Trinajstic, N. Graph theory and molecular orbitals: Total pi-electron energy of alternant hydrocarbons. Chem. Phys. Lett. 17 , 535–538 (1972).

Article ADS CAS Google Scholar

Miličević, A., Nikolić, S. & Trinajstić, N. On reformulated Zagreb indices. Mol. Divers. 8 , 393–399 (2004).

Article PubMed Google Scholar

Ranjini, P. S., Lokesha, V. & Usha, A. Relation between phenylene and hexagonal squeeze using harmonic index. Int. J. Graph Theory 1 , 116–121 (2013).

Ghorbani, M. & Hosseinzadeh, M. The third version of Zagreb index. Discrete Math. Algorithms Appl. 5 , 1350039 (2013).

Furtula, B. & Gutman, I. A forgotten topological index. J. Math. Chem. 53 , 1184–1190 (2015).

Article MathSciNet CAS Google Scholar

Randic, M. Characterization of molecular branching. J. Am. Chem. Soc. 97 , 6609–6615 (1975).

Favaron, O., Mahéo, M. & Saclé, J. F. Some eigenvalue properties in graphs (conjectures of graffiti-II). Discrete Math. 111 , 197–220 (1993).

Vukičević, D. & Gašperov, M. Bond additive modelling 1. Adriatic indices. Croatica Chem. Acta 83 , 243–260 (2010).

Fajtlowicz, S. On conjectures of graffiti-II. Congr. Numer. 60 , 187–197 (1987).

MathSciNet Google Scholar

Furtula, B., Graovac, A. & Vukičević, D. Augmented Zagreb index. J. Math. Chem. 48 , 370–380 (2010).

Hosamani, S. M. Computing Sanskruti index of certain nanostructures. J. Appl. Math. Comput. 54 , 425–433 (2017).

Cohen, J., Cohen, P., West, S. G. & Aiken, L. S. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (Routledge, 2013).

Devillers, J. Neural Networks in QSAR and Drug Design (Academic Press, 1996).

Johnson, R. A. & Wichern, D. W. Applied Multivariate Statistical Analysis (2002).

Esmaeili, E. & Shafiei, F. QSAR study on the physico-chemical parameters of barbiturates by using topological indices and MLR method. Bulgar. Chem. Commun. 50 , 44–49 (2018).

James, G., Witten, D., Hastie, T. & Tibshirani, R. An Introduction to Statistical Learning Vol. 112 (Springer, 2013).

Book Google Scholar

Download references

Author information

These authors contributed equally: Ugasini Preetha P, M. Suresh, Fikadu Tesgera Tolasa and Ebenezer Bonyah.

Authors and Affiliations

Department of Mathematics, College of Engineering and Technology, SRM Institute of Science and Technology, Kattankulathur, Tamil Nadu, 603203, India

Ugasini Preetha P & M. Suresh

Department of Mathematics, Dambi Dollo University, Oromia, Ethiopia

Fikadu Tesgera Tolasa

Department of Mathematics Education, Akenten Appiah Menka University of Skills Training and Entrepreneurial Development, Kumasi, Ghana

Ebenezer Bonyah

Contributions

M. Suresh introduced the parameter and helped with proofreading. Ugasini Preetha P analyzed, calculated and computed the main results. Fikadu Tesgera Tolasa helped provide the drug properties and managed the article overall. Ebenezer Bonyah helped provide software tools and contributed to the graphical work. All authors contributed equally to the manuscript.

Corresponding authors

Correspondence to M. Suresh or Fikadu Tesgera Tolasa.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Table S1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

P, U.P., Suresh, M., Tolasa, F.T. et al. QSPR/QSAR study of antiviral drugs modeled as multigraphs by using TI’s and MLR method to treat COVID-19 disease. Sci Rep 14, 13150 (2024). https://doi.org/10.1038/s41598-024-63007-w

Received: 09 April 2024

Accepted: 23 May 2024

Published: 07 June 2024

DOI: https://doi.org/10.1038/s41598-024-63007-w

Antiviral drugs
M-polynomial
NM-polynomial
Molecular multigraphs
Multiple linear regression
