Improved Bevirimat resistance prediction by combination of structural and sequence-based classifiers
© Dybowski et al; licensee BioMed Central Ltd. 2011
Received: 15 June 2011
Accepted: 14 November 2011
Published: 14 November 2011
Maturation inhibitors such as Bevirimat are a new class of antiretroviral drugs that hamper the cleavage of HIV-1 proteins into their functionally active forms. They bind to these preproteins and inhibit their cleavage by the HIV-1 protease, resulting in non-functional virus particles. Nevertheless, there exist mutations in this region leading to resistance against Bevirimat. Highly specific and accurate tools to predict resistance to maturation inhibitors can help to identify patients who might benefit from the usage of these new drugs.
We tested several methods to improve Bevirimat resistance prediction in HIV-1. It turned out that combining structural and sequence-based information in classifier ensembles led to accurate and reliable predictions. Moreover, we were able to identify the most crucial regions for Bevirimat resistance computationally, which are in line with experimental results from other studies.
Our analysis demonstrated the use of machine learning techniques to predict HIV-1 resistance against maturation inhibitors such as Bevirimat. New maturation inhibitors are already under development and might enlarge the arsenal of antiretroviral drugs in the future. Thus, accurate prediction tools are very useful to enable a personalized therapy.
HIV and Bevirimat
Bevirimat (BVM) belongs to a new class of antiretroviral drugs inhibiting the maturation of HIV-1 particles to infectious virions. BVM prevents the final cleavage of precursor protein p25 to p24 and p2. In electron microscopy, these immature particles failed to build a capsid composed of a nucleocapsid (p7) and RNA surrounded by a cone-shaped core assembled from p24 proteins. In selection experiments with BVM, mutations emerged at the Gag cleavage site p24/p2 and conferred BVM resistance in phenotypic resistance tests. In contrast, especially natural polymorphisms in the QVT-motif of p2 hampered the effective suppression of viral replication in clinical phase II trials; these polymorphisms also increased the measured resistance factors in cell culture experiments. It was recently shown by Keller et al. that BVM stabilizes the immature Gag lattice and thus prevents cleavage.
Machine learning techniques are widely used to predict drug resistance in HIV-1. For instance, Beerenwinkel et al. used support vector machines and decision trees to predict drug resistance of HIV-1 against several protease and reverse transcriptase inhibitors. Other groups also employed artificial neural networks [5, 6], rule-based systems and random forests [8, 9].
In our recent publication, we demonstrated the use of machine learning techniques for the prediction of Bevirimat resistance from genotype. We tested artificial neural networks, support vector machines, rule-based systems and random forests trained on p2 sequences derived from resistant and susceptible virus strains and applied different descriptor sets. Descriptors map the amino acid symbols onto numerical values representing physico-chemical properties of the amino acids. Because the p2 sequences contain insertions and deletions and thus differ in length, they were preprocessed to fulfill the constraint imposed by machine learning approaches, i.e. a fixed input dimension of the data. We used a multiple sequence alignment to align the sequences and subsequently encoded them with five different descriptors, namely the hydrophobicity scale of Kyte and Doolittle, molecular weight, isoelectric point, pKa and HIV-1 cleavage probability. Finally, all models were trained using the encoded protein sequences and evaluated using 100-fold leave-one-out cross-validation. The random forest models trained on hydrophobicity-encoded p2 sequences performed best (AUC = 0.927 ± 0.001) with regard to Wilcoxon signed-rank tests on the AUC distributions. Moreover, earlier studies have shown that RFs are highly stable and robust in comparison with other classifiers. RFs also provide an importance estimation for the variables in the data set. The importance of each variable, i.e. sequence position in p2, can be assessed to gain possible biological insights into resistance mechanisms.
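The descriptor encoding described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function name encode is hypothetical, and the input is simply the three-residue QVT motif mentioned earlier, not a sequence from the dataset. The values are the Kyte and Doolittle hydropathy indices.

```python
# Kyte-Doolittle hydropathy index for the 20 proteinogenic amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale=KYTE_DOOLITTLE):
    """Map each amino acid symbol to its numerical descriptor value.

    'encode' is a hypothetical helper name used for illustration only.
    """
    return [scale[residue] for residue in sequence.upper()]


# Illustrative input: the QVT motif, not a full p2 sequence.
print(encode("QVT"))  # [-3.5, 4.2, -0.7]
```

Any other descriptor can be plugged in by passing a different 20-entry dictionary as scale.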
Classifier ensembles have been shown to often lead to better prediction performance compared to single classifiers in several studies [15, 16]. The random forest models used in our initial study are already examples of classifier ensembles, consisting of independent decision trees that are based on the same feature set. However, classifier ensembles can also be constructed by combining different datasets or different representations (here descriptors) of the same data. In order to combine the outputs of single classifiers for a final decision of an ensemble, several fusion methods have been proposed, ranging from simple mathematical functions, such as min and max, to second-level learning, also called stacking. In the quest for optimal classifier ensembles, genetic algorithms (GA) have been suggested in various studies [18–20]. Genetic algorithms mimic the idea of evolution and its natural processes of mutation, recombination and selection of individuals. GAs are used to heuristically solve optimization problems with a complex fitness landscape and are frequently applied in biomedical research [21–23]. The central components of a genetic algorithm are the population and the fitness function evaluating the individuals (chromosomes) therein. During each generation the fittest individuals (parents) are selected by methods such as stochastic universal sampling or tournament selection to generate a new generation of slightly varied offspring. Variations are introduced through so-called genetic operators, e.g. mutation or recombination, that impose the genetic variability and sample the fitness landscape. Generations of individuals are established until one or more termination conditions are reached.
Material and methods
In this study we used the data aggregated by Heider et al., consisting of p2 sequences of viruses with assay-determined resistance factors. These data were collected from several studies that have investigated polymorphisms in the p2 region by phenotypic BVM resistance assays. The cut-off value of the resistance factor used to define the classes "resistant to BVM" and "susceptible to BVM" was set according to Heider et al. Duplicate sequences in each of the classes were removed prior to analysis. The final dataset consisted of 43 p2 sequences of HIV-1 strains with susceptibility or intermediate resistance to BVM and 112 sequences of resistant strains. The length of the p2 sequences in the data set is 20.77 ± 0.43 residues. The p2 sequences have a very low sequence identity/similarity. Only six positions within the peptides are conserved, namely 359-361 and 365-367. Position 357 shows only small similarity, whereas positions 358 and 362 show higher similarity among the sequences. All other positions, especially in the N-terminal part, are highly diverse. The highly conserved regions are marked with an asterisk in the wildtype sequence:
Area under the curve
Models were compared based on their resulting AUC distributions from the 10-fold leave-one-out cross-validation runs using Wilcoxon signed-rank tests [26, 27]. The null hypothesis was that there are no differences between the compared classifiers. 95% confidence intervals of the AUC were calculated by t-testing.
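As a brief illustration of the performance measure, the AUC can be computed as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The following is a minimal Python sketch under that definition; the function name auc and the toy scores are assumptions for illustration, and the study itself relied on R tooling.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier outputs, higher means "more likely positive".
    labels: 1 for the positive class (e.g. resistant), 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.75
```

The pairwise form is quadratic in the number of samples, which is unproblematic for a dataset of 155 sequences.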
The Interpol package was used to encode p2 sequences using all 531 descriptor sets of the AAindex database and to normalize the feature length using a linear interpolation, resulting in 531 numerically encoded datasets. The feature length was set to 21, representing the most common sequence length found in the dataset.
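The length normalization can be sketched in Python as follows. This is not the Interpol implementation: it is a minimal linear interpolation that assumes the first and last encoded values are kept and the remaining points are spaced evenly between them. The function name interpolate is hypothetical.

```python
def interpolate(values, target_length=21):
    """Linearly resample a numeric vector to a fixed length.

    Assumption: endpoints are preserved and intermediate points are
    spaced evenly along the original index axis.
    """
    n = len(values)
    if n == 0:
        raise ValueError("cannot interpolate an empty vector")
    if n == 1 or target_length == 1:
        return [values[0]] * target_length
    result = []
    for i in range(target_length):
        position = i * (n - 1) / (target_length - 1)  # index in the original
        lower = int(position)
        upper = min(lower + 1, n - 1)
        fraction = position - lower
        result.append(values[lower] * (1 - fraction) + values[upper] * fraction)
    return result


# A two-point vector stretched to three points gains the midpoint.
print(interpolate([0.0, 2.0], target_length=3))  # [0.0, 1.0, 2.0]
```

With target_length=21, an encoded sequence of 20 or 22 residues is mapped to the same 21-dimensional input as the most common sequence length.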
Evolutionary Optimization of Descriptors
Individuals were represented as vectors of twenty numerical values (genes), encoding the proteinogenic amino acids. The mutation probability for each gene was 0.01. Recombination was applied with a probability of 0.1. Recombination partners were chosen according to the fitness proportionate selection operator. The fitness of an individual was given by its BVM resistance classification performance (AUC). The GA was run thrice for 1000 generations.
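The optimization loop described above can be sketched in Python. The mutation probability (0.01), the recombination probability (0.1), the 20-gene representation and fitness proportionate selection are taken from the text. Everything else is an assumption made for illustration: the population size, the gene range [0, 1], one-point crossover, mutation by redrawing a gene, and in particular toy_fitness, which merely stands in for the AUC of a classifier trained on data encoded with the candidate descriptor.

```python
import random

N_GENES = 20            # one numerical value per proteinogenic amino acid
P_MUTATION = 0.01       # per-gene mutation probability (from the text)
P_RECOMBINATION = 0.1   # recombination probability (from the text)


def toy_fitness(individual):
    """Placeholder fitness in (0, 1]; the study used the classification AUC."""
    return 1.0 / (1.0 + sum((gene - 0.5) ** 2 for gene in individual))


def select(population, fitnesses, rng):
    """Fitness proportionate (roulette wheel) selection of one individual."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


def evolve(fitness, pop_size=20, generations=50, seed=0):
    """Run the GA and return the best individual seen and its fitness."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(N_GENES)]
                  for _ in range(pop_size)]
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        fitnesses = [fitness(ind) for ind in population]
        for ind, fit in zip(population, fitnesses):
            if fit > best_fit:
                best, best_fit = list(ind), fit
        offspring = []
        for _ in range(pop_size):
            child = list(select(population, fitnesses, rng))
            if rng.random() < P_RECOMBINATION:
                partner = select(population, fitnesses, rng)
                cut = rng.randrange(1, N_GENES)      # one-point crossover
                child = child[:cut] + partner[cut:]
            # Mutation: redraw a gene with probability P_MUTATION.
            child = [rng.random() if rng.random() < P_MUTATION else gene
                     for gene in child]
            offspring.append(child)
        population = offspring
    return best, best_fit


best, best_fit = evolve(toy_fitness)
print(len(best), round(best_fit, 3))
```

Fitness proportionate selection requires non-negative fitness values, which holds for the AUC as well as for the placeholder used here.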
The randomForest package of R was used to build all RF models used in this study. Each RF consisted of 500 decision trees that were combined by majority voting. Feature importance was assessed using the built-in function of the randomForest package and estimated by the sum of all decreases in Gini impurity, which has been shown to be more robust compared to the mean decrease in accuracy.
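To make the importance measure concrete, the following Python sketch computes the Gini impurity of a node and the impurity decrease achieved by one split. It illustrates the quantity that is accumulated per feature over all splits and trees; the exact weighting used internally by the randomForest package may differ, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease of a split, weighting children by their share of samples."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


# Toy node with two resistant (R) and two susceptible (S) samples,
# split perfectly into pure children.
parent = ["R", "R", "S", "S"]
print(gini(parent))                                     # 0.5
print(gini_decrease(parent, ["R", "R"], ["S", "S"]))    # 0.5
```

A feature that is used in many splits with large decreases accumulates a high importance score.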
Classifier ensembles CE1 and CE2 were constructed from the set of 531 single classifiers based on the different descriptors in the AAindex database. Single classifiers with an AUC > 0.93 were included in CE1, and those with an AUC > 0.94 were included in CE2. The votes of the single classifiers within CE1 and CE2 were combined by applying simple methods such as min, max, product and mean to reach a final decision. In addition, the classifier ensembles were stacked using a random forest model trained on the outcomes of the single classifiers.
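The ensemble construction and the simple fusion rules can be sketched in Python as follows. The thresholds 0.93 and 0.94 are from the text; the descriptor names, AUC values and votes in the example are invented for illustration, and the helper names select_members and fuse are hypothetical.

```python
from math import prod

# Simple fusion rules applied to the outputs of the single classifiers.
FUSION_RULES = {
    "min": min,
    "max": max,
    "product": prod,
    "mean": lambda votes: sum(votes) / len(votes),
}


def select_members(auc_by_classifier, threshold):
    """Keep every single classifier whose AUC exceeds the threshold."""
    return sorted(name for name, value in auc_by_classifier.items()
                  if value > threshold)


def fuse(votes, rule="mean"):
    """Combine the votes of the ensemble members into one score."""
    return FUSION_RULES[rule](votes)


# Invented AUC values for three hypothetical descriptor-based classifiers.
aucs = {"descriptor_a": 0.945, "descriptor_b": 0.935, "descriptor_c": 0.90}
print(select_members(aucs, 0.93))  # ['descriptor_a', 'descriptor_b']
print(select_members(aucs, 0.94))  # ['descriptor_a']

# Invented votes (probability of resistance) for one sequence.
print(fuse([0.9, 0.6, 0.3], "min"))  # 0.3
print(fuse([0.9, 0.6, 0.3], "max"))  # 0.9
```

Stacking differs from these rules in that the combination itself is learned, here by a random forest trained on the member outputs.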
The evolutionary optimization of classifier ensembles was similar to that of a descriptor set, described earlier. The fitness of an ensemble was represented by the resulting performance of that ensemble on the BVM resistance prediction (represented by the AUC). An individual consisted of a set of unique classifiers. Possible classifiers included artificial neural networks, support vector machines, k-nearest neighbors, decision trees and random forests. The minimum and maximum number of classifiers within an ensemble was set to 2 and 10, respectively. Mutations, e.g. insertions or deletions of classifiers, as well as changes to parameters in the specific machine-learning methods, were set to occur at a rate of 0.2. Each of the 100 generations comprised 15 individuals. The resulting classifier ensemble was termed CE.GA. In addition, a second ensemble including only RF models was created applying the same parameters and termed CE.GA.RF.
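A mutation operator for such ensembles might look like the following Python sketch. The rate of 0.2 and the size limits of 2 and 10 are from the text. The representation of an individual as a set of classifier labels, the equal chance of insertion and deletion, and the candidate pool are assumptions; parameter mutations of individual classifiers are omitted.

```python
import random


def mutate(ensemble, pool, rng, rate=0.2, min_size=2, max_size=10):
    """Insert or delete one classifier with probability 'rate'.

    ensemble: set of unique classifier labels (one individual).
    pool:     all candidate classifier labels (hypothetical).
    """
    members = set(ensemble)
    if rng.random() >= rate:
        return members  # no mutation this time
    candidates = sorted(set(pool) - members)
    can_insert = len(members) < max_size and bool(candidates)
    can_delete = len(members) > min_size
    if can_insert and (not can_delete or rng.random() < 0.5):
        members.add(rng.choice(candidates))
    elif can_delete:
        members.remove(rng.choice(sorted(members)))
    return members


# Hypothetical pool of twelve candidate classifiers.
pool = ["clf%d" % i for i in range(12)]
rng = random.Random(0)
individual = {"clf0", "clf1"}
for _ in range(100):
    individual = mutate(individual, pool, rng)
print(sorted(individual))
```

Because members is a set, the uniqueness constraint on classifiers within an individual is maintained automatically.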
Homology models of all p2 sequences were built based on the NMR structure of the p2 α-helix using Modeller 9.8. The electrostatic hull, representing the discretized electrostatic potential φ(r) above the solvent-accessible surface, was calculated as described in the original publications. The resulting hull, calculated for each p2 model, consisted of 200 φ(r)-values at a distance of approximately 0.6 nm above the solvent-accessible surface. Electrostatic potential vectors of the form (φ(r_1), ..., φ(r_200)) were then used to train initial RF models. To cope with the unfavorable ratio of samples (n = 155) and features (p = 200), a feature selection scheme was applied. The most important features, i.e. φ(r), as estimated by the RF internal importance analysis, were averaged over ten RF models and sorted in descending order. In an iterative manner, RF models were then built using feature subsets, starting with the most important and adding one additional feature per round. In each round the AUC was calculated.
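The feature selection scheme can be outlined in Python as below. This is a sketch of the control flow only: evaluate is a hypothetical callback that would train an RF on the given feature subset and return its AUC, and the importance values in the example are invented.

```python
def rank_features(importance_runs):
    """Average importances over several runs and rank features, best first."""
    n_runs = len(importance_runs)
    n_features = len(importance_runs[0])
    mean_importance = [sum(run[j] for run in importance_runs) / n_runs
                       for j in range(n_features)]
    return sorted(range(n_features),
                  key=lambda j: mean_importance[j], reverse=True)


def incremental_selection(ranked, evaluate):
    """Evaluate nested subsets: top 1, top 2, ..., all features.

    evaluate: hypothetical callback mapping a list of feature indices
              to a performance score such as the AUC.
    """
    return [(k, evaluate(ranked[:k])) for k in range(1, len(ranked) + 1)]


# Invented importances of three features from two runs.
runs = [[0.1, 0.5, 0.2],
        [0.3, 0.7, 0.0]]
ranked = rank_features(runs)
print(ranked)  # [1, 0, 2]

# Dummy evaluation that just reports the subset size.
print(incremental_selection(ranked, evaluate=len))  # [(1, 1), (2, 2), (3, 3)]
```

The subset size with the highest score can then be read off the returned list.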
Results and Discussion
In order to test the predictive performance of a structural classifier, we calculated the electrostatic potential resulting from p2 sequences as proposed by Dybowski et al. This structural classifier based on the electrostatic potential (RF.ESP) reached an AUC of 0.810 ± 0.008. A subsequent model based on the results of the feature selection described in Material and methods yielded an AUC of 0.898 ± 0.006 using the 32 most important variables according to the RF importance measure. There are different explanations for the inferiority of this structural classifier: (A) At least some of the drug resistance mechanisms witnessed here are not driven by charge. In comparison, a sequence classifier based on the amino acid net charge descriptor reached an AUC of 0.625 ± 0.000. (B) Inaccurate modeling due to limited sequence length. The influence of neighboring residues (primary or tertiary structure) on the electrostatic potential is neglected. (C) Errors in the template structure. Worthylake et al. suggested that the alpha helix formed by the p2 sequence is less stable, in contrast to the p2 structure of Morellet et al. The stability of the p2 alpha helix might be overestimated because of a high trifluoroethanol concentration used in the experiments. A wrong template structure might ultimately lead to unnatural side-chain placement. At least the second point also applies to the sequence-based classifiers.
HIV-1 drug resistance is a major obstacle in achieving sustained suppression of viral replication in chronically infected patients. The emergence of drug resistance as well as more and more individualized antiretroviral treatment regimens lead to the need for developing new antiretroviral agents for routine clinical practice. BVM was the first drug of the new class of maturation inhibitors of HIV-1 entering clinical trials. Baseline BVM resistance of about 30% in treatment-naïve HIV-1 isolates and of about 50% in protease inhibitor-resistant HIV-1 isolates hampered the usage of BVM in routine antiretroviral therapy regimens. Nevertheless, new drugs of this class targeting the p24/p2 junction, e.g. Vivecon (MPC-9055), are already under development and might enlarge the arsenal of antiretroviral drugs in the future. Therefore, highly specific and accurate tools to predict resistance to maturation inhibitors can help to identify patients who might benefit from the usage of these new drugs.
In the current study, we applied several techniques to improve Bevirimat resistance prediction from p2 sequences of HIV-1. Based on our recently published results, we were able to improve resistance prediction with well-chosen descriptors and classifier ensembles. It turned out that combining structural and sequence information can lead to improved prediction performance, as already discussed by Dybowski et al. for co-receptor usage prediction of HIV-1. Combining well-chosen sequence-based descriptors also leads to better prediction performance, with no significant differences compared to the combined structure-sequence classifiers. However, combining a large number of classifiers is not useful, as it can lead to a drop in prediction performance, as demonstrated for CE1. As already shown in other studies, combining classifiers via stacking does not seem to improve prediction performance.
This work was supported by the Deutsche Forschungsgemeinschaft (SFB/Transregio 60).
- Salzwedel K, Martin D, Sakalian M: Maturation inhibitors: a new therapeutic class targets the virus structure. AIDS Rev. 2007, 9: 162-172.
- Keller PW, Adamson CS, Heymann JB, Freed EO, Steven AC: HIV-1 maturation inhibitor bevirimat stabilizes the immature Gag lattice. J Virol. 2011, 85 (4): 1420-1428. 10.1128/JVI.01926-10.
- Beerenwinkel N, Schmidt B, Walter H, Kaiser R, Lengauer T, Hoffmann D, Korn K, Selbig J: Geno2pheno: Interpreting Genotypic HIV Drug Resistance Tests. IEEE Intelligent Systems. 2001, 16: 35-41. 10.1109/5254.972080.
- Beerenwinkel N, Schmidt B, Walter H, Kaiser R, Lengauer T, Hoffmann D, Korn K, Selbig J: Diversity and complexity of HIV-1 drug resistance: a bioinformatics approach to predicting phenotype from genotype. Proc Natl Acad Sci USA. 2002, 99 (12): 8271-8276. 10.1073/pnas.112177799.
- Draghici S, Potter RB: Predicting HIV drug resistance with neural networks. Bioinformatics. 2003, 19: 98-107. 10.1093/bioinformatics/19.1.98.
- Rhee SY, Taylor J, Wadhera G, Ben-Hur A, Brutlag DL, Shafer RW: Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proc Natl Acad Sci USA. 2006, 103 (46): 17355-17360. 10.1073/pnas.0607274103.
- Kierczak M, Ginalski K, Dramiński M, Koronacki J, Rudnicki W, Komorowski J: A Rough Set-Based Model of HIV-1 Reverse Transcriptase Resistome. Bioinform Biol Insights. 2009, 3: 109-127.
- Murray RJ, Lewis FI, Miller MD, Brown AJ: Genetic basis of variation in tenofovir drug susceptibility in HIV-1. AIDS. 2008, 22 (10): 1113-23. 10.1097/QAD.0b013e32830184a1.
- Dybowski JN, Heider D, Hoffmann D: Prediction of co-receptor usage of HIV-1 from genotype. PLoS Comput Biol. 2010, 6 (4): e1000743. 10.1371/journal.pcbi.1000743.
- Heider D, Verheyen J, Hoffmann D: Predicting Bevirimat resistance of HIV-1 from genotype. BMC Bioinformatics. 2010, 11: 37. 10.1186/1471-2105-11-37.
- Breiman L: Random Forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324.
- Kyte J, Doolittle R: A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982, 157: 105-132. 10.1016/0022-2836(82)90515-0.
- Chou KC, Tomasselli AG, Reardon IM, Heinrikson RL: Predicting human immunodeficiency virus protease cleavage sites in proteins by a discriminant function method. Proteins. 1996, 24: 51-72. 10.1002/(SICI)1097-0134(199601)24:1<51::AID-PROT4>3.0.CO;2-R.
- Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP: Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci. 2003, 43: 1947-1958. 10.1021/ci034160g.
- Nanni L, Lumini A: Using ensembles of classifiers for predicting HIV protease cleavage sites in proteins. Amino Acids. 2009, 36: 409-416. 10.1007/s00726-008-0076-z.
- Wong C, Li Y, Lee C, Huang CH: Ensemble learning algorithms for classification of mtDNA into haplogroups. Briefings in bioinformatics. 2010, 12: 1-9.
- Wolpert D: Stacked generalization. Neural Networks. 1992, 5: 241-260. 10.1016/S0893-6080(05)80023-1.
- Kuncheva LI, Jain LC: Designing Classifier Fusion Systems by Genetic Algorithms. IEEE Transactions on Evolutionary Computation. 2000, 4 (4): 327-336. 10.1109/4235.887233.
- Gabrys B, Ruta D: Genetic algorithms in classifier fusion. Applied Soft Computing. 2006, 6 (4): 337-347. 10.1016/j.asoc.2005.11.001.
- Nanni L, Lumini A: A genetic approach for building different alphabets for peptide and protein classification. BMC bioinformatics. 2008, 9: 45. 10.1186/1471-2105-9-45.
- Gronwald W, Hohm T, Hoffmann D: Evolutionary Pareto-optimization of stably folding peptides. BMC Bioinformatics. 2008, 9: 109. 10.1186/1471-2105-9-109.
- Kernytsky A, Rost B: Using genetic algorithms to select most predictive protein features. Proteins. 2009, 75: 75-88. 10.1002/prot.22211.
- Pyka M, Heider D, Hauke S, Kircher T, Jansen A: Dynamic causal modeling with genetic algorithms. J Neurosci Methods. 2011, 194 (2): 402-406. 10.1016/j.jneumeth.2010.11.007.
- Fawcett T: An introduction to ROC analysis. Pattern Recognition Letters. 2006, 27: 861-874. 10.1016/j.patrec.2005.10.010.
- Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics. 2005, 21 (20): 3940-3941. 10.1093/bioinformatics/bti623.
- Wilcoxon F: Individual comparisons by ranking methods. Biometrics. 1945, 1: 80-83. 10.2307/3001968.
- Demsar J: Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research. 2006, 7: 1-30.
- Heider D, Hoffmann D: Interpol: An R package for preprocessing of protein sequences. BioData Mining. 2011, 4: 16. 10.1186/1756-0381-4-16.
- Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M: AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008, 36 (Database issue): D202-D205.
- Heider D, Verheyen J, Hoffmann D: Machine learning on normalized protein sequences. BMC Research Notes. 2011, 4: 94. 10.1186/1756-0500-4-94.
- Liaw A, Wiener M: Classification and Regression by randomForest. R News. 2002, 2 (3): 18-22.
- R Development Core Team: R: A Language and Environment for Statistical Computing. 2006, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0.
- Calle ML, Urrea V: Letter to the Editor: Stability of Random Forest importance measures. Briefings in bioinformatics. 2010, 12: 86-89.
- Karatzoglou A, Smola A, Hornik K, Zeileis A: kernlab - An S4 Package for Kernel Methods in R. Journal of Statistical Software. 2004, 11 (9): 1-20.
- Morellet N, Druillennec S, Lenoir C, Bouaziz S, Roques B: Helical structure determined by NMR of the HIV-1 (345-392)Gag sequence, surrounding p2: Implications for particle assembly and RNA packaging. Protein Science. 2004, 14: 375-386.
- Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993, 234 (3): 779-815. 10.1006/jmbi.1993.1626.
- Heidema AG, Boer JMA, Nagelkerke N, Mariman ECM, van der ADL, Feskens EJM: The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 2006, 7: 23.
- Qian N, Sejnowski TJ: Predicting the secondary structure of globular proteins using neural network models. Journal of molecular biology. 1988, 202 (4): 865-84. 10.1016/0022-2836(88)90564-5.
- Naderi-Manesh H, Sadeghi M, Arab S, Movahedi AAM: Prediction of protein surface accessibility with information theory. Proteins. 2001, 42: 452-459. 10.1002/1097-0134(20010301)42:4<452::AID-PROT40>3.0.CO;2-Q.
- Džeroski S, Ženko B: Is Combining Classifiers with Stacking Better than Selecting the Best One?. Machine Learning. 2004, 54 (3): 255-273.
- Ting KM, Witten IH: Stacked Generalization: when does it work?. International Joint Conference on Artificial Intelligence. 1997.
- van Baelen K, Salzwedel K, Rondelez E, Eygen VV, Vos SD, Verheyen A, Steegen K, Verlinden Y, Allaway GP, Stuyver LJ: Susceptibility of human immunodeficiency virus type 1 to the maturation inhibitor bevirimat is modulated by baseline polymorphisms in Gag spacer peptide 1. Antimicrob Agents Chemother. 2009, 53: 2185-2188. 10.1128/AAC.01650-08.
- Zhou J, Chen CH, Aiken C: Human immunodeficiency virus type 1 resistance to the small molecule maturation inhibitor 3-O-(3',3'-dimethylsuccinyl)-betulinic acid is conferred by a variety of single amino acid substitutions at the CA-SP1 cleavage site in Gag. J Virol. 2006, 80 (24): 12095-101. 10.1128/JVI.01626-06.
- Worthylake DK, Wang H, Yoo S, Sundquist WI, Hill CP: Structures of the HIV-1 capsid protein dimerization domain at 2.6 A resolution. Acta Crystallogr D Biol Crystallogr. 1999, 55 (Pt 1): 85-92.
- Verheyen J, Verhofstede C, Knops E, Vandekerckhove L, Fun A, Brunen D, Dauwe K, Wensing A, Pfister H, Kaiser R, Nijhuis M: High prevalence of bevirimat resistance mutations in protease inhibitor-resistant HIV isolates. AIDS. 2010, 24 (5): 669-673. 10.1097/QAD.0b013e32833160fa.
- Wainberg MA, Albert J: Can the further clinical development of bevirimat be justified?. AIDS. 2010, 24: 773-774. 10.1097/QAD.0b013e328331c83b.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.