Machine learning classifiers provide insight into the relationship between microbial communities and bacterial vaginosis
© Beck and Foster; licensee BioMed Central. 2015
Received: 24 February 2015
Accepted: 30 July 2015
Published: 12 August 2015
Bacterial vaginosis (BV) is a disease associated with the vaginal microbiome. It is highly prevalent and is characterized by symptoms including odor, discharge, and irritation. No single microbe has been found to cause BV. In this paper we use random forest and logistic regression classifiers to model the relationship between the microbial community and BV. We use subsets of the microbial community features to determine which features are important to the classification models.
We find that models generated using logistic regression and random forests perform nearly identically and identify largely similar important features. Only a few features are necessary to obtain high BV classification accuracy. Additionally, there appears to be substantial redundancy between the microbial community features.
These results are in contrast to a previous study in which the important features identified by the classifiers were dissimilar. This difference appears to be the result of using different feature importance measures. It is not clear whether machine learning classifiers are capturing patterns different from simple correlations.
Advances in sequencing technology allow researchers to study microbial communities in new ways. Researchers use 16S rRNA sequencing to identify the bacteria present in microbial communities. These studies have found highly complex communities composed of hundreds or thousands of different bacteria types. Some microbial communities are found in or on other organisms. Known as microbiomes, these communities have been shown to play important roles in host health and disease. For example, in humans, gut microbiomes are important parts of digestion and have been associated with obesity. Microbial communities in the lungs may exacerbate cystic fibrosis.
The vaginal microbiome is often composed of hundreds of different bacteria types, although only a few taxa may be at high abundance. The composition of the vaginal microbiome can be highly variable, both between women and through time. Additionally, the microbiome is associated with bacterial vaginosis (BV), but in ways that are not clear.
BV is a disease characterized by an overgrowth of certain microbe types in the vagina. It is highly prevalent, with estimates of affected women as high as almost 30%. Symptoms of BV include odor, discharge, and irritation. It is also associated with increased rates of preterm birth and increased susceptibility to some STDs. While no single microbial cause of BV has been found, the microbial community as a whole is associated with BV.
Researchers often use two main BV diagnostics. The Nugent score is a measure based on cell morphology that can range from 0 to 10, with a score of 7 or greater indicating BV. The Amsel criteria include a vaginal pH greater than 4.5, a positive whiff test, the presence of clue cells, and the presence of discharge. The presence of three of these four criteria indicates BV.
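As an illustration only, the two diagnostic rules described above can be written as small Python functions. The function and parameter names are hypothetical, and the sketch assumes that "three of these four criteria" means at least three; it is not the procedure used in either source study.

```python
def nugent_indicates_bv(score: int) -> bool:
    """Return True when a Nugent score (0 to 10) is 7 or greater."""
    if not 0 <= score <= 10:
        raise ValueError("Nugent score must be between 0 and 10")
    return score >= 7


def amsel_indicates_bv(ph: float, whiff_positive: bool,
                       clue_cells: bool, discharge: bool) -> bool:
    """Return True when at least three of the four Amsel criteria are met.

    Assumption: "three of four" is read as "three or more".
    """
    criteria_met = sum([ph > 4.5, whiff_positive, clue_cells, discharge])
    return criteria_met >= 3
```

For example, `amsel_indicates_bv(5.0, True, True, False)` meets three criteria, while a pH of exactly 4.5 does not count toward the total.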
Identifying the parts of the microbial community associated with BV is difficult. This is partly due to the large number of taxa found in the community and the even larger number of potential interactions between taxa. Variation in the microbial community between women and over time adds to the difficulty of the problem. Computational tools, however, may provide methods for studying these highly complex communities. In particular, machine learning methods may allow us to model complex relationships in the microbial community related to BV.
Machine learning methods are able to generate complex models describing the relationship between the microbial community and BV. Every machine learning method has a different technique for generating a classification model. However, the end result for each method is a model that classifies samples into BV categories. Two model characteristics are interesting. First, the model accuracy describes how well the model fits the data. Second, the important features of the model are those features that the model uses to classify the samples. These features allow the researcher to generate hypotheses about the underlying biology.
Previous research has found that classification models generated using genetic programming, random forests, and logistic regression classify microbial communities into BV categories with between 80% and 90% accuracy. This research identified two challenges to using machine learning classifiers to study microbial communities. First, when the classification models are deconstructed to determine which features are important to the model accuracy, each machine learning technique identifies different features. This makes it difficult to determine if the identified features are actually important, or if they are the result of technical artifacts. Additionally, it is difficult to distinguish between features that are critical to the accuracy of the classifier and features that are only marginally helpful. While an importance measure is calculated for each feature, this measure is often only effective in ranking features, rather than determining how much each feature adds to the overall accuracy.
In this study, we use subsets of the full feature set in order to address these problems. We add features sequentially to the classification models and observe how the accuracy changes. This allows us to determine how many features are necessary to obtain high classification accuracy. Additionally, we generate models using random feature subsets in order to obtain a feature importance measure that is consistent across machine learning techniques. We find that random forests and logistic regression classifiers identify largely similar microbial community features. However, it is not clear whether these methods improve upon simple correlations.
We used datasets from studies published by Ravel et al. and Srinivasan et al. The Srinivasan et al. dataset includes both Amsel BV and Nugent score BV, while the Ravel et al. dataset includes only Nugent score BV. The Nugent score is an integer value between zero and ten derived from the number of specific symptoms observed, with a score of seven to ten diagnosing the presence of BV, a score below four representing absence of BV, and intermediate scores being inconclusive for diagnosis. The Amsel diagnostic observes specific symptoms and diagnoses the presence or absence of BV from those observations. Both datasets contain patient symptom data. In particular, this made it possible for us to perform BV diagnostics for patients using Nugent scoring for data in the Ravel et al. dataset, even though they did not report Nugent scoring explicitly. In addition, both studies report the presence of menses and vaginal pH. The Srinivasan et al. study also reported extensive patient symptoms such as vaginal itching and vaginal discomfort. Both studies also present relative abundance data for OTUs identified by reference to standard databases using amplicons from 16S hypervariable regions (see papers for details). These OTUs were named by the closest taxonomic unit that matched them in standard databases, and by non-specific names (such as BVAB1, 2, and 3, which are uncharacterized clostridia-like bacteria) when necessary. Thus the input for our work includes patient symptoms, BV diagnostic data, and microbial community composition, which constitute the features for the learning algorithms we tested. This is a mixture of continuous parameters, such as population relative abundance and pH, and categorical variables, such as Nugent score and symptoms.
The Ravel et al. study includes 396 asymptomatic women, of whom 97 were BV+ using a Nugent score definition (Nugent score ≥7). The Srinivasan et al. study includes 220 women, of whom 97 were BV+ using Amsel criteria and 117 were BV+ using Nugent score. We processed the datasets using methods similar to those used previously, except that we did not collapse microbes into correlated groups.
We used two different machine learning algorithms to generate classification models: random forests (RF) and logistic regression (LR). The RF classifiers were implemented using the randomForest function in the R package randomForest. We implemented LR classifiers using the glmnet function in the R package glmnet. To identify important features of RF models, features were ranked according to their increase in node purity (INP). INP is a measure of how much each feature increases the classification accuracy of each decision tree, averaged across all trees in the ensemble. For LR, features were ranked by their mean coefficient magnitude in all cross-validation datasets divided by their standard deviation.
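The LR ranking rule can be sketched in a few lines of Python. This is a hedged illustration, not the authors' R code: the function names and the input layout (a dictionary of per-fold coefficients) are hypothetical. It also assumes that the standard deviation is taken over the signed coefficients across folds, and that a feature with zero spread scores infinity if its mean magnitude is positive and zero otherwise.

```python
from statistics import mean, stdev


def coefficient_score(coefs):
    """Mean absolute coefficient across folds divided by the fold-to-fold SD.

    Assumption: the SD is computed over the signed coefficients.
    """
    mean_magnitude = mean([abs(c) for c in coefs])
    spread = stdev(coefs)
    if spread == 0:
        # Assumed convention for perfectly stable coefficients.
        return float("inf") if mean_magnitude > 0 else 0.0
    return mean_magnitude / spread


def rank_features(coefs_by_fold):
    """Return feature names sorted from highest to lowest score.

    coefs_by_fold maps a feature name to its coefficients, one per fold.
    """
    return sorted(coefs_by_fold,
                  key=lambda name: coefficient_score(coefs_by_fold[name]),
                  reverse=True)
```

Dividing by the spread rewards features whose coefficients are both large and consistent across the cross-validation datasets, so a large but unstable coefficient ranks lower.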
In addition to the RF and LR classifiers, we also calculated reliefF rankings and correlations between the features and BV. ReliefF is a feature selection algorithm that estimates the relevance of each feature by how well it separates similar samples into classes. To calculate the reliefF rankings, we used the attrEval function in the R package CORElearn. The Pearson correlation between each feature and BV was calculated using R’s cor function.
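For readers who want to see what the correlation baseline computes, here is a minimal pure-Python sketch of the Pearson correlation between one feature and a 0/1 BV label. The function name and the 0/1 coding of BV are assumptions made for illustration; the study itself used R's cor function.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)
```

Ranking features by the absolute value of this coefficient gives the simple baseline against which the classifier-based importance measures are compared.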
To guard against overfitting, we used ten-fold cross-validation. We split each dataset randomly into ten parts. We used nine of the parts to train the classification models and the remaining part to measure the model accuracy. We repeated this process using each of the ten parts as the test dataset.
For each of the cross-validation datasets, we fit RF and LR models to the full feature set of the training data. We then calculated the importance of each feature to these models. ReliefF was used to generate a third feature ranking. We then used these rankings to select feature subsets in three different ways.
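The splitting procedure can be sketched as follows. This is an illustrative Python generator rather than the authors' R code; the function name, the seed parameter, and the choice to let fold sizes differ by at most one sample are assumptions.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # random assignment to parts
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts
    for i, test in enumerate(folds):
        # Training data is every part except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test part, so every sample is used for testing once and for training nine times.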
The first analysis selected the top N features from each of the feature rankings, where N ranged between two and 25. We refer to this analysis as “N feature” subsets below. The second analysis used a five-feature sliding window across each of the rankings. We refer to this analysis as “sliding window” subsets below. The third analysis selected the top 50 features from each ranking and combined them into a single list, from which we selected three thousand subsets of five features each at random. RF and LR classifiers were trained on each subset using the training data. The accuracy of each classifier was determined using the testing data. We refer to this analysis as “random features” below.
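The three subset strategies described above can be sketched in Python. All function names and defaults here are hypothetical. The sketch assumes the sliding window advances one rank at a time, and that duplicate features are dropped when the top-ranked lists are combined; the text does not state either detail.

```python
import random


def top_n_subsets(ranking, n_min=2, n_max=25):
    """'N feature' subsets: the top N features for each N in the range."""
    upper = min(n_max, len(ranking))
    return [ranking[:n] for n in range(n_min, upper + 1)]


def sliding_window_subsets(ranking, width=5):
    """'Sliding window' subsets: consecutive runs of `width` ranked features.

    Assumption: the window moves down the ranking one position at a time.
    """
    return [ranking[i:i + width] for i in range(len(ranking) - width + 1)]


def random_subsets(rankings, top=50, size=5, count=3000, seed=0):
    """'Random features' subsets drawn from the pooled top-ranked features.

    Assumption: a feature ranked highly by several measures enters the
    pool only once.
    """
    pool = []
    for ranking in rankings:
        for feature in ranking[:top]:
            if feature not in pool:
                pool.append(feature)
    rng = random.Random(seed)
    return [rng.sample(pool, size) for _ in range(count)]
```

Each returned subset would then be used to train a classifier on the training data and score it on the held-out data.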
The classification accuracy of each model was measured using the area under the receiver operating characteristic (ROC) curve (AUC). The ROC curve describes classifier performance on both BV-positive and BV-negative samples, and thus reflects both type 1 and type 2 errors. The AUC is often used as a summary of model accuracy.
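One way to make the AUC concrete is through its rank interpretation: it equals the probability that a randomly chosen BV-positive sample receives a higher classifier score than a randomly chosen BV-negative sample. The Python sketch below computes it by direct pairwise comparison. The function name and the 0/1 label coding are assumptions, and this quadratic-time version is meant for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for BV positive, 0 for BV negative (assumed coding).
    scores: classifier outputs, higher meaning more likely positive.
    Tied positive/negative pairs count as half a correct ordering.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    correctly_ordered = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correctly_ordered += 1.0
            elif p == n:
                correctly_ordered += 0.5
    return correctly_ordered / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to scores that carry no information about the label, and 1.0 to scores that separate the two classes perfectly.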
Results and discussion
Differences in Amsel BV and Nugent score BV are apparent from these results. The classification accuracy is higher for Nugent score BV, indicating a better model fit. This may result from a closer link between Nugent score BV and the microbial community. It may also indicate that the relationship between Nugent score BV and the microbial community is more easily captured by the classification models. In other words, there may be a strong link between the microbial community and Amsel BV, but that link is complex and not fully exploited by the models. Alternatively, the Amsel BV classification may simply include more noise or error.
The sliding window subsets for the reliefF and RF rankings show substantial stability in classification accuracy as lower ranked features replace the first few high ranked ones. This pattern appears reduced for the LR rankings. Additionally, the sliding window subsets for RF rankings generally show a consistent decrease in classification accuracy as the feature ranking decreases. The reliefF and LR rankings, however, show a more uneven decrease in accuracy with feature ranking. Similar patterns would be expected if the initial rankings were incorrect. While the reason for this poor performance is unknown, it may be partially due to sensitivity of the importance measures to sparse data.
Top 15 important features. This table shows the top 15 features ranked by classification accuracy in five-feature subsets. The ranking shown here was obtained by averaging the results for the RF and LR classifiers.
Columns: Ravel et al. Nugent; Srinivasan et al. Nugent; Srinivasan et al. Amsel.
Surviving entries, in extraction order (column and rank not recoverable): Eggerthella sp. type 1; Eggerthella sp. type 1; Dialister sp. type 2; Dialister sp. type 2; Megasphaera sp. type 1; Ruminococcaceae Incertae Sedis; Megasphaera sp. type 1.
The important feature rankings appear highly dependent on the importance measure used. While there is some overlap in the top five to ten features identified by each importance measure, there are many features ranked highly by one importance measure and not others (Additional file 3). It is not clear whether the ranking differences are due to noise or whether they may reflect some biological pattern.
In this paper we found that only a few features are necessary to generate models with high BV classification accuracy. Additionally, there appears to be substantial redundancy in the microbial features. Random feature subsets allowed us to identify microbes important to BV classification. These taxa largely agree with those identified by other studies.
It is not clear if these classifiers find patterns that are any different from simple correlations. However, machine learning methods provide important accuracy measures that may help determine the number of features that are important. They may also indicate whether interaction terms are necessary to describe the system. Feature subset analysis illuminates many patterns and characteristics of the relationships between the microbial community and community characteristics such as BV. These methods may be generally useful for studying a wide range of microbial community related diseases and phenotypes.
We would like to thank Larry Forney, Terence Soule, Mark McGuire, and Jason Moore for helpful discussions. Funding for this project was provided by the NIH INBRE award P20GM016454 and by the NSF STC award DBI0939454. Computational support was provided by NIH COBRE award P20GM16448.
- Bäckhed F, Ley RE, Sonnenburg JL, Peterson DA, Gordon JI. Host-bacterial mutualism in the human intestine. Science. 2005; 307(5717):1915–1920.
- Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, et al. A core gut microbiome in obese and lean twins. Nature. 2009; 457(7228):480–4.
- Willner D, Haynes MR, Furlan M, Schmieder R, Lim YW, Rainey PB, et al. Spatial distribution of microbial communities in the cystic fibrosis lung. ISME J. 2012; 6(2):471–4.
- Ravel J, Gajer P, Abdo Z, Schneider GM, Koenig SS, McCulle SL, et al. Vaginal microbiome of reproductive-age women. Proc Natl Acad Sci. 2011; 108(Supplement 1):4680–687.
- Gajer P, Brotman RM, Bai G, Sakamoto J, Schütte UM, Zhong X, et al. Temporal dynamics of the human vaginal microbiota. Sci Transl Med. 2012; 4(132):132–5213252.
- Koumans EH, Sternberg M, Bruce C, McQuillan G, Kendrick J, Sutton M, et al. The prevalence of bacterial vaginosis in the United States, 2001–2004; associations with symptoms, sexual behaviors, and reproductive health. Sex Transm Dis. 2007; 34(11):864–9.
- Hillier SL, Nugent RP, Eschenbach DA, Krohn MA, Gibbs RS, Martin DH, et al. Association between bacterial vaginosis and preterm delivery of a low-birth-weight infant. N Engl J Med. 1995; 333(26):1737–1742.
- Wiesenfeld HC, Hillier SL, Krohn MA, Landers DV, Sweet RL. Bacterial vaginosis is a strong predictor of Neisseria gonorrhoeae and Chlamydia trachomatis infection. Clin Infect Dis. 2003; 36(5):663–8.
- Oakley BB, Fiedler TL, Marrazzo JM, Fredricks DN. Diversity of human vaginal bacterial communities and associations with clinically defined bacterial vaginosis. Appl Environ Microbiol. 2008; 74(15):4898–909.
- Nugent RP, Krohn MA, Hillier S. Reliability of diagnosing bacterial vaginosis is improved by a standardized method of Gram stain interpretation. J Clin Microbiol. 1991; 29(2):297–301.
- Amsel R, Totten PA, Spiegel CA, Chen K, Eschenbach D, Holmes KK. Nonspecific vaginitis: diagnostic criteria and microbial and epidemiologic associations. Am J Med. 1983; 74(1):14–22.
- Beck D, Foster JA. Machine learning techniques accurately classify microbial communities by bacterial vaginosis characteristics. PLoS ONE. 2014; 9(2):87830.
- Srinivasan S, Hoffman NG, Morgan MT, Matsen FA, Fiedler TL, Hall RW, et al. Bacterial communities in women with bacterial vaginosis: high resolution phylogenetic analyses reveal relationships of microbiota to clinical criteria. PLoS ONE. 2012; 7(6):37818.
- Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002; 2(3):18–22.
- Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010; 33(1):1.
- Robnik-Šikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach Learn. 2003; 53(1–2):23–69.
- Robnik-Šikonja M, Savicky P. CORElearn: CORElearn - Classification, Regression, Feature Evaluation and Ordinal Evaluation. 2013. R package version 0.9.42. http://CRAN.R-project.org/package=CORElearn.
- Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 1997; 30(7):1145–1159.
- Fredricks DN, Fiedler TL, Marrazzo JM. Molecular identification of bacteria associated with bacterial vaginosis. N Engl J Med. 2005; 353(18):1899–1911.
- Srinivasan S, Morgan MT, Liu C, Matsen FA, Hoffman NG, Fiedler TL, et al. More than meets the eye: Associations of vaginal bacteria with Gram stain morphotypes using molecular phylogenetic analysis. PLoS ONE. 2013; 8(10):78633.
This article is published under license to BioMed Central Ltd. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.