Risk estimation using probability machines
 Abhijit Dasgupta^{1},
 Silke Szymczak^{2, 5},
 Jason H Moore^{3},
 Joan E Bailey-Wilson^{2} and
 James D Malley^{4}
DOI: 10.1186/1756-0381-7-2
© Dasgupta et al.; licensee BioMed Central Ltd. 2014
Received: 20 June 2013
Accepted: 19 February 2014
Published: 1 March 2014
Abstract
Background
Logistic regression has been the de facto, and often the only, model used in the description and analysis of relationships between a binary outcome and observed features. It is widely used to obtain the conditional probabilities of the outcome given predictors, as well as predictor effect size estimates using conditional odds ratios.
Results
We show how statistical learning machines for binary outcomes, provably consistent for the nonparametric regression problem, can be used to provide both consistent conditional probability estimation and conditional effect size estimates. Effect size estimates from learning machines leverage our understanding of counterfactual arguments central to the interpretation of such estimates. We show that, if the data generating model is logistic, we can recover accurate probability predictions and effect size estimates with nearly the same efficiency as a correct logistic model, both for main effects and interactions. We also propose a method using learning machines to scan for possible interaction effects quickly and efficiently. Simulations using random forest probability machines are presented.
Conclusions
The models we propose make no assumptions about the data structure, and capture the patterns in the data by just specifying the predictors involved and not any particular model structure. So they do not run the same risks of model misspecification and the resultant estimation biases as a logistic model. This methodology, which we call a “risk machine”, will share properties from the statistical machine that it is derived from.
Keywords
Consistent nonparametric regression; Logistic regression; Probability machine; Odds ratio; Counterfactuals; Interactions
Background
Logistic regression has been the de facto, and often the only, model used in the description and analysis of relationships between a binary outcome and observed features, both categorical and continuous. It is widely used both as an association model and as a predictive model, to examine (a) the conditional probability of the outcome given predictors, and (b) predictor effect sizes expressed as conditional odds ratios. It is also widely available in software and is easy to optimize in low-dimensional problems. However, it assumes that the data-generating model is logistic, requires an explicit specification of the model, and does not scale to the higher-dimensional problems that are so common today. Good predictions and effect size estimates from a logistic model therefore require the researcher to guess at the true data-generating model, and to specify exactly which predictors appear and how they interact with each other. If the model is misspecified, both predictions and effect size estimates may be substantially in error. In modeling terms, the challenge is to get all the main effects and interactions (2-way and higher order) correctly specified in the model; otherwise efficient and consistent estimation is not guaranteed.
We recently introduced the concept of a probability machine (PM) [1], which is simply any consistent nonparametric regression machine applied to binary or categorical outcomes. A PM produces a predicted conditional probability of success given predictors, but in a model-agnostic, data-driven fashion. The idea is that, starting from binary (0/1) outcomes for each subject, a PM generates an estimated expected value for each subject, which is just the conditional probability of success for that subject given predictors. Given today’s rich data environment, a PM has several desirable properties. It can study any list of predictors (binary, categorical, continuous), requires no explicit structural specification of the model and no specification of interactions, and is scalable to very large sets of predictors, for example over a million single nucleotide polymorphisms (SNPs) in genome-wide association studies. As a practical matter, there are many well-studied families of nonparametric learning machines that are provably consistent for the regression problem (in the limit of large data the error rate converges to the Bayes error rate) and that converge reasonably quickly. The nonparametric regression machines we have studied come from the machine learning literature, namely random forest regression [2] and nearest neighbor regression. Both of these have provable consistency properties under fairly general conditions [3–5]. There are of course other possible choices, such as some support vector machines. In this work, we focus on random forest regression used with a {0,1} outcome and call it a random forest probability machine (RFPM).
A RFPM, which is a regression random forest, is scalable to large datasets and high-dimensional problems and requires only a specification of which features are to be included in the machine rather than any explicit functional form. We have shown in our earlier work [1] that even when data is generated from a logistic model, the test set error when estimating the outcome probability of success using a RFPM often beats that using logistic regression. In this work we look at the role of logistic regression in particular and how the PM can play a similar role both in probability prediction and effect size estimation under prospective sampling schemes. We show that with data generated by a logistic model, a PM can do just as well as a correctly specified logistic regression for both problems, and can be superior when the logistic regression is not correctly specified or is inappropriate given the data-generating model. We also suggest that a PM can do more than the probability and effect estimation offered by logistic regression: it can also be used for descriptive discovery of interactions, among other descriptive analyses. Moreover, as probability machines are known to be consistent, the probability estimates and derived risk estimates will also be consistent. This also holds for interaction detection across any subsets of features. We expand on these themes in the following sections.
Methods
Conditional probability estimation
We first consider the problem of conditional probability estimation. Under prospective sampling schemes, both logistic regression (LR) and PMs can estimate the conditional probability of success given a set of features. The consistency of the LR model is established as long as the data-generating mechanism is logistic and the correct model is specified. Under model misspecification, it is known that the LR model is no longer consistent. As practitioners know, establishing the adequacy of a particular logistic model for a data set is not easy, and no tests to check the validity of the logistic link are routinely available. We believe that any logistic regression model fit to a data set is likely to be incorrect in its specification, though in some cases we may be more confident based on other knowledge about the data-generating mechanism. The RFPM, on the other hand, is known to be consistent under more general conditions [3, 5] and thus can produce valid estimates of the conditional probability under a wider variety of conditions. We note here that we will use a random forest of regression trees, where each component tree provides a probability estimate of success (denoted in [1] as regRF). These individual probability estimates are then averaged across the trees to provide the regRF estimate. The random forest methodology can also provide a self-declared “probability estimate” when used as a classifier, which is nothing but the proportion of component trees that classified the result as a success. We found in [1] that this method, there denoted by classRF, is far less efficient at probability estimation, and the consistency of the probability estimates thus produced has not been demonstrated in the literature.
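The difference between the two forest outputs described above can be illustrated with a minimal Python sketch. The per-tree numbers are hypothetical, and treating a tree's class vote as "estimate above 0.5" is an assumption made for illustration, not a description of how the randomForest package forms its votes.

```python
from statistics import mean

def reg_rf_estimate(tree_probs):
    """regRF-style estimate: average the per-tree probability estimates."""
    return mean(tree_probs)

def class_rf_estimate(tree_probs, threshold=0.5):
    """classRF-style estimate: proportion of trees that vote 'success'.

    Assumption: a tree votes 'success' when its own estimate exceeds threshold.
    """
    votes = [1 if p > threshold else 0 for p in tree_probs]
    return mean(votes)

# Hypothetical per-tree estimates for one subject (illustrative numbers only).
tree_probs = [0.6, 0.7, 0.55, 0.4]
reg_estimate = reg_rf_estimate(tree_probs)      # mean of the four estimates
class_estimate = class_rf_estimate(tree_probs)  # share of trees above 0.5
```

In this toy case the vote proportion discards how far each tree's estimate lies from 0.5, which gives some intuition for why the two outputs can differ.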
Simulations
The following three methods are compared in the simulations:
 1. A main effects logistic regression using all available features (LR1)
 2. A logistic regression with main effects and all possible two-way interactions using all available features (LR2)
 3. A random forest regression using all available features (RFPM, denoted as RF in the graphs)
Table 1. Summary of logistic regression models used for simulation studies
Model | Description                       | Conditional odds ratios
1     | 3 main effects                    | 1.3, 1.7, and 2.5
2     | 3 main effects and 2 interactions | 1.3, 1.7, 2.5 (main effects); 2, 5 (interactions)
3     | See Table 2                       |
4     | 2 main effects                    | 1.3, 2 (main effects)
5     | 2 main effects + 1 interaction    | 1.3, 2 (main effects); 2 (interaction)
Table 2. Structure of the model in model 3
Stratum     | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8
X_{1}       | 0     | 1     | 0     | 0     | 1     | 1     | 0     | 1
X_{2}       | 0     | 0     | 1     | 0     | 1     | 0     | 1     | 1
X_{3}       | 0     | 0     | 0     | 1     | 0     | 1     | 1     | 1
Probability | 0.300 | 0.176 | 0.563 | 0.391 | 0.563 | 0.096 | 0.794 | 0.836
LR1 is the usual model that is tried in most situations, since interactions are harder to estimate and require more data for adequate power and efficiency. LR2 is a model that accounts for 2-way interactions in a noncommittal manner. A priori, we do not know which interactions are actually present, and so we interrogate all possible such interactions. Note that this is not a fully saturated model, since that would include all higher order interactions as well. For data with a large number of available features, the LR2-type model quickly becomes infeasible, since the number of parameters grows at the rate p^{2}, where p is the number of available features. The RFPM only needs to be told which features to include, and will consider complex interactions implicitly, as we will see. It will also not break down as quickly as p increases, and is used routinely with p as large as 100 K or 1 M or even larger; we have experience using it with over 100 million features in a genomic data set [6].
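The growth in the number of parameters can be made concrete with a short Python sketch. The function names are hypothetical, and the saturated count assumes that all p features are binary.

```python
from math import comb

def n_params_lr1(p):
    """LR1: intercept plus p main effects."""
    return 1 + p

def n_params_lr2(p):
    """LR2: intercept, p main effects, and all p-choose-2 two-way interactions."""
    return 1 + p + comb(p, 2)

def n_params_saturated(p):
    """Fully saturated model for p binary features: one parameter per cell."""
    return 2 ** p

# Parameter counts for a few feature-set sizes.
counts = {p: (n_params_lr1(p), n_params_lr2(p)) for p in (3, 10, 100, 1000)}
```

With p = 10 features, LR1 has 11 parameters and LR2 has 56; with p = 1000, LR2 already has more than half a million.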
In all our simulations, we use the randomForest package ([7], version 4.6) in R, with the tuning parameter nodesize (minimum size of the terminal nodes) set at 5% of the total sample size, and mtry equal to the number of features. We experimented with smaller mtry values but found that setting mtry to the number of features gives the best results. We also experimented with setting the number of trees in the random forest anywhere from 20 to 1000, and did not see much difference in the results over this span. Our reported simulations use 100 trees per random forest. We also ran simulations in Python using the random forest function from the scikit-learn package [8], which gave similar results to the R runs, albeit at slightly slower speed; only results from R are therefore reported here. The code to run these simulations, in either R or Python, is available upon request.
Model 2 is the same as Model 1, except that there are interactions between X_{1} and X_{2} and between X_{2} and X_{3}, with interaction odds ratios of 2 and 5, respectively. The quality of the estimated probabilities of success, conditional on the predictors, was evaluated using bias and efficiency. Bias is measured by the difference between the true conditional probability used to generate the data sets and the average individual prediction over the simulated data sets. Efficiency in predicting the individual conditional probability is measured by the width of the interval defined by the 5th and 95th percentiles of the simulated distribution of predictions for each individual.
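The two evaluation measures just described might be computed as in the following Python sketch. The helper names are hypothetical, the percentile uses linear interpolation (one of several common conventions), and the sign of the bias (average prediction minus truth) is an assumption, since the text only speaks of a difference.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def bias_and_interval_width(true_prob, predictions):
    """Bias and 5th-95th percentile width for one individual.

    predictions holds that individual's predicted probability in each
    simulated data set; true_prob is the generating probability.
    """
    avg = sum(predictions) / len(predictions)
    bias = avg - true_prob  # assumed sign convention
    width = percentile(predictions, 95) - percentile(predictions, 5)
    return bias, width

# Illustrative numbers only: three replicate predictions for one individual.
bias, width = bias_and_interval_width(0.30, [0.28, 0.30, 0.32])
```

A narrower interval indicates a more efficient estimator of that individual's conditional probability.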
Counterfactual machines
Counterfactual arguments are the basis of our interpretation of regression coefficients. Regression coefficients are interpreted to be the average change in outcome when a predictor changes by one unit, all other predictors remaining the same. In other words, we are asking what could happen if we change just one factor, everything else being equal. We argue that, using probability machines, we can look directly at counterfactual outcomes in the context of binary outcomes to see the counterfactual change in success probability for each individual when one binary predictor is changed. The core idea can be extended easily to continuous outcomes and conceptually to continuous predictors, which we will present in a future manuscript.
Conceptually, we can consider two groups of individuals where each individual in one group is identical to an individual in the other group, except for the value of one feature X_{1}. If this were possible, we could directly observe the changes in outcome in each person resulting from changing the value of X_{1} by merely observing their corresponding doppelganger in the other group. Such experiments are carried out regularly in, for example, comparing wild-type clonal mice versus mice where a particular gene is knocked out, while treating both groups the same. Such experiments are of course not possible in human population studies. We can, however, get a very good sense of how one’s doppelganger would behave using predictive models. We train two predictive models, one for individuals with X_{1} = 0 and one for individuals with X_{1} = 1. Each captures the feature landscape, so to speak, and its relationship with the outcome in each subgroup. Now, if we predict the outcome of an individual with X_{1} = 1 using the predictive model trained on the X_{1} = 0 subgroup, it would be as if we transplanted this individual into the feature landscape of the X_{1} = 0 group, and the prediction would reflect the relationship between this feature landscape and the outcome, preserving all the other feature information about this individual. In other words, we can mimic the behavior of this individual’s conceptual doppelganger, and the difference between an individual’s observed outcome and their predicted outcome using a predictive model trained on the other group would be the counterfactual effect of X_{1} on that individual.
We can operationalize the description above using RFPMs. Suppose we want to predict the counterfactual outcomes for each individual when the value of the binary predictor X_{1} is changed. We split the dataset into two subgroups D_{0} and D_{1} based on whether X_{1} is 0 or 1. We now train identically specified RFPMs on each subgroup, calling the trained RFPMs PM_{0} and PM_{1} respectively. Now, for binary data, we can’t directly observe the probability of success, but we can estimate it based on our models. For an individual with X_{1} = 0, their “observed” probability of success would be the prediction from PM_{0} and their counterfactual probability of success would be their prediction from PM_{1}. We can similarly compute the “observed” and counterfactual probabilities of success of an individual with X_{1} = 1 using predictions from PM_{1} and PM_{0} respectively. Thus, for each individual, we can compute under the RFPM model a probability p_{0} and a probability p_{1} of success under the conditions X_{1} = 0 and X_{1} = 1 respectively.
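The split, train, and cross-predict steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the cell-mean estimator is a toy stand-in for the probability machine, not a random forest; both function names are hypothetical; and both subgroups are assumed to be non-empty.

```python
from collections import defaultdict

def fit_cell_mean_machine(rows, outcomes):
    """Toy stand-in for a probability machine (NOT a random forest).

    Predicts the mean outcome among training rows with identical features,
    falling back to the overall training mean for unseen feature patterns.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(rows, outcomes):
        sums[tuple(x)] += y
        counts[tuple(x)] += 1
    overall = sum(outcomes) / len(outcomes)

    def predict(x):
        key = tuple(x)
        return sums[key] / counts[key] if counts.get(key) else overall

    return predict

def two_machine_probabilities(rows, outcomes, j, fit=fit_cell_mean_machine):
    """Return a (p0, p1) pair per individual for the binary feature in column j.

    The data are split on feature j, one machine is trained per subgroup on
    the remaining features, and every individual is scored by both machines.
    """
    def drop(x):
        return [v for k, v in enumerate(x) if k != j]

    machines = {}
    for level in (0, 1):
        idx = [i for i, x in enumerate(rows) if x[j] == level]
        machines[level] = fit([drop(rows[i]) for i in idx],
                              [outcomes[i] for i in idx])
    return [(machines[0](drop(x)), machines[1](drop(x))) for x in rows]

# Tiny made-up data set: two binary features, six subjects.
rows = [[0, 0], [0, 0], [0, 1], [1, 0], [1, 0], [1, 1]]
outcomes = [0, 1, 1, 1, 1, 0]
probs = two_machine_probabilities(rows, outcomes, j=0)
```

Any function with the same fit-then-predict shape, for example a wrapper around a regression random forest, could be passed as `fit` in place of the toy estimator.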
Conceptually there is nothing in this operationalization that limits us to RFPMs or even PMs. You could do the same exercise using a logistic model as well. However, a logistic model would just give back estimates reflective of the chosen model structure provided it is the correct model; it would merely be a self-fulfilling exercise. If the logistic model is misspecified, this exercise will give you estimates that are different from what the model provides. PMs provide nonparametric estimates without a particular model structure, so this exercise can help find dependency patterns in the data without assumptions about particular links or particular structural constraints like linearity.
Risk machines: generating risk effect estimates using probability machines
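From the probabilities p_{0} and p_{1} obtained for each individual, we can compute individual-level estimates of:
 1. the risk difference (RD): p_{1} - p_{0}
 2. the risk ratio (RR): p_{1}/p_{0}
 3. the odds ratio (OR): p_{1}(1 - p_{0})/[p_{0}(1 - p_{1})]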
We can compute group-specific estimates of each of these functions by averaging (mean or median) the individual estimates over the members of the group; the overall conditional estimates or main effects estimates are obtained by averaging over the entire study. Note that since the estimation targets the feature-specific counterfactual probabilities per subject, we are free to choose any function of them for our risk estimates. We call this method generally the “two-machine method”, since we need to train 2 machines, one on each subgroup defined by the predictor of interest X_{1}.
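As a minimal Python sketch, the three risk functions and their group-level averages might be written as below. The function names and the example probabilities are hypothetical, and the code assumes that every probability lies strictly between 0 and 1 so that no division by zero occurs.

```python
from statistics import mean, median

def risk_difference(p0, p1):
    return p1 - p0

def risk_ratio(p0, p1):
    return p1 / p0

def odds_ratio(p0, p1):
    return (p1 * (1 - p0)) / (p0 * (1 - p1))

def summarize(pairs, effect=odds_ratio, average=mean):
    """Average an individual-level effect over a group of (p0, p1) pairs."""
    return average([effect(p0, p1) for p0, p1 in pairs])

# Illustrative (p0, p1) pairs for three individuals in one group.
pairs = [(0.2, 0.5), (0.2, 0.5), (0.25, 0.4)]
mean_or = summarize(pairs)
median_rd = summarize(pairs, effect=risk_difference, average=median)
```

Restricting `pairs` to the members of a subgroup gives the group-specific estimate; passing the whole study gives the overall estimate.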
Often we are not interested in the overall main effect but in subject-specific effects, which can lead to discovery as well as estimation of interactions. In fact, this is probably the more frequent case. That is, the assumption that the effect of a feature is unaffected by all other features (a pure main effects model) needs to be tested before it is believed. Unfortunately, computational difficulties in assessing complex interactions over a large number of features under a logistic regression paradigm have made the main effects model the de facto standard rather than something to be validated. Our scheme allows us to easily obtain odds ratio estimates, or other risk function estimates, for subgroups defined by a second feature or a set of features by merely averaging the individual odds ratio estimates over each subgroup. This would then enable a very easy computation of the interaction odds ratio. In this paper, we present a second method for interaction estimation that directly leverages the counterfactual argument.
Consistency of each machine on its defining data set implies consistency of the risk estimates. Discrete features in the data with more than the two levels [Yes, No] lead to finitely more machines, yet the new calculations are straightforward. Features with continuous values could be studied by binning the exposure data, but this approach too often imposes an unacceptable loss of information and so requires further study.
Simulations
We followed the simulation setup described earlier and considered scenarios with only main effects. Main effects odds ratios for each simulated data set were computed directly from each logistic regression model (LR1 and LR2); for the RFPM, subject-specific odds ratio estimates were obtained using the two-machine counterfactual method for each predictor, and the main effect odds ratio estimates were obtained by averaging the individual odds ratio estimates. We report the results of the simulation study using Model 1, which has 10 independent binary predictors, of which three have non-null main effects, and no interactions between the predictors.
The main message here is that the two-machine counterfactual method can reproduce the true odds ratios from the simulation in an unbiased manner, with efficiency not much worse than that of the logistic main effects model, which is the data-generating model. A logistic model that is misspecified for the data-generating model produces both inaccurate predictions and biased effect estimates, even if the correct predictors are included. Moreover, it is difficult to assess whether the model is misspecified without further modeling and testing, a fact that is often unaccounted for in deriving inference from the final model. The RFPM model in which the correct predictors are included accounts for the patterns in the data to provide accurate predictions and individual effect size estimates. Aggregate estimates of main effects and interactions, and exploration of whether interactions are present, can be done based on a single modeling run. The advantage of the risk machine method is that it is not constrained by having to guess the data-generating model. In fact, in the simulation setting in Figure 3, where each of 8 predictor subgroups has a unique success probability, generated by a fully saturated model, the risk machine can estimate the subgroup-specific odds ratios accurately by using an appropriate number of counterfactual machines (in this case, 2^{3} = 8 machines) or by averaging the individual odds ratio estimates over the appropriate subgroups from a single machine run.
Interaction detection and estimation
Interaction estimation
It is sometimes thought that nonparametric statistical learning machines cannot accurately estimate effect sizes. As shown above using the two-machine method, consistency of the probability machine, and therefore of the risk machine, demonstrates otherwise. It is effective as a practical method for main effects odds ratio estimation, given logistic regression data over binary predictors. Of course the multiple machine method for risk effect estimation is not restricted to logistic regression data, and can be applied to any regression problem with binary outcomes.
We now consider the problem of interactions. We will start by describing estimation of multiplicative interactions, as seen in the logistic model context, for binary predictors. We introduce a “4-machine method”, analogous to the earlier 2-machine method, to estimate interaction effects. This works as follows. To estimate the interaction effect due to two binary predictors X_{1} and X_{2}, each taking values in {0,1}, partition the data into the four subsets defined by the possible values of the pair, (X_{1}, X_{2}) = (0, 0), (0, 1), (1, 0) or (1, 1). Fit a probability machine to each group separately. Note that within each subgroup the values of X_{1} and X_{2} remain fixed, so the machine will not use the specific values of either feature. The separate machines are constructed independently. They don’t see each other or the data for the other groups: the separate model predictions for the probability of success in the outcome are therefore conditional on the particular combination of the predictors (X_{1}, X_{2}). We now estimate the probability of success of each individual using each of the 4 machines, arriving at a vector of probabilities (p_{00}, p_{01}, p_{10}, p_{11}) for each individual. One of these represents the observed probability and the others represent counterfactual probabilities under the other (X_{1}, X_{2}) combinations. It is now straightforward to compute the interaction ratio for each individual, which is the ratio of the odds ratios p_{11}(1 - p_{10})/[(1 - p_{11})p_{10}] and p_{01}(1 - p_{00})/[(1 - p_{01})p_{00}]. The overall interaction ratio can be obtained by averaging these individual ratios over the study population. This is in fact exactly analogous to the interaction effect estimated in a logistic regression, apart from the log transformation applied to the odds ratios that is standard in logistic regression.
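The individual interaction ratio can be checked numerically with a small Python sketch. The function names are hypothetical, and the logistic coefficients (including the intercept of -1.0) are illustrative assumptions used only to generate four probabilities with a known interaction odds ratio of 2.

```python
import math

def odds(p):
    return p / (1 - p)

def interaction_ratio(p00, p01, p10, p11):
    """Ratio of the X2 odds ratio at X1 = 1 to the X2 odds ratio at X1 = 0.

    The first subscript is the value of X1 and the second the value of X2.
    """
    or_when_x1_is_1 = odds(p11) / odds(p10)
    or_when_x1_is_0 = odds(p01) / odds(p00)
    return or_when_x1_is_1 / or_when_x1_is_0

def expit(z):
    return 1 / (1 + math.exp(-z))

# Assumed logistic model: main effect odds ratios 1.3 and 2, interaction 2.
b0, b1, b2, b12 = -1.0, math.log(1.3), math.log(2.0), math.log(2.0)
p = {(a, b): expit(b0 + b1 * a + b2 * b + b12 * a * b)
     for a in (0, 1) for b in (0, 1)}
ratio = interaction_ratio(p[0, 0], p[0, 1], p[1, 0], p[1, 1])
```

Under a logistic model the ratio equals the exponentiated interaction coefficient, which is the sense in which the quantity is analogous to the logistic interaction effect.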
We note here that we need to run 4 machines to estimate each interaction, in addition to an overall machine that could be used to identify strongly predictive features. This is obviously burdensome in many current contexts. Our philosophy is not to estimate all possible interactions, but to identify particular “interesting” interactions based either on biology, previous results, or the predictive power of the features based on a global run. We note here that creating 4 subgroups may produce subgroups with a wide range of sample sizes even with moderately unbalanced predictors. Working with unbalanced predictors is not a problem when using regression machines since we are estimating the expectation of a binary outcome. We have run several simulations to satisfy ourselves that this theoretical result holds. However, due to potential loss of sample size in each stratum, there may be the need to adjust the parameters of the machine, such as the terminal node size in a random forest, to accommodate the practical issues of fitting the model accurately.
Intuitive model-free interaction detection
We now present a method for exploring the presence of interactions from a single machine run. This method cannot be used directly for estimation, though the derived estimates might be reasonably close to the truth, since accounting for counterfactual effects is not done. We present this method using RFPM as our representative probability machine, though other machines can be used just as effectively.
We will use RFPM to discover interactions visually and intuitively, without invoking parametric models, using a single global machine run. We have seen that fitting a RFPM to binary outcome regression data provides consistent estimates of the probability Prob(Y = 1 | X). A single run gives only predictions of the conditional probabilities and not any counterfactual predictions; it is not a counterfactual machine.
For a pair of binary predictors X_{1} and X_{2}, let p_{ij} denote the average of the logit-transformed predicted probabilities over the individuals with X_{1} = i and X_{2} = j, for i, j in {0, 1}. Note that these are subgroup-specific averages from a single probability machine run and not counterfactual computations. An interaction plot is created by displaying these four values against the values of X_{1}. As in classical analysis of variance, we check whether the line joining p_{01} and p_{11} is parallel to the line joining p_{00} and p_{10}. This is a check for multiplicative interactions. For additive interactions, we replace the averages of the logit-transformed probabilities above with the averages of the probability estimates themselves.
Simulations
Consider a logistic model with outcome Y and ten binary predictors X_{1}, …, X_{10}, each with P(X_{i} = 1) = 0.3, with main effects odds ratios of 1.3 and 2 for X_{1} and X_{2} respectively, and 1 for the rest. Call this Model 4. We add an interaction odds ratio of 2 for the predictor pair (X_{1}, X_{2}) to Model 4 and call it Model 5. We generate 1000 data points under each model and repeat 1000 times.
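One replicate from each of these two models could be generated along the following lines in Python. The function name, the seed, and in particular the intercept of -1.0 are assumptions made for illustration; the text does not report an intercept.

```python
import math
import random

def simulate_logistic(n, n_features=10, prevalence=0.3, main_or=(1.3, 2.0),
                      interaction_or=1.0, intercept=-1.0, seed=0):
    """Draw (rows, outcomes) from a logistic model with binary predictors.

    main_or holds the odds ratios of X1 and X2; all other predictors are null.
    interaction_or is the X1-by-X2 interaction odds ratio (1.0 means none).
    intercept is an assumed value, not one taken from the text.
    """
    rng = random.Random(seed)
    rows, outcomes = [], []
    for _ in range(n):
        x = [1 if rng.random() < prevalence else 0 for _ in range(n_features)]
        eta = (intercept
               + math.log(main_or[0]) * x[0]
               + math.log(main_or[1]) * x[1]
               + math.log(interaction_or) * x[0] * x[1])
        prob = 1 / (1 + math.exp(-eta))
        rows.append(x)
        outcomes.append(1 if rng.random() < prob else 0)
    return rows, outcomes

# One replicate of 1000 data points under each model.
model4_rows, model4_y = simulate_logistic(1000, interaction_or=1.0)
model5_rows, model5_y = simulate_logistic(1000, interaction_or=2.0)
```

Repeating the call with 1000 different seeds would mirror the 1000 replicates described above.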
We can compute the classical contrast T = p_{11} - p_{10} - p_{01} + p_{00} from the fitted machine. If the lines are parallel, i.e., there is no multiplicative interaction, T should be 0. This contrast can be used as an indicator to detect the presence of an interaction over pairs of features. In Model 4, T = 0.004 (true value = 0) and in Model 5, T = 0.747 (true value = 0.693) for the (X_{1}, X_{2}) interaction.
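The contrast can be computed from the predictions of a single machine run roughly as in the Python sketch below. The function names are hypothetical, the predicted probabilities are assumed to lie strictly between 0 and 1 so that the logit is defined, and the check uses exact logistic probabilities (with an assumed intercept of -1.0) rather than fitted ones.

```python
import math
from collections import defaultdict

def logit(p):
    return math.log(p / (1 - p))

def interaction_contrast(rows, probs, i, j, transform=logit):
    """T = m11 - m10 - m01 + m00 for the features in columns i and j.

    m_ab is the mean of transform(predicted probability) over the rows
    with feature i equal to a and feature j equal to b.
    """
    groups = defaultdict(list)
    for x, p in zip(rows, probs):
        groups[(x[i], x[j])].append(transform(p))
    m = {key: sum(v) / len(v) for key, v in groups.items()}
    return m[1, 1] - m[1, 0] - m[0, 1] + m[0, 0]

# Check with exact probabilities from a logistic model whose main effect odds
# ratios are 1.3 and 2 and whose interaction odds ratio is 2.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
probs = [1 / (1 + math.exp(-(-1.0 + math.log(1.3) * a + math.log(2.0) * b
                             + math.log(2.0) * a * b)))
         for a, b in rows]
T_multiplicative = interaction_contrast(rows, probs, 0, 1)
T_additive = interaction_contrast(rows, probs, 0, 1, transform=lambda p: p)
```

On the logit scale the contrast recovers log 2, approximately 0.693, the true value quoted for Model 5; passing the identity transform gives the additive-scale contrast instead.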
The discovery method leveraging interaction plots and the linear contrast can be used to scan pairs of predictors to quickly find potentially interacting predictors, and the 4-machine estimation method can be used to estimate the interaction effects for those interacting predictors. We can visualize the sets of potentially interacting predictors using a heat map, where each axis has the set of predictors of interest and the color is based on the magnitude of the linear contrast statistic T. Note that we can just as easily investigate additive interaction under the same general scheme with the exact same RFPM as before, with no additional modeling runs; this is due to the fact that we are predicting the counterfactual probabilities in a nonparametric, model-free manner and so can estimate any particular function of them immediately. We are no longer limited by the logistic model’s constraint of multiplicative interaction estimation.
The extension of this methodology to categorical predictors is straightforward. Plausible and efficient extensions to continuous predictors are under study.
Conclusions
We believe there are several advantages to the learning machine approach for risk effect estimation. This approach leverages directly the concept of counterfactual estimates and consistent predictive models to get predictions of the counterfactual probabilities. First, since these predicted counterfactual probabilities are generated by probability machines which are consistent, the individual counterfactual probability estimates should also be consistent, within the respective contexts X = 0 and X = 1. Obtaining good estimates of these individual probabilities grants us the flexibility of directly estimating different risk effect functions, such as risk differences, risk ratios and odds ratios, that are all based on the standard counterfactual argument. More precisely, these risk machine estimates are nonparametric estimates of the population effects and are not influenced by assumptions made in modeling as is essential for using classical logistic regression. This property also allows us to empirically understand the nature of the effect in the additive, multiplicative or other scale, rather than having to begin with an assumption of multiplicative interaction, as when using logistic regression.
Second, one cannot typically assume that data is generated from purely main effects. Our simulations show that if interactions are suspected, then RFPM is an accurate and more efficient choice for estimating the conditional probabilities, and then for estimating the odds ratios, than is the exploratory logistic regression, LR2, which includes all two-way interactions. Of course, interactions need not be limited to two-way interactions, and including higher order interactions in a logistic framework quickly increases the number of parameters to be estimated, hence further reducing efficiency. The model-free risk machine approach also allows us to freely consider complex nonlinear interactions, as when the odds ratios or risk ratios change nonlinearly with another continuous predictor.
Third, the risk machine RFPM intrinsically incorporates higher order interactions in its tree and forest based probability estimation and so can smoothly accommodate risk estimation over higher order interactions, without having them inserted as features in the analysis before the machine is applied to the data.
Fourth, note that for all the experiments described, the specification of the RFPM machine was identical, in that we only tell the machine what the predictors are and identify the outcome: nothing more is assumed as part of any model building approach. We can be entirely agnostic to the generative mechanism of the data while invoking a nonparametric risk machine such as RFPM and still end with good estimation and prediction. Such is not the case for a logistic regression approach unless it estimates from a correct and fully specified model.
The risk machine approach, therefore, is an efficient and practical way to interrogate data with binary outcomes, free of the usual hazard of model misspecification. Effectively the researcher does not need to validate or calibrate a parametric model before efficient and unbiased risk estimation can be studied, and the data analytic energy can be directed at consideration of what makes functional sense for risk estimation across all the features.
Abbreviations
 PM: Probability Machine
 RFPM: Random Forest Probability Machine
 LR: Logistic regression
 OR: Odds ratio
 RR: Relative risk
 RD: Risk difference.
Declarations
Acknowledgements
This project was supported by the Intramural Research Programs of the National Institute of Arthritis, Musculoskeletal and Skin Disorders (AD), the National Human Genome Research Institute (SS, JEBW), and the Center for Information Technology (JDM), National Institutes of Health, as well as by NIH grant LM009012 (JHM). The authors would also like to acknowledge numerous discussions with other colleagues, including Deanna Greenstein, Larry Brody and Nilanjan Chatterjee.
References
 1. Malley JD, Kruppa J, Dasgupta A, Malley KG, Ziegler A: Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med. 2012, 51: 74-81. [http://dx.doi.org/10.3414/ME00-01-0052]
 2. Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.
 3. Biau G, Devroye L, Lugosi G: Consistency of random forests and other averaging classifiers. J Mach Learn Res. 2008, 9: 2015-2033.
 4. Biau G, Cerou F, Guyader A: On the rate of convergence of the bagged nearest neighbor estimate. J Mach Learn Res. 2010, 11: 687-712.
 5. Biau G: Analysis of a random forests model. J Mach Learn Res. 2012, 13: 1063-1095.
 6. Chen ZX, Sturgil D, Qu J, Jiang H, Park S, Boley N, Suzuki AM, Fletcher AR, Plachetzki DC, FitzGerald PC, Artieri CG, Atallah J, Barmina O, Brown JB, Blankenburg KP, Clough E, Dasgupta A, Gubbala S, Han Y, Jayaseelan JC, Kalra D, Kim YA, Kovar CL, Lee SL, Li M, Malley JD, Malone JH, Mathew T, Mattiuzzo NR, Munidasa M: Comparative analysis of the D. melanogaster modEncode transcriptome annotation. Genome Research. in press
 7. Liaw A, Wiener M: Classification and regression by random forest. R News. 2002, 2 (3): 18-22.
 8. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E: Scikit-learn: machine learning in Python. J Mach Learn Res. 2011, 12: 2825-2830.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.