An R package implementation of multifactor dimensionality reduction
© Winham and Motsinger-Reif; licensee BioMed Central Ltd. 2011
Received: 23 May 2011
Accepted: 16 August 2011
Published: 16 August 2011
A breadth of high-dimensional data is now available with unprecedented numbers of genetic markers, and data-mining approaches to variable selection are increasingly being used to uncover associations, including potential gene-gene and gene-environment interactions. One of the most commonly used data-mining methods for case-control data is Multifactor Dimensionality Reduction (MDR), which has been successful in both simulations and real data applications. Additional software applications in alternative programming languages can improve the availability and usefulness of the method for a broader range of users.
We introduce a package for the R statistical language to implement the Multifactor Dimensionality Reduction (MDR) method for nonparametric variable selection of interactions. This package is designed to provide an alternative implementation for R users, with great flexibility and utility for both data analysis and research. The 'MDR' package is freely available online at http://www.r-project.org/. We also provide data examples to illustrate the use and functionality of the package.
MDR is a frequently-used data-mining method to identify potential gene-gene interactions, and alternative implementations will further increase this usage. We introduce a flexible software package for R users.
With advances in genotyping technologies, a breadth of high-dimensional data is now available with unprecedented numbers of genetic markers to perform association mapping in human genetics. Identifying variants associated with complex human traits is a common problem, and data-mining approaches to variable selection are frequently used for this analysis. There is growing evidence that epistasis may play a role in disease risk, and many variable selection approaches have been developed to consider potential gene-gene and gene-environment interactions. One of the most commonly used techniques for case-control data is Multifactor Dimensionality Reduction (MDR), a nonparametric exhaustive search method that considers all combinations of potentially interacting loci and classifies individuals to disease status based on their genetic information [1]. MDR has been highly successful in human genetics, with a large number of associations identified in real data applications. Additionally, the performance of the method has been extensively studied in a range of simulation experiments, and the method has undergone numerous developments and extensions to improve performance [2, 3].
Software implementing the MDR method is currently available, including a GUI implementation at http://www.epistasis.org; however, additional implementations in alternative programming languages are welcome because they make the method usable by a broader range of users. The free and open-source R statistical software is one of the most widely used statistical software environments. We introduce a new package for the R statistical language, 'MDR'. The package is designed to provide an alternative implementation for R users, and has great flexibility and utility for both data analysis and research. Currently, an R package exists to implement a parametric extension, model-based MDR ('mbmdr'); however, it does not implement the original nonparametric form that is most commonly used, and it lacks extensive flexibility and analysis options. The package 'MDR' implements MDR for variable selection of interactions as first outlined in [1], providing options for internal validation and functions to summarize the fit and perform post-hoc inference, and is available at http://www.r-project.org/.
In its traditional implementation, MDR is considered both statistically and genetically non-parametric because it does not estimate any statistical model parameters or assume a particular genetic inheritance mode. MDR reduces the dimensionality of the data by viewing combinations of loci (that may interact) as a series of multi-factorial genotypes, rather than as separate variables. MDR creates a classification rule based on these combinations using a Naïve Bayes classifier, labeling genotype combinations with a large ratio of cases to controls as high-risk and all others as low-risk. Using this high-risk/low-risk parameterization, the performance of the classification rule is evaluated, typically by classification accuracy, the proportion of correctly classified individuals. A final model is chosen to maximize this accuracy, that is, to misclassify the fewest individuals. The final model should also perform well in terms of prediction, and internal validation procedures such as cross-validation are used to measure prediction accuracy. It is this traditional implementation that we employ in the R package 'MDR'.
In terms of the entries of the 2 × 2 classification table, balanced accuracy (BA) is
$$\mathrm{BA} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right),$$
where (TP, TN, FP, FN) represent the number of true positives, true negatives, false positives, and false negatives classified by a particular combination of loci, respectively. Balanced accuracy, the arithmetic mean of sensitivity and specificity, has been shown to outperform the traditional measure of classification accuracy when datasets are unbalanced. Other evaluation measures are possible, including additional contingency table measures, but are not currently included in this package.
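The difference between the two measures on unbalanced data can be illustrated with a minimal base-R sketch. The helper names 'balanced_accuracy' and 'classification_accuracy' are assumptions made for this illustration and are not functions of the 'MDR' package.

```r
# Balanced accuracy from true and predicted labels (0 = control, 1 = case).
# Illustrative helper only; not part of the 'MDR' package.
balanced_accuracy <- function(truth, pred) {
  tp <- sum(truth == 1 & pred == 1)
  tn <- sum(truth == 0 & pred == 0)
  fp <- sum(truth == 0 & pred == 1)
  fn <- sum(truth == 1 & pred == 0)
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  (sensitivity + specificity) / 2
}

# Traditional classification accuracy: proportion classified correctly
classification_accuracy <- function(truth, pred) mean(truth == pred)

# Unbalanced toy data: 2 cases and 8 controls, everyone predicted "control"
truth <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
pred  <- rep(0, 10)

ca <- classification_accuracy(truth, pred)  # 0.8, looks good but finds no cases
ba <- balanced_accuracy(truth, pred)        # 0.5, no better than chance
```

In this toy example the trivial rule scores 0.8 on classification accuracy but only 0.5 on balanced accuracy, which is the behavior that motivates balanced accuracy for unbalanced datasets.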
This package assumes binary case-control data with categorical predictor variables. The binary response variable is coded as 0 or 1, and the categorical predictors (typically SNP genotypes) are coded numerically (0, 1, 2, etc.). The user can specify the particular genotype encoding and can also control the threshold for assigning high-risk/low-risk status to variable combinations.
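To make the data coding and the high-risk/low-risk threshold concrete, the following base-R sketch labels the genotype combinations of two loci. It is an illustration under stated assumptions and not the package's internal code: the function 'label_high_risk', its 'threshold' argument, and the toy data are hypothetical, and ties are labeled high-risk here, which is only assumed to resemble the equal = "HR" option that appears in the example calls later in this article.

```r
# Toy data: binary status (0 = control, 1 = case) and two SNPs coded 0/1/2.
# All object and function names are illustrative, not part of the 'MDR' package.
toy <- data.frame(
  status = c(1, 1, 1, 0,  1, 0, 0, 0,  1, 1, 0, 0),
  snp1   = c(0, 0, 0, 0,  1, 1, 1, 1,  2, 2, 2, 2),
  snp2   = c(1, 1, 1, 1,  0, 0, 0, 0,  2, 2, 2, 2)
)

label_high_risk <- function(status, genotypes, threshold = 1) {
  # One multifactor genotype ("cell") per individual, e.g. "0.1" or "2.2"
  cell <- as.character(interaction(genotypes, drop = TRUE))
  cases    <- tapply(status == 1, cell, sum)
  controls <- tapply(status == 0, cell, sum)
  # High-risk when the case:control ratio reaches the threshold; written as
  # cases >= threshold * controls so that cells without controls are handled.
  high_risk <- names(cases)[as.vector(cases >= threshold * controls)]
  list(high_risk = high_risk, predicted = as.integer(cell %in% high_risk))
}

fit <- label_high_risk(toy$status, toy[, c("snp1", "snp2")])
fit$high_risk   # cells "0.1" (3 cases, 1 control) and "2.2" (tie, 2 vs 2)
fit$predicted   # 1 for individuals in a high-risk cell, 0 otherwise
```

Raising the threshold, for example to 2, would leave only the cell with three cases and one control labeled high-risk.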
This package provides a base function 'mdr' to fit a list of MDR models, ranked by balanced accuracy. However, as with all data-mining methods, over-fitting a model to a particular data set is a concern, and it is suggested that MDR be used in conjunction with an internal validation technique. This package provides two such procedures: k-fold cross-validation and three-way split internal validation.
In k-fold cross-validation, the data are randomly split into k equal intervals, where k-1 intervals are used for training and one interval is used for testing. The best MDR model is determined from the training set for each size of interaction, and an estimate of the model's prediction accuracy is calculated from the testing set. This procedure is repeated for all k possible splits of the data, and a final model is chosen to maximize both prediction accuracy and cross-validation consistency across the splits. The function 'mdr.cv' implements cross-validation and allows the user to specify the highest level of interaction to consider, as well as the number of intervals k; typically a value of k = 5 or 10 yields high performance.
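The cross-validation procedure described above can be sketched in base R as follows. This is a simplified illustration, not the code behind 'mdr.cv': it considers two-locus models only, uses a fixed case:control threshold of 1, and all helper names ('ba', 'cell_id', 'fit_cells', 'predict_cells') and the simulated data are assumptions made for the example.

```r
set.seed(1)
n <- 200
# Simulated data (illustrative only): 5 SNPs coded 0/1/2
geno <- as.data.frame(matrix(sample(0:2, n * 5, replace = TRUE), n, 5,
                             dimnames = list(NULL, paste0("snp", 1:5))))
# Status follows an interaction between snp1 and snp2, with 10% label noise
signal <- (geno$snp1 + geno$snp2) %% 2 == 1
status <- as.integer(ifelse(runif(n) < 0.9, signal, !signal))

# Illustrative helpers; none of these names belong to the 'MDR' package
ba <- function(truth, pred) {
  (mean(pred[truth == 1] == 1) + mean(pred[truth == 0] == 0)) / 2
}
cell_id <- function(geno, loci) {
  do.call(paste, c(geno[, loci, drop = FALSE], sep = "."))
}
fit_cells <- function(status, geno, loci) {
  cell <- cell_id(geno, loci)
  cases    <- tapply(status == 1, cell, sum)
  controls <- tapply(status == 0, cell, sum)
  names(cases)[as.vector(cases >= controls)]   # labels of high-risk cells
}
predict_cells <- function(high_risk, geno, loci) {
  as.integer(cell_id(geno, loci) %in% high_risk)
}

k <- 5
fold  <- sample(rep(seq_len(k), length.out = n))  # random, equal-sized folds
pairs <- combn(ncol(geno), 2, simplify = FALSE)   # all two-locus models

cv <- lapply(seq_len(k), function(i) {
  train <- fold != i
  test  <- fold == i
  # Best two-locus model by balanced accuracy in the training intervals
  train_ba <- sapply(pairs, function(loci) {
    hr <- fit_cells(status[train], geno[train, ], loci)
    ba(status[train], predict_cells(hr, geno[train, ], loci))
  })
  best <- pairs[[which.max(train_ba)]]
  hr   <- fit_cells(status[train], geno[train, ], best)
  # Prediction accuracy of that model in the held-out interval
  list(loci = best,
       test_ba = ba(status[test], predict_cells(hr, geno[test, ], best)))
})

models <- sapply(cv, function(x) paste(x$loci, collapse = ","))
cvc <- max(table(models))                          # cross-validation consistency
mean_test_ba <- mean(sapply(cv, `[[`, "test_ba"))  # average prediction accuracy
```

Here cross-validation consistency is taken as the number of the k splits in which the same model is selected, so it ranges from 1 to k.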
In three-way split internal validation, the data are randomly split into three sets for training, testing, and validation. MDR is first implemented in the training set for all possible combinations of loci, and the x models with the highest balanced accuracy are retained for evaluation in the testing set. MDR is next performed on all x models in the testing set, and the best model for each level of interaction is preserved for evaluation of predictive ability in the validation set. A final model is chosen to maximize balanced accuracy in the validation set. The function 'mdr.3WS' implements three-way split internal validation and allows the user to specify the ratio of the three data splits (training:testing:validation), as well as the number of potential models x from the training set to be evaluated in the testing set.
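A minimal base-R sketch of the three-way split is given below. It is not the code behind 'mdr.3WS'. The 2:2:1 ratio, the value x = 3, the helper 'mdr_ba', and the simulated data are assumptions, only two-locus models are considered, and the choice to learn the high-risk labels for the validation step from the training set is an assumption of this sketch rather than a documented detail.

```r
set.seed(2)
n <- 250
# Simulated data (illustrative only): 6 SNPs coded 0/1/2
geno <- as.data.frame(matrix(sample(0:2, n * 6, replace = TRUE), n, 6))
signal <- (geno[, 1] + geno[, 2]) %% 2 == 1
status <- as.integer(ifelse(runif(n) < 0.9, signal, !signal))

# Balanced accuracy of a model whose high-risk cells are learned on
# 'fit_rows' and evaluated on 'eval_rows' (illustrative helper)
mdr_ba <- function(loci, fit_rows, eval_rows) {
  cell <- do.call(paste, c(geno[, loci, drop = FALSE], sep = "."))
  cases    <- tapply(status[fit_rows] == 1, cell[fit_rows], sum)
  controls <- tapply(status[fit_rows] == 0, cell[fit_rows], sum)
  high_risk <- names(cases)[as.vector(cases >= controls)]
  pred  <- cell[eval_rows] %in% high_risk
  truth <- status[eval_rows] == 1
  (mean(pred[truth]) + mean(!pred[!truth])) / 2
}

# Random split in an assumed training:testing:validation ratio of 2:2:1
ratio <- c(2, 2, 1)
part  <- sample(rep(c("train", "test", "valid"), times = n * ratio / sum(ratio)))
rows  <- split(seq_len(n), part)

pairs <- combn(ncol(geno), 2, simplify = FALSE)
x <- 3   # number of top training models carried forward

# Step 1: rank all models in the training set and keep the top x
train_ba <- sapply(pairs, mdr_ba, fit_rows = rows$train, eval_rows = rows$train)
top_x <- pairs[order(train_ba, decreasing = TRUE)[seq_len(x)]]

# Step 2: re-run MDR for the x retained models in the testing set
test_ba <- sapply(top_x, mdr_ba, fit_rows = rows$test, eval_rows = rows$test)
best <- top_x[[which.max(test_ba)]]

# Step 3: estimate predictive ability of the selected model in the validation set
valid_ba <- mdr_ba(best, fit_rows = rows$train, eval_rows = rows$valid)
```

Because the full search over all combinations is run only once, on the training portion, this design trades some of the averaging of cross-validation for a smaller amount of computation.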
Both internal validation methods create objects of class 'mdr', a list of the final selected model loci and its prediction accuracy, the top models and their prediction accuracies, and the high-risk/low-risk characterization of the final model.
Three methods exist for objects of class 'mdr': 'summary', 'plot', and 'predict'. The 'summary' method provides a table summarizing the model fit at each stage of interaction. The 'plot' method provides a contingency table of bar graphs for the final model, portraying the numbers of cases and controls in each genotype combination, similar to the GUI implementation at http://www.epistasis.org. The 'predict' method allows the user to predict case-control status on a new, independent set of data with a model obtained from a previously fit 'mdr' object.
Post-hoc Functions for Inference
After an MDR model has been fit, a number of functions are available for inference on that fit. Permutation testing is available to test the significance of the reported measure of prediction accuracy: case-control status is randomly permuted a user-specified number of times, and the prediction accuracies from the MDR fits to the permuted data sets are compared to a specified accuracy. In addition to the traditional permutation test of the full MDR model, we also incorporate a permutation test of interaction based on the likelihood ratio test, as described in Edwards et al. [12]. Furthermore, estimates of prediction accuracy are obtained from retrospective case-control data and therefore may not reflect the true accuracy of prospective predictions. Using a previously estimated population prevalence rate provided by the user, these prediction accuracy estimates can be adjusted using one of two post-hoc procedures, implemented in 'boot.error' and 'mdr.ca.adj'.
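The logic of the permutation test can be sketched in base R as follows. This is a simplified illustration and not the package's permutation function: it permutes the status labels for a single fixed two-locus model and uses training balanced accuracy as the statistic, whereas a full MDR permutation test would repeat the entire model search and validation on each permuted data set. The helper 'mdr_stat', the number of permutations B, and the simulated data are assumptions.

```r
set.seed(42)
n <- 200
# Simulated data (illustrative only): two interacting SNPs, 15% label noise
snp1 <- sample(0:2, n, replace = TRUE)
snp2 <- sample(0:2, n, replace = TRUE)
signal <- (snp1 + snp2) %% 2 == 1
status <- as.integer(ifelse(runif(n) < 0.85, signal, !signal))

# Balanced accuracy of the two-locus MDR rule for a given status vector
# (illustrative helper, not part of the 'MDR' package)
mdr_stat <- function(status) {
  cell <- paste(snp1, snp2, sep = ".")
  cases    <- tapply(status == 1, cell, sum)
  controls <- tapply(status == 0, cell, sum)
  high_risk <- names(cases)[as.vector(cases >= controls)]
  pred <- cell %in% high_risk
  (mean(pred[status == 1]) + mean(!pred[status == 0])) / 2
}

observed <- mdr_stat(status)

# Null distribution: refit after randomly permuting case-control status
B <- 200
null_stats <- replicate(B, mdr_stat(sample(status)))

# One-sided permutation p-value, counting the observed statistic itself
p_value <- (1 + sum(null_stats >= observed)) / (B + 1)
```

Adding one to both the numerator and the denominator keeps the estimated p-value strictly positive, so its smallest attainable value is 1 / (B + 1).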
Results and Discussion
To illustrate the usage of the package, we provide a computational example using a simulated dataset of 250 individuals who were genotyped at 25 SNPs. We first fit an MDR model using cross-validation with cv = 5 cross-validation intervals. We consider all combinations of SNPs up to size K = 3, use the default settings for the other options, and then summarize the fit:
> fit.cv <- mdr.cv(data = mdr1, K = 3, cv = 5, ratio = NULL,
+     equal = "HR", genotype = c(0, 1, 2))
> summary(fit.cv)
Summary table for MDR fit with 5-fold cross-validation
4 6 9
We can also fit an MDR model using three-way split internal validation, again allowing for combinations of SNPs up to size K = 3 with the default settings for the other options, and then summarize the fit:
> fit.3WS <- mdr.3WS(data = mdr1, K = 3, x = NULL, proportion = NULL, ratio = NULL, equal = "HR", genotype = c(0, 1, 2))
> summary(fit.3WS)
Summary table for MDR fit with three-way split validation
4 9 24
> boot.error(mdr1,prev = 0.10, model = fit.cv$'final model', hr = fit.cv$'high-risk/low-risk', b = 100)
$'classification error estimate'
$'classification accuracy estimate'
> mdr.ca.adj(mdr1, model = fit.cv$'final model', hr = fit.cv$'high-risk/low-risk', prev = 0.10)
$'adjusted classification accuracy'
$'adjusted classification error'
After the prospective adjustment, we now estimate a prediction accuracy of around 52%, a reduction from the original retrospective estimate of 64.12%.
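One way to see why a prevalence adjustment lowers the estimate is that overall accuracy in a population is a prevalence-weighted average of sensitivity and specificity. The base-R sketch below shows only this arithmetic; it is not the procedure implemented in 'boot.error' or 'mdr.ca.adj', and the function name and the sensitivity and specificity values are hypothetical.

```r
# Accuracy expected in a population with a given disease prevalence.
# Illustrative helper; not the internals of 'boot.error' or 'mdr.ca.adj'.
adjust_accuracy <- function(sensitivity, specificity, prev) {
  prev * sensitivity + (1 - prev) * specificity
}

# Hypothetical values chosen only to show the arithmetic
acc_balanced    <- adjust_accuracy(0.80, 0.50, prev = 0.50)  # 0.65
acc_prospective <- adjust_accuracy(0.80, 0.50, prev = 0.10)  # 0.53
```

With equal numbers of cases and controls the two components are weighted equally, while at a prevalence of 0.10 the specificity dominates, so a model that is more sensitive than specific looks less accurate prospectively.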
Sample run time in seconds for the package 'MDR' and for the GUI version
MDR1 ( n = 250, p = 25)
MDR2 ( n = 250, p = 50)
MDR3 ( n = 500, p = 50)
The R computing environment is known to be much slower than competing languages such as C++ and Java, so the increased run time compared with the Java GUI implementation is neither surprising nor unreasonable (see http://dan.corlan.net/bench.html). Increased computation time, particularly for high-dimensional data, is a limitation of R relative to other programming languages. While a traditional R package cannot compete with Java or C++ in terms of computation time, the computation time can be reduced. For instance, parts of the R package source code could be written in C. Furthermore, because many of the calculations in MDR are independent, many of the looping constructs could be executed in parallel. Great strides have recently been made in parallel computing in R, and this package could be extended to include parallelization using recently developed packages such as 'foreach', 'doMC', and 'doSNOW' (see http://cran.r-project.org/web/views/HighPerformanceComputing.html). Parallel computing could drastically reduce computation time for MDR, particularly on a cluster. Because R is used in varied settings (single workstations, multiple workstations, and multi-node clusters), parallelization is not currently implemented in this package. Additionally, R has memory limitations for high-dimensional datasets, which are common with genetic data. Advances have been made in this area, and the 'bigmemory' package allows the user to store and analyze large datasets. The open-source nature of the R environment and of this package makes these types of extensions possible.
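As a rough illustration of how the independent model evaluations could be distributed, the sketch below scores all two-locus models with 'mclapply' from the 'parallel' package that ships with R, used here in place of the 'foreach', 'doMC', and 'doSNOW' packages named above. It is not part of the 'MDR' package; the helper 'pair_ba', the simulated data, and the choice of two worker processes are assumptions, and forking is not available on Windows, where the sketch falls back to a single core.

```r
library(parallel)   # distributed with base R

set.seed(7)
n <- 250
p <- 10
# Simulated data (illustrative only): p SNPs coded 0/1/2
geno <- matrix(sample(0:2, n * p, replace = TRUE), n, p)
signal <- (geno[, 3] + geno[, 7]) %% 2 == 1
status <- as.integer(ifelse(runif(n) < 0.9, signal, !signal))

# Balanced accuracy of the MDR rule for one pair of loci (illustrative helper)
pair_ba <- function(loci) {
  cell <- paste(geno[, loci[1]], geno[, loci[2]], sep = ".")
  cases    <- tapply(status == 1, cell, sum)
  controls <- tapply(status == 0, cell, sum)
  pred <- cell %in% names(cases)[as.vector(cases >= controls)]
  (mean(pred[status == 1]) + mean(!pred[status == 0])) / 2
}

pairs <- combn(p, 2, simplify = FALSE)   # every two-locus model

# Serial evaluation
ba_serial <- sapply(pairs, pair_ba)

# Parallel evaluation: each model is scored independently of the others
cores <- if (.Platform$OS.type == "windows") 1L else 2L
ba_parallel <- unlist(mclapply(pairs, pair_ba, mc.cores = cores))

best <- pairs[[which.max(ba_parallel)]]  # highest-scoring pair of loci
```

Because each model is scored without reference to the others, the serial and parallel results agree, and the same pattern would extend to higher-order combinations or to a cluster back end.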
Because of these limitations, and without the aforementioned extensions, the current implementation is primarily useful for smaller candidate gene analyses, for searches for low-order models in larger-scale candidate gene studies, and for methodological research. In real data analysis, the package is most suitable for evaluating candidate interactions among a moderate number of loci rather than for genome-wide variable selection. Moreover, the R implementation allows the user to integrate this data-mining analysis into more traditional statistical analyses. In addition, because it is written in such a flexible environment, the package allows for easy extension of the MDR methodology for further research.
We introduce new software to implement the MDR method for variable selection of epistatic interactions using the R statistical language. The package 'MDR' is designed to provide an alternative implementation for R users, with great flexibility and utility for both data analysis and research.
Availability and Requirements
Project name: R package, MDR
Project home page: http://cran.r-project.org/web/packages/MDR/index.html
Operating systems: Linux, Mac OS, Windows
Programming language: R
Other requirements: R package, lattice
License: GNU GPL-2
Any restrictions to use by non-academics:
This work was supported by Grant T32GM081057 from the National Institute of General Medical Sciences and the National Institutes of Health. This package was previously described in a North Carolina State University Department of Statistics Technical Report, available at http://www.stat.ncsu.edu/information/library/mimeo.html. We would like to thank David Reif for his help and input.
1. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69 (1): 138-147. 10.1086/321276.
2. Ritchie MD, Motsinger AA: Multifactor dimensionality reduction for detecting gene-gene and gene-environment interactions in pharmacogenomics studies. Pharmacogenomics. 2005, 6 (8): 823-834. 10.2217/14622416.6.8.823.
3. Moore JH: Detecting, characterizing, and interpreting nonlinear gene-gene interactions using multifactor dimensionality reduction. Adv Genet. 2010, 72: 101-116.
4. Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003, 19 (3): 376-382. 10.1093/bioinformatics/btf869.
5. Calle ML, Urrea V, Vellalta G, Malats N, Steen KV: Improving strategies for detecting genetic patterns of disease susceptibility in association studies. Statistics in Medicine. 2008, 27 (30): 6532-6546. 10.1002/sim.3431.
6. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. Journal of Theoretical Biology. 2006, 241 (2): 252-261. 10.1016/j.jtbi.2005.11.036.
7. Motsinger AA, Ritchie MD: The effect of reduction in cross-validation intervals on the performance of multifactor dimensionality reduction. Genet Epidemiol. 2006, 30 (6): 546-555. 10.1002/gepi.20166.
8. Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, Moore JH: A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol. 2007, 31 (4): 306-315. 10.1002/gepi.20211.
9. Bush WS, Edwards TL, Dudek SM, McKinney BA, Ritchie MD: Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction. BMC Bioinformatics. 2008, 9.
10. Winham SJ, Slater AJ, Motsinger-Reif AA: A comparison of internal validation techniques for multifactor dimensionality reduction. BMC Bioinformatics. 2010, 11: 394. 10.1186/1471-2105-11-394.
11. Motsinger-Reif AA: The effect of alternative permutation testing strategies on the performance of multifactor dimensionality reduction. BMC Res Notes. 2008, 1: 139. 10.1186/1756-0500-1-139.
12. Edwards TL, Turner SD, Torstenson ES, Dudek SM, Martin ER, Ritchie MD: A General Framework for Formal Tests of Interaction after Exhaustive Search Methods with Applications to MDR and MDR-PDT. PLoS One. 2010, 5 (2).
13. Winham SJ, Motsinger-Reif AA: The effect of retrospective sampling on estimates of prediction error for multifactor dimensionality reduction. Ann Hum Genet. 2011, 75 (1): 46-61. 10.1111/j.1469-1809.2010.00587.x.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.