 Methodology
 Open access
 Published:
Risk score modeling of multiple gene to gene interactions using aggregatedmultifactor dimensionality reduction
BioData Mining volume 6, Article number: 1 (2013)
Abstract
Background
Multifactor Dimensionality Reduction (MDR) has been widely applied to detect genegene (GxG) interactions associated with complex diseases. Existing MDR methods summarize disease risk by a dichotomous predisposing model (highrisk/lowrisk) from one optimal GxG interaction, which does not take the accumulated effects from multiple GxG interactions into account.
Results
We propose an AggregatedMultifactor Dimensionality Reduction (AMDR) method that exhaustively searches for and detects significant GxG interactions to generate an epistasis enriched gene network. An aggregated epistasis enriched risk score, which takes into account multiple GxG interactions simultaneously, replaces the dichotomous predisposing risk variable and provides higher resolution in the quantification of disease susceptibility. We evaluate this new AMDR approach in a broad range of simulations. Also, we present the results of an application of the AMDR method to a data set derived from Juvenile Idiopathic Arthritis patients treated with methotrexate (MTX) that revealed several GxG interactions in the folate pathway that were associated with treatment response. The epistasis enriched risk score that pooled information from 82 significant GxG interactions distinguished MTX responders from nonresponders with 82% accuracy.
Conclusions
The proposed AMDR is innovative in the MDR framework to investigate aggregated effects among GxG interactions. New measures (pOR, pRR and pChi) are proposed to detect multiple GxG interactions.
Background
Human diseases usually have complex inheritance patterns, and single nucleotide polymorphisms (SNPs) has been utilized to explain the variation in susceptibility to many common complex diseases as well as the response to drug therapy. The advancement of genotyping technology has made genotypic data readily accessible to investigators at low cost. However, many challenges remain with regard to identifying genes that render people susceptible to nonMendelian disorders and in understanding the associations and functional relationships among genes. More and more, researchers have been advocating for advanced statistical analysis to quantify complex and interactive biological and genetic relationships[1, 2].
Multifactor Dimensionality Reduction (MDR) is a statistical paradigm for characterizing and detecting nonlinear complex genetogene interactions (epistasis) possibly associated with susceptibility to disease[3]. When numerous genes are involved in a complex canonical pathway or network, traditional approaches for data analysis, such as a Chisquare test or Fisher’s exact test, might not detect the associations between risk factors and outcomes since these approaches assess only marginal main effects of the identified risk factors. Although one can employ logistic regression or other standard multivariate categorical data analysis approaches to explore interactions among SNPs, there are an enormous number of possible interactions in a model with both linear and nonlinear effects. Consequently, standard multivariate categorical data analysis approaches might detect very few interactions, and even then the cost in terms of sample size might be immense. MDR addresses these difficulties by converting highdimensional genotypic data into a single predictive variable. Genotypic combinations are used to define high risk and low risk strata for the onedimensional predisposing risk factor. MDR can reveal nonlinear epistasis at a moderate sample size with no requirements on the underlying distributions of genotypes or outcomes[4].
The most commonly used MDR approach is described in detail by Ritchie et al.[3]. To distinguish this method from its various extensions, we will refer to it as the original MDR method. Related statistical software has been developed by Hahn et al.[4], Bush et al.[5], Winham and MotsingerReif[6], and Moore and colleagues (http://www.epistasis.org). In general, the MDR process can also be combined with a filter preprocess step by first applying global testing and filtration techniques to select the optimal number of SNPs for MDR analysis by searching for a subset of genes likely to interact with other genes using the ReliefF filtering process[7, 8].
Details of the MDR[3] are briefly described here. MDR performs an exhaustive search of all variables and variable combinations to identify univariate or multivariate disease risk models. For each locus or multilocus combination, attribute construction is performed to make a single variable with two categories: high risk and lowrisk. A genotype or combination of genotypes is assigned high risk status if the ratio of affected subjects to unaffected subjects exceeds a predetermined threshold, and lowrisk otherwise. This step consolidates the highdimensional risk space into a onedimensional predictive variable. Typically, a 5fold or 10 fold crossvalidation procedure is employed, beginning with the random division of the original data set into five or ten subsets of approximately equal sizes[9]. For 10fold crossvalidation, a model is fit using nine of the ten subsets (collectively referred to as training data), and then the model is applied to classify observations in both the training data and the tenth subset not used to fit the model (referred to as validation data). This entire process is repeated ten times, with one of the ten subsets acting as the validation data[10]. The model’s training accuracy and testing accuracy are defined as the percentage of correct classifications in the corresponding data sets. The optimal onelocus, twolocus, and threelocus MDR models with the highest testing accuracy are identified. A onelocus model estimates the main effect of each SNP, while multilocus models investigate the interactions among relevant SNPs. The cross validation consistency (CVC) is the number of times in a 10fold cross validation that a particular multifactorial combination is identified as an optimal model for the training data. Finally, statistical significance of the optimal models is assessed by 1000 or 10000count permutation testing[11].
MDR has been applied to identify genegene interactions conferring susceptibility to common diseases, including hypertension[12], bladder cancer[13], Type 2 diabetes[14], and rheumatoid arthritis[15, 16]. Several extensions of the MDR method have been proposed. These methods entail the use of odds ratios[17], loglinear methods[18], generalized linear models[19], methods for data highly imbalanced with the disease outcome[20], modelbased methods[21], contingency table measures of classification accuracy[22] and familial data[23, 24].
In the present work, we propose an AggregatedMultifactor Dimensionality Reduction (AMDR) method that exhaustively searches for statistically significant genegene (GxG) interactions to generate a gene interaction network. In particular, an epistasis enriched risk score replaces the traditional dichotomous predisposing risk factor in quantifying the degree of susceptibility to a disease. We also introduce and compare new GxG interaction measures (pOR, pRR and pChi). An adjustment for multiple comparisons is implemented to limit false positive discoveries. In the current study we introduce the new approach, evaluate its performance in a range of simulations, and apply it to a real dataset from Juvenile Idiopathic Arthritis patients treated with methotrexate (MTX).
Method
The AMDR method proposed herein can be applied to detect interactions among alleles, genotypes, and other categorical explanatory variables. Without loss of generality, we present the AMDR method for SNPs with three common states (0  homozygous reference, 1  heterozygous, 2  homozygous variant). The steps of the AMDR are outlined in Figure1.
Detect multiple GxG interactions using the pOR, pRR or pChi test
The starting point for the AMDR method is the construction of a predisposing risk factor. Suppose we want to investigate k–way GxG interactions among M SNPs. For one particular k–way GxG interaction, there are 3^{k} (SNP _{(1)} = 0, 1, 2 × SNP _{(2)} = 0, 1, 2 × ⋯ × SNP _{(k)} = 0, 1, 2) different genotypic combinations. Denote these 3^{k} genotypic combinations as C _{ ij } where j = 1, 2, ⋯, 3^{k} stands for different genotypic combinations within one k–way GxG interaction. We need another subscripti=1,2,\cdots ,\left(\begin{array}{l}M\\ k\end{array}\right) to cover all k–way GxG interactions among _{ M } SNPs. Let X _{ ij } and Y _{ ij } be the numbers of affected (Test) and unaffected (Control) subjects in the j ^{th} genotypic combination of the i ^{th} k–way GxG interaction. Let{p}_{i0}=\left({\displaystyle \sum _{j=1}^{{3}^{k}}{X}_{\mathit{\text{ij}}}}\right)/\left({\displaystyle \sum _{j=1}^{{3}^{k}}{X}_{\mathit{\text{ij}}}+{\displaystyle \sum _{j=1}^{{3}^{k}}{Y}_{\mathit{\text{ij}}}}}\right)\in \left(0,1\right) be the threshold for disease risk above which a person is deemed highly susceptible using a Naïve Bayes classifier[25]. Genotypic combinations are then classified into highrisk predisposing risk groups and lowrisk predisposing risk groups, n _{11} n _{12} n _{21} n _{22}, in Table1 where I{_{·}} denotes an indicator function.
We propose to perform one of the following measures to assess the i ^{th} k–way GxG interaction (the subscript i is omitted for n’s and e’s):

(a)
the predisposing odds ratio (pOR),
\mathit{\text{pO}}{R}_{i}=\frac{{n}_{11}{n}_{22}/\left({n}_{12}{n}_{21}\right)}{{F}_{0}^{1}\left(F\left({n}_{11}{n}_{22}/\left({n}_{12}{n}_{21}\right)\right)\right)}\text{;}(1) 
(b)
the predisposing relative risk (pRR),
\mathit{\text{pR}}{R}_{i}=\frac{\frac{{n}_{11}/\left({n}_{11}+{n}_{12}\right)}{{n}_{21}/\left({n}_{21}+{n}_{22}\right)}}{{F}_{0}^{1}F\left(\frac{{n}_{11}/\left({n}_{11}+{n}_{12}\right)}{{n}_{21}/\left({n}_{21}+{n}_{22}\right)}\right)}\text{,}(2)
and

(c)
the predisposing chisquare (pChi) test statistic,
\mathit{\text{pCh}}{i}_{i}=\frac{{\displaystyle \sum _{s=1}^{2}{\displaystyle \sum _{t=1}^{2}\frac{{\left({n}_{\mathit{\text{st}}}{e}_{\mathit{\text{st}}}\right)}^{2}}{{e}_{\mathit{\text{st}}}}}}}{{F}_{0}^{1}F\left({\displaystyle \sum _{s=1}^{2}{\displaystyle \sum _{t=1}^{2}\frac{{\left({n}_{\mathit{\text{st}}}{e}_{\mathit{\text{st}}}\right)}^{2}}{{e}_{\mathit{\text{st}}}}}}\right)}\text{,}(3)
where{e}_{\mathit{\text{st}}}=\frac{{n}_{s+}{n}_{+t}}{N} is the expected number of subjects in predisposing risk stratum s (1 = high predisposing risk, 2 = low predisposing risk) and disease stratum t (1 = Case, 2 = Control) under a null hypothesis of no association between the predisposing risk factor and the disease. Details of F and F _{0}, along with permutation tests and 95% confidence intervals for pOR, pRR and pChi are in Appendix I.
Aggregate high risk from significant GxG interactions into risk scores
Assume a study investigates a total of N subjects. For the n ^{th} (n = 1, 2, ⋯, N) subject, an aggregated k– way epistasis enriched risk score, R(k,n), is defined by
In equation (4), we use the indicator variable,I\left\{\mathit{\text{Pvalu}}{e}_{i}<\widehat{\alpha}\right\}, to remove the i ^{th} nonsignificant GxG interaction from further calculation of risk scores. For the remaining significant GxG interactions, the indicator function, I{n ∈ C _{ ij }}I{X _{ ij }/(X _{ ij } + Y _{ ij }) > p _{ i 0}}, assigns 1 if the n ^{th} subject carries a high predisposing risk genotypic combination and 0 if the n ^{th} subject carries a low predisposing risk genotypic combination. More specifically, I{X _{ ij }/(X _{ ij } + Y _{ ij }) > p _{ i 0}} indicates whether the j ^{th} genotypic combination has high predisposing risk and I{n ∈ C _{ ij }} checks whether the n ^{th} subject carries the j ^{th} genotypic combination. Each subject’s risk scores are then summed over all 3^{k}genotypic combinations and all k– way GxG interactions to obtain an aggregated k– way epistasis enriched risk score, R(k,n), for the n ^{th} subject.
Finally, the epistasis enriched risk score, R(k,n), is treated as a diagnostic test for the disease. One can consider adding up the predisposing scores with p < 0.05 from these interactions. Our experience has been that, in many cases, α=0.05 is an arbitrary cutoff for pvalues and some GxG interactions with pvalue < 0.05 might have low predictive ability. Therefore, a receiver operating characteristic (ROC) curve is constructed, and the area under the ROC curve (AUC) provides an overall evaluation of the epistasis enriched risk score’s ability to predict disease susceptibility. Instead of using an arbitrary cutoff α=0.05, we propose to select\widehat{\alpha} that maximizes the AUC of ROC curves. Choosing\widehat{\alpha}=\underset{0\le \alpha \le 0.05}{\text{arg}\phantom{\rule{.2em}{0ex}}\text{max}}\left\{\mathit{\text{AUC}}\alpha \right\}\phantom{\rule{0.24em}{0ex}} for\widehat{\alpha}\le 0.05 may help focusing on a modest number of summands that are most conducive to correct classification (arising from the strongest GxG interactions), rather than diluting the epistasis enriched risk score R(k,n) by a large number of summands that are not as conducive to correct classification (arising from comparatively weaker GxG interactions).
Construct multiple GxG interactions into an epistasis enriched network
The significant GxG interactions used to define the epistasis enriched risk score are collectively referred to as an epistasis enriched network (Figure2A). Genes involved in one or more significant interactions appear as nodes in a radial graph. Pairs of genes sharing significant interactions are connected by lines. Each line is labeled with the corresponding pOR, and the line thickness may be chosen to accentuate the strongest pORs.
In summary, the AMDR method has not only replaced the dichotomous predisposing risk factor with a continuous predictive variable, R(k,n), but has done so by integrating numerous significant GxG interactions into an epistasis enriched network that may more adequately explain the susceptibility to complex diseases. Epistasis enriched risk scores may also be accumulated over GxG interactions from multiple dimensions. For instance, we can accumulate both twoway and threeway GxG interactions. Or further consider accumulating one way main effects and up to k– way GxG interactions. The feasibility of these extensions needs to be assessed by future studies.
Empirical assessment
Extensive Monte Carlo simulations were performed to assess the performance of pOR, pRR and pChi and compare them to the original MDR in unrelated case–control studies. To avoid subjective selections of models in favor of our methods, we report the power and type I error simulation similar to the models previously assessed by[26]. In each model, we simulated five SNPs with common homozygote (AA), heterozygote (Aa) and rare homozygote (aa). The minor allele frequencies (MAF) in each model were set to be 0.5 and 0.25 respectively and the genotypes were generated according to proportional expectations under HardyWeinberg equilibrium and linkage equilibrium. Equal numbers of affected and unaffected subjects were generated based on penetrance functions given in Table2. Let D _{=1} indicate the onset of disease and l 1,l 2...,l 5 stand for five loci, P _{ l 1×l 2} and P _{ l 4×l 5} be penetrance functions from loci 1x2 and loci 4x5 respectively as listed in Table2. Our simulations comprised three major scenarios:

A)
Existence of only one twoway interaction in loci 1x2 associated with disease susceptibility while the remaining loci were unrelated to the outcome variable, i.e. P(D = 1l 1, l 2) = p _{ l 1 × l 2}.

B)
Genetic heterogeneity models where a proportion of affected subjects are linked with interactions between loci 1 and 2 and the rest of affected subjects are linked with interactions between loci 4 and 5, i.e.
\begin{array}{ll}\phantom{\rule{.8em}{0ex}}P\left(D\right.& \left(=1,Cl1,l2,l4,l5\right)=P\left(D=1l1,l2,l4,l5,C=0\right)\times P\left(C=0\right)\hfill \\ +P\left(D=1l1,l2,l4,l5,C=1\right)\times P\left(C=1\right)=P\left(D=1l1,l2\right)\times P\left(C=0\right)\\ +P\left(D=1l4,l5\right)\times P\left(C=1\right)={\gamma}_{1}{p}_{l1\times l2}+{\gamma}_{2}{p}_{l4\times l5}\end{array}
where γ _{1}≥0,γ _{2}≥0 and γ _{1}+γ _{2}=1. The latent binary variable C labels the source of genetic variation where C=0 indicates that disease is related to loci 1x2 with P(C=0)=γ _{1} and C=1 indicates that disease is related to loci 4x5 with P(C=1)=γ_{2}. In this study, we consider balanced genetic heterogeneity models with γ_{1}=γ_{2}=0.5 and unbalanced genetic heterogeneity models with γ _{1}=0.7 and γ _{2}=0.3.

C)
Additive models with two pairs of loci jointly contributing to disease susceptibility. Let D _{ l 1×l 2}=1 denote the onset of disease due to the penetrance P _{ l 1×l 2} from loci 1*2 and D _{ l 4×l 5}=1 due to the penetrance P _{ l 4×l 5} from loci 4*5. The susceptibility function is given by
\begin{array}{ll}\phantom{\rule{.8em}{0ex}}P(D=1l1,l2,l4,l5)& =P\left(\left({D}_{l1\mathit{xl}2}=1\right)\cup \left({D}_{l4\mathit{xl}5}=1\right)l1,l2,l4,l5\right)=P\left({D}_{l1\mathit{xl}2}=1l1,l2,l4,l5\right)\hfill \\ +P\left({D}_{l4\mathit{xl}5}=1l1,l2,l4,l5\right)P\left(\left({D}_{l1\mathit{xl}2}=1\right)\cap \left({D}_{l4\mathit{xl}5}=1\right)l1,l2,l4,l5\right)\\ =P\left({D}_{l1\mathit{xl}2}=1l1,l2\right)+P\left({D}_{l4\mathit{xl}5}=1l4,l5\right)P\left({D}_{l1\mathit{xl}2}=1l1,l2\right)\\ \times P\left({D}_{l4\mathit{xl}5}=1l4,l5\right)={p}_{l1\times l2}+{p}_{l4\times l5}{p}_{l1\times l2}{p}_{l4\times l5}\end{array}
To assess the power of the AMDR method, we randomly generated 100 sets of data in the above described scenarios and performed the MDR and AMDR tests for each random sample. The power is the percentage of rejection of null hypothesis for the loci with a GxG interaction. Type I error is the percentage of rejection of null hypothesis when the simulated loci have no GxG interaction.
As shown in Table3, the Type I errors of all tests were under the nominal rate of 5%. When only loci 1 and 2 had an interaction (Table3A), all five measures had strong power to detect GxG interactions in most models except that the power of MDR dropped to 0.46 in model 3 with n=300, which could be enhanced by increasing sample size. We next simulated genetic heterogeneity (Table3B) with 0.5/0.5, 0.7/0.3 proportions of subjects affected by a mixture model of epistasis from loci 1x2 and loci 4*5. We noticed that MDR lost power to detect interactions with weaker effects. The power was not recovered with increased sample sizes. We last examined the additive models (Table3C) where susceptibility increases jointly through loci 1x2 and loci 4x5. MDR has low power to detect both pairs of GxG interactions. All this evidence suggests that the proposed AMDR might be a better choice for detecting complex GxG interactions, especially when multiple GxG interactions are cumulatively contributing to a phenotype.
Application to genotyping data
Juvenile Idiopathic Arthritis (JIA) is one of the most common chronic diseases of childhood, affecting an estimated 300,000 children in the U.S. alone, and is an important cause of morbidity and disability in children. Although methotrexate (MTX) is the most commonly used secondline antiinflammatory agent used to treat JIA worldwide, this antifolate prodrug has shown considerable interindividual variability in clinical response and adverse reactions. Thus far, variables investigated as potential useful predictors of response and toxicity in patients taking MTX, which is used alone and as an “anchor drug” for many rheumatic conditions, have not been clearly associated with outcomes. The effect of individual genetic SNP variation within the folate pathway upon MTX response has been investigated in several studies in adult rheumatoid arthritis (RA) and a few studies in JIA with conflicting results. To elucidate the genetic architecture impacting the efficacy of MTX, 34 SNPs from 19 folate pathway genes were measured in 104 subjects. Responsedefined as the absence of active arthritis (swelling not due to bony enlargement or, if no swelling was present, limitation of motion accompanied by either pain on motion or tenderness, not due to trauma or explained by prior joint damage[27]was determined for these subjects. Information pertaining to the 34 SNPs is listed in Table4; demographic and clinical characteristics of the subjects were described by Becker and colleagues[15]. After being on a stable dose and route of MTX for at least 3 months, 56.7% of the patients (59 out of 104) still had at least 1 active (i.e., swollen or tender) joint. The presence of active arthritis was the outcome variable, and represented an incomplete response to MTX. By definition, the absence of active arthritisno joint involvementwas considered a positive response to MTX treatment.
Standard logistic regression analysis did not identify significant main effects from SNPs or GxG interactions. This could primarily due to the complex and nonlinear interactions among SNPs. For the rest article, the AMDR and original MDR methods were applied to the MTX data to search for genetic predictors of response to MTX. Redundant SNPs and SNPs with no prediction of the phenotype were removed by the ReliefF algorithm[7]. A complete set of 34 SNPs and 7 filtered SNPs were analyzed respectively.
The original MDR analysis method was applied to obtain the onelocus, twolocus, and threelocus models with the highest validation accuracy in the original MDR. Twolocus interactions between genes ATIC and MTHFD2 were significant in testing accuracy but not in CVC. The prediction accuracy from the optimal MDR model was 75%.
We then utilized pOR, pRR, and pChi to identify and characterize GxG interactions in the AMDR analyses. Numerous twolocus GxG interactions were significantly associated with efficacy of MTX based on pOR, pRR, and pChi. We found that pChi flagged the most GxG interactions as significantly associated with efficacy of MTX. After we used the ReliefF algorithm to narrow down the list of candidate SNPs to 7, 15 pairs GxG interactions were flagged as significant after FDR correction. Table5 lists pOR, pRR, and pChi along with 95% confidence intervals, unadjusted pvalues, and FDRadjusted pvalues for twolocus GxG interactions among 7 filtered SNPs. For the most part, the three measures of GxG interactions (pOR, pRR, and pChi) yielded unadjusted pvalues that were in qualitative agreement. The epistasisenriched network based on the 15 significant twolocus GxG interactions from seven SNPs in five genes appears in Figure2A.
Another goal of our AMDR analysis was to integrate numerous significant GxG interactions into a continuous epistasis enriched risk score for the prediction of which patients would have active arthritis despite MTX treatment. A higher epistasis enriched risk score would indicate that a patient carried more highrisk genotypic combinations in loci with significant GxG interactions, and vice versa. To compare prediction accuracies based on the number of candidate SNPs as well as the presence or absence of adjustment for multiple comparisons, we generated epistasis enriched risk scores from 82 significant GxG interactions from 34 SNPs (Figure2B).
Subjects with persistent active arthritis had significantly higher mean and median epistasis enriched risk scores compared to subjects without active arthritis (p < 0.0001). When 82 GxG interactions from 34 SNPs with unadjusted pvalues < 0.0167 were used to generate epistasis enriched risk scores (Figure2B boxplot inset), these scores ranged from 0 to 44. A higher risk score suggests that a subject is less likely to respond favorably to MTX treatment. The ROC curve assessing the overall ability of the epistasis enriched risk score to distinguish between subjects with active joints and subjects without active joints had 85% area under the curve (p < 0.0001). (The 0.0167 cutoff for unadjusted pvalues was chosen to maximize this area.) We correctly classify 82% of the subjects if we predict that those with epistasis enriched risk scores above 11.5 have active joints and that those with epistasis enriched risk scores below 11.5 do not have persistent joint involvement.
Examination of the five genes in the 15interaction model presented in Figure2A reveals a testable hypothesis for future studies. All genes fall within a pathway leading to purine biosynthesis and adenosine formation: SLC25A32 transports folates from the cytoplasm to mitochondria; MTHFD2 is a component of the mitochondrial folate pathway that produces onecarbon donors in the form of formate (10formyltetrahydrofolate) exclusively to support de novo purine biosynthesis; and ITPA, ATIC, and GART are involved in purine biosynthesis. Thus, all genes map to a core pathway associated with adenosine accumulation, which is considered to be a mechanism of action of MTX that contributes to response in JIA and Rheumatoid Arthritis.
Discussion
In this work, we have proposed an Aggregated Multifactor Dimensionality Reduction (AMDR) model to elucidate complex and nonlinear genetic associations contributing to disease risk and variability in response to treatment. The proposed method is innovative in three important ways: 1) a continuous GxG enriched risk score is generated to replace the dichotomous risk factor in prediction of susceptibility to disorders; 2) new measures of genegene interaction using pOR, pRR, and pChi along with pvalues and confidence intervals are proposed to detect and characterize multiple genegene interactions; and, 3) a radial network is generated to depict patterns of epistasis. This approach allows for prediction on not just a single interactive model, which is important given the growing appreciation in human genetics for the accumulative impact of a large number of variants with low effect size[28]. By pooling moderate and interrelated genetic contributors together, the AMDR model becomes robust and predictive of complex traits. In addition to GxG interactions, the AMDR can also be applied to model geneenvironment interactions where environmental risk factors such as smoking, alcohol consumption, exercise, and diet can be incorporated into multifactorial models.
The original MDR model selects an optimal multifactorial (SNP) combination for each twoway, threeway or higher order interaction. When multiple genes function together in a pathway, the original MDR is prone to overlook genes with weaker signals and lose power for selecting one optimal GxG interaction in crossvalidation. For the MTX data, the optimal twolocus interaction detected by the original MDR among 7 candidate SNPs was ATIC (rs4673990) + MTHFD2 (rs12196) with testing accuracy of 0.73 (p=0.0005). However, there exist other pairs of interactions with comparable accuracy. As a result, CVC, which measures the percentage of times that an optimal GxG interaction is selected when splitting the training and validation sets randomly, was not significant (CVC=8/10, p=0.2700). Our AMDR analysis in Table5 identified 15 pairs of twolocus interactions. When multiple GxG interactions with bioequivalent effects are involved in epistasis, the original MDR will select an optimal model, by chance and lose some of the real pathwaybased signals. The recent extended MDR methods, including ORMDR[17], LMMDR[18] and GMDR[19], adopt the same strategy of selecting one optimal GxG interaction as does the original MDR, which means they have the same limitations.
A continuous GxG enriched risk score is another major distinction between AMDR and all the majority of existing MDR models, in which a binary risk factor is utilized to predict the outcome variable. For Mway interactions, the existing MDR models classify ~3 ^{M} genotypic combinations as either highrisk or lowrisk. AMDR evolves from the traditional MDR outputs to the predisposing risk scores and epistasis based network as shown in Figure2.
Another important result of the simulation experiments is the potential of AMDR to detect models that include genetic heterogeneity. Previous work with the original MDR has shown that heterogeneity is disastrous when using MDR to detect interactions[26][29]. Because of the use of the continuous enrichment score, AMDR is less impacted by heterogeneity in the enclosed simulations. Further evaluation of this initial result with expanded simulations and real data applications will be an important next step.
We explore a radial network (Figure2A) to depict patterns of epistasis. From the systems biology perspective, genetic variants might jointly impact the disease susceptibility and response to treatment. The genegene interaction network reveals intriguing information when interpreted in the context of what we know about the folate pathway and the effect that MTX has upon the disruption of this pathway as it relates to arthritis. ATIC and MTHFD2 were the two genes with the strongest interaction, and it is of interest to note that the genes included in the model (Figure2A) include a transporter involved in folate uptake into mitochondria, SLC25A32, and the bifunctional methylenetetrahydrofolate dehydrogenasecyclohydrolase MTHFD2, a key constituent of the mitochondrial folate pathway. The mitochondrial folate pathway is responsible for the generation of formate (in the form of 10formylTHF) specifically to support purine biosynthesis, represented by ATIC, GART, and ITPA. The antiinflammatory effect of lowdose MTX used to treat JIA and RA is thought to be due the antiinflammatory effects of adenosine, formed as a consequence of the inhibitory effects of MTX on aminoimidazole carboxamide ribonucleotide (AICAR) transformylase (gene name, ATIC), which promotes the accumulation of AICAR ribotide, inhibiting adenosine deaminase and leading to a build up of adenosine, a potent antiinflammatory agent[30]. A disruption of this process may result in a decreased antiinflammatory effect of the drug. Therefore, the combined effect of SNPs in ATIC and MTHFD2 may indeed yield a more clinically apparent result by altering the antiinflammatory effects of methotrexate. There is a potential to apply the proposed method to GWAS study by dissecting SNPs into pathways in order to detect GxG interactions in GWAS pathways. The major computational challenges from the proposed AMDR and other approach in MDR framework are in the generation of pvalues for MDR. MDR permutation computing time is largely dependent on the dimension of data sets. In other words, the computing time increases as the number of SNPs and/or the number of subjects increases. Several works have been devoted to improve the efficiency and shorten the computing time in MDR analysis in highthroughput data[5, 31, 32]. We will defer interested readers to the corresponding citations for computing issues in highthroughput MDR analysis. These computational limitations make our strategy appropriate in large scale candidate gene studies, but may be limited in application to genomewide association studies until further improvements in computing speed are realized or very largescale computing resources are available.
In summary, bioinformatics challenges remain in detecting and modeling epistasis in complex biological traits. We have developed a new AMDR framework to interpret complex genetic variation and have proposed predicting an outcome using a continuous risk factor. Several other extensions and modifications of the original MDR have been proposed in the literature. Incorporation of valuable features from other MDR extension models into the AMDR framework is worth further investigation. Prospective studies and validation in independent samples are needed to assess reliability of the AMDR model’s predictive ability. Tools for statistical inference, including asymptotic distributions of the proposed test statistics, need to be developed to save computing time and improve reliability.
Appendix
Appendix I. Justification, 95% confidence intervals and permutation tests of pOR, pRR and pChi.
Since the predisposing risk factor (Table1) is conditioned on the naïve Bayes classifier, standard inference procedures based on normal or chisquare asymptotic distributions with 1 degree of freedom do not apply to the numerators in (1)(3), which are the unadjusted odds ratio (OR), relative risk (RR) and Chisquare statistics (Chi). As a result, 95% confidence intervals of OR and RR are often greater than 1 under H_{0}. To address this issue, we propose pOR, pRR and pChi by taking the null distribution of unadjusted statistics into account. Let x=pOR,pRR, or pChi, F(x) be the cumulative distribution function of the corresponding statistic under the alterative hypothesis (GxG interaction present) F _{0}(x),be the cumulative distribution function of the corresponding statistic under the null hypothesis (GxG interaction absent) and F _{0} ^{− 1}(x) be the inverse function of F _{0}(x). The corrected pOR, pRR and pChi are then defined by\frac{x}{{F}_{0}^{1}\left(F\left(x\right)\right)}. Under H _{0}, F(x)=F _{0}(x), so pOR, pRR and pChi should equal\frac{x}{{F}_{0}^{1}\left(F\left(x\right)\right)}=\frac{x}{{F}_{0}^{1}\left({F}_{0}\left(x\right)\right)}=1. This adjustment will ensure the insignificant GxG interactions to have 95% confidence interactions cross 1 under Under H _{0}. In this work, we evaluated pOR, pRR and pChi using a full data set while these methods can also be evaluated under the cross validation scheme typically used in MDR.
The functions F(x) and F _{0}(x) can be estimated by the corresponding empirical distribution function. Permutation is applied to estimate F _{0}(x) by reshuffling the relationship between SNPs and a phenotype, where SNPs for each individual in a system are maintained as a vector to preserve their correlation structure. For each permutation, we generated odds ratio (OR), relative risk (RR) and chisquare test statistic (Chi). Jackknife resampling was applied to estimate F(x) by generating random subsets of data, where 80% to 90% of subjects were randomly selected. SNPs and the phenotype from each subject are maintained as a vector to preserve the association between SNPs and the phenotype. Denote the OR, RR and Chi statistic from permutation or resampling as x _{1},x _{2},…x _{ B } where B is the number of permutation or resampling. The null distribution function F _{0}(x) and F(x) can be estimated by{B}^{1}{\displaystyle {\sum}_{i=1}^{B}{I}_{\left\{{x}_{i}\le x\right\}}}. The 95% confidence interval pOR, pRR and pChi can be obtained by resampling. Denote the pOR, pRR and pChi statistics from resampling or permutation as z _{1},z _{2}…,z _{B} then the 95% confidence interval pOR, pRR and pChi is the interval from 2.5 to 97.5 percentile of z _{1},z _{2}…,z _{B} from resampling. The pvalue for pOR, pRR and pChi for the i ^{th} GxG interaction, denoted by Pvalue _{ i }, will be calculated by the permutation testing, i.e.\mathit{\text{Pvalu}}{e}_{i}={B}^{1}{\displaystyle {\sum}_{i=1}^{B}{I}_{\left\{z<{z}_{i}\right\}}} where z _{1},z _{2}…,z _{B} are calculated from permutation samples and z is the pOR, pRR and pChi statistic calculated from the current data.
References
Moore JH: Detecting, characterizing, and interpreting nonlinear genegene interactions using multifactor dimensionality reduction. Adv Genet. 2010, 72: 101116.
Cantor RM, Lange K, Sinsheimer JS: Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Am J Hum Genet. 2009, 86 (1): 622.
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69 (1): 138147. 10.1086/321276.
Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting genegene and geneenvironment interactions. Bioinformatics. 2003, 19 (3): 376382. 10.1093/bioinformatics/btf869.
Bush WS, Dudek SM, Ritchie MD: Parallel multifactor dimensionality reduction: a tool for the largescale analysis of genegene interactions. Bioinformatics. 2006, 22 (17): 21732174. 10.1093/bioinformatics/btl347.
Winham SJ, MotsingerReif AA: An R package implementation of multifactor dimensionality reduction. BioData Min. 2011, 4 (1): 24. 10.1186/17560381424.
RobnikSiknja M, Kononeko I: Theoretical and empirical analysis of RelifF and RReliefF. Mach Learn. 2003, 53: 2369. 10.1023/A:1025667309714.
Dai H, Bhandary M, Becker ML, Leeder SJ, Gaedigk R, MotsingerReif AA: Global tests of pvalues for multifactor dimensionality reduction models in selection of optimal number of target genes. BioData Min. 2012, 5 (1): 3. 10.1186/1756038153.
Motsinger AA, Ritchie MD: The effect of reduction in crossvalidation intervals on the performance of multifactor dimensionality reduction. Genet Epidemiol. 2006, 30 (6): 546555. 10.1002/gepi.20166.
Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Predication. 2001, NewYork, USA: Springer,
Good P: Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. 2000, New York, USA:Springer,
Moore JH, Williams SM: New strategies for identifying genegene interactions in hypertension. Ann Med. 2002, 34 (2): 8895. 10.1080/07853890252953473.
Andrew AS, Karagas MR, Nelson HH, Guarrera S, Polidoro S, Gamberini S, Sacerdote C, Moore JH, Kelsey KT, Demidenko E: DNA repair polymorphisms modify bladder cancer risk: a multifactor analytic strategy. Hum Hered. 2008, 65 (2): 105118. 10.1159/000108942.
Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS: Multifactordimensionality reduction shows a twolocus interaction associated with Type 2 diabetes mellitus. Diabetologia. 2004, 47 (3): 549554. 10.1007/s0012500313213.
Becker ML, Gaedigk R, van Haandel L, Thomas B, Lasky A, Hoeltzel M, Dai H, Stobaugh J, Leeder JS: The effect of genotype on methotrexate polyglutamate variability in juvenile idiopathic arthritis and association with drug response. Arthritis Rheum. 2011, 63 (1): 276285. 10.1002/art.30080.
Dervieux T, Wessels JA, van der Straaten T, Penrod N, Moore JH, Guchelaar HJ, Kremer JM: Genegene interactions in folate and adenosine biosynthesis pathways affect methotrexate efficacy and tolerability in rheumatoid arthritis. Pharmacogenet Genomics. 2009, 19 (12): 935944. 10.1097/FPC.0b013e32833315d1.
Chung Y, Lee SY, Elston RC, Park T: Odds ratio based multifactordimensionality reduction method for detecting genegene interactions. Bioinformatics (Oxford England). 2007, 23 (1): 7176. 10.1093/bioinformatics/btl557.
Lee SY, Chung Y, Elston RC, Kim Y, Park T: Loglinear modelbased multifactor dimensionality reduction method to detect gene gene interactions. Bioinformatics (Oxford England). 2007, 23 (19): 25892595. 10.1093/bioinformatics/btm396.
Lou XY, Chen GB, Yan L, Ma JZ, Zhu J, Elston RC, Li MD: A generalized combinatorial approach for detecting genebygene and genebyenvironment interactions with application to nicotine dependence. Am J Hum Genet. 2007, 80 (6): 11251137. 10.1086/518312.
Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, Moore JH: A balanced accuracy function for epistasis modeling in imbalanced datasets using multifactor dimensionality reduction. Genet Epidemiol. 2007, 31 (4): 306315. 10.1002/gepi.20211.
Calle ML, Urrea V, Vellalta G, Malats N, Steen KV: Improving strategies for detecting genetic patterns of disease susceptibility in association studies. Stat Med. 2008, 27 (30): 65326546. 10.1002/sim.3431.
Bush WS, Edwards TL, Dudek SM, McKinney BA, Ritchie MD: Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction. BMC Bioinformatics. 2008, 9: 238. 10.1186/147121059238.
Lou XY, Chen GB, Yan L, Ma JZ, Mangold JE, Zhu J, Elston RC, Li MD: A combinatorial approach to detecting genegene and geneenvironment interactions in family studies. Am J Hum Genet. 2008, 83 (4): 457467. 10.1016/j.ajhg.2008.09.001.
Mei H, Cuccaro ML, Martin ER: Multifactor dimensionality reductionphenomics: a novel method to capture genetic heterogeneity with use of phenotypic variables. Am J Hum Genet. 2007, 81 (6): 12511261. 10.1086/522307.
Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006, 241 (2): 252261. 10.1016/j.jtbi.2005.11.036.
Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for detecting genegene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003, 24 (2): 150157. 10.1002/gepi.10218.
Giannini EH, Ruperto N, Ravelli A, Lovell DJ, Felson DT, Martini A: Preliminary definition of improvement in juvenile arthritis. Arthritis Rheum. 1997, 40 (7): 12021209.
Yang J, Manolio TA, Pasquale LR, Boerwinkle E, Caporaso N, Cunningham JM, de Andrade M, Feenstra B, Feingold E, Hayes MG: Genome partitioning of genetic variation for complex traits using common SNPs. Nat Genet. 2011, 43 (6): 519525. 10.1038/ng.823.
Ritchie MD, Edwards TL, Fanelli TJ, Motsinger AA: Genetic heterogeneity is not as threatening as you might think. Genet Epidemiol. 2007, 31 (7): 797800. 10.1002/gepi.20256.
Cronstein BN, Naime D, Ostad E: The antiinflammatory mechanism of methotrexate. Increased adenosine release at inflamed sites diminishes leukocyte accumulation in an in vivo model of inflammation. J Clin Invest. 1993, 92 (6): 26752682. 10.1172/JCI116884.
Oki NO, MotsingerReif AA: Multifactor dimensionality reduction as a filterbased approach for genome wide association studies. Front Genet. 2011, 2: 80
Yang C, Wan X, He Z, Yang Q, Xue H, Yu W: The choice of null distributions for detecting genegene interactions in genomewide association studies. BMC Bioinformatics. 2011, 12 (Suppl 1): S26. 10.1186/1471210512S1S26.
Acknowledgements
This work is supported for collaboration between HD and AMR by Bursary Award of the 1^{st} Short Course on Statistical Genetics and Genomics from University of Alabama at Birmingham from the National Institute of Health R25GM093044 (PI: Tiwari). Special thanks to two reviewers for instructive comments to help us improve the manuscript.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
There are no competing interests to this work.
Authors’ contributions
HD conceived of the study. RC and AMR aided in study design and statistical method. HD performed the simulations and data analysis. MB and SL performed the clinical data collection, genotyping and interpretation of case study findings. All authors contributed to the manuscript writing. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Dai, H., Charnigo, R.J., Becker, M.L. et al. Risk score modeling of multiple gene to gene interactions using aggregatedmultifactor dimensionality reduction. BioData Mining 6, 1 (2013). https://doi.org/10.1186/1756038161
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1756038161