Risk score modeling of multiple gene to gene interactions using aggregated-multifactor dimensionality reduction

Background Multifactor Dimensionality Reduction (MDR) has been widely applied to detect gene-gene (GxG) interactions associated with complex diseases. Existing MDR methods summarize disease risk by a dichotomous predisposing model (high-risk/low-risk) from one optimal GxG interaction, which does not take the accumulated effects from multiple GxG interactions into account. Results We propose an Aggregated-Multifactor Dimensionality Reduction (A-MDR) method that exhaustively searches for and detects significant GxG interactions to generate an epistasis enriched gene network. An aggregated epistasis enriched risk score, which takes into account multiple GxG interactions simultaneously, replaces the dichotomous predisposing risk variable and provides higher resolution in the quantification of disease susceptibility. We evaluate this new A-MDR approach in a broad range of simulations. Also, we present the results of an application of the A-MDR method to a data set derived from Juvenile Idiopathic Arthritis patients treated with methotrexate (MTX) that revealed several GxG interactions in the folate pathway that were associated with treatment response. The epistasis enriched risk score that pooled information from 82 significant GxG interactions distinguished MTX responders from non-responders with 82% accuracy. Conclusions The proposed A-MDR is innovative in the MDR framework to investigate aggregated effects among GxG interactions. New measures (pOR, pRR and pChi) are proposed to detect multiple GxG interactions.


Background
Human diseases usually have complex inheritance patterns, and single nucleotide polymorphisms (SNPs) has been utilized to explain the variation in susceptibility to many common complex diseases as well as the response to drug therapy. The advancement of genotyping technology has made genotypic data readily accessible to investigators at low cost. However, many challenges remain with regard to identifying genes that render people susceptible to non-Mendelian disorders and in understanding the associations and functional relationships among genes. More and more, researchers have been advocating for advanced statistical analysis to quantify complex and interactive biological and genetic relationships [1,2].
Multifactor Dimensionality Reduction (MDR) is a statistical paradigm for characterizing and detecting nonlinear complex gene-to-gene interactions (epistasis) possibly associated with susceptibility to disease [3]. When numerous genes are involved in a complex canonical pathway or network, traditional approaches for data analysis, such as a Chi-square test or Fisher's exact test, might not detect the associations between risk factors and outcomes since these approaches assess only marginal main effects of the identified risk factors. Although one can employ logistic regression or other standard multivariate categorical data analysis approaches to explore interactions among SNPs, there are an enormous number of possible interactions in a model with both linear and nonlinear effects. Consequently, standard multivariate categorical data analysis approaches might detect very few interactions, and even then the cost in terms of sample size might be immense. MDR addresses these difficulties by converting high-dimensional genotypic data into a single predictive variable. Genotypic combinations are used to define high risk and low risk strata for the one-dimensional predisposing risk factor. MDR can reveal non-linear epistasis at a moderate sample size with no requirements on the underlying distributions of genotypes or outcomes [4].
The most commonly used MDR approach is described in detail by Ritchie et al. [3]. To distinguish this method from its various extensions, we will refer to it as the original MDR method. Related statistical software has been developed by Hahn et al. [4], Bush et al. [5], Winham and Motsinger-Reif [6], and Moore and colleagues (www.epistasis.org). In general, the MDR process can also be combined with a filter preprocess step by first applying global testing and filtration techniques to select the optimal number of SNPs for MDR analysis by searching for a subset of genes likely to interact with other genes using the ReliefF filtering process [7,8].
Details of the MDR [3] are briefly described here. MDR performs an exhaustive search of all variables and variable combinations to identify univariate or multivariate disease risk models. For each locus or multi-locus combination, attribute construction is performed to make a single variable with two categories: high risk and low-risk. A genotype or combination of genotypes is assigned high risk status if the ratio of affected subjects to unaffected subjects exceeds a pre-determined threshold, and low-risk otherwise. This step consolidates the high-dimensional risk space into a one-dimensional predictive variable. Typically, a 5fold or 10-fold cross-validation procedure is employed, beginning with the random division of the original data set into five or ten subsets of approximately equal sizes [9]. For 10-fold cross-validation, a model is fit using nine of the ten subsets (collectively referred to as training data), and then the model is applied to classify observations in both the training data and the tenth subset not used to fit the model (referred to as validation data). This entire process is repeated ten times, with one of the ten subsets acting as the validation data [10]. The model's training accuracy and testing accuracy are defined as the percentage of correct classifications in the corresponding data sets. The optimal one-locus, two-locus, and threelocus MDR models with the highest testing accuracy are identified. A one-locus model estimates the main effect of each SNP, while multi-locus models investigate the interactions among relevant SNPs. The cross validation consistency (CVC) is the number of times in a 10-fold cross validation that a particular multifactorial combination is identified as an optimal model for the training data. Finally, statistical significance of the optimal models is assessed by 1000-or 10000-count permutation testing [11].
In the present work, we propose an Aggregated-Multifactor Dimensionality Reduction (A-MDR) method that exhaustively searches for statistically significant gene-gene (GxG) interactions to generate a gene interaction network. In particular, an epistasis enriched risk score replaces the traditional dichotomous predisposing risk factor in quantifying the degree of susceptibility to a disease. We also introduce and compare new GxG interaction measures (pOR, pRR and pChi). An adjustment for multiple comparisons is implemented to limit false positive discoveries. In the current study we introduce the new approach, evaluate its performance in a range of simulations, and apply it to a real dataset from Juvenile Idiopathic Arthritis patients treated with methotrexate (MTX).

Method
The A-MDR method proposed herein can be applied to detect interactions among alleles, genotypes, and other categorical explanatory variables. Without loss of generality, we present the A-MDR method for SNPs with three common states (0 -homozygous reference, 1 -heterozygous, 2 -homozygous variant). The steps of the A-MDR are outlined in Figure 1.

Detect multiple GxG interactions using the pOR, pRR or pChi test
The starting point for the A-MDR method is the construction of a predisposing risk factor. Suppose we want to investigate k-way GxG interactions among M SNPs. For one particular k-way GxG interaction, there are 3 k (SNP (1) = 0, 1, 2 × SNP (2) = 0, 1, 2 × ⋯ × SNP (k) = 0, 1, 2) different genotypic combinations. Denote these 3 k genotypic combinations as C ij where j = 1, 2, ⋯, 3 k stands for different genotypic combinations within one k-way GxG interaction. We need another subscript i ¼ 1; 2; ⋯; M k to cover all k-way GxG interactions among M SNPs. Let X ij and Y ij be the numbers of affected (Test) and unaffected (Control) subjects in the j th genotypic combination of the i th k-way GxG inter- Y ij ! ∈ 0; 1 ð Þ be the threshold for disease risk above which a person is deemed highly susceptible using a Naïve Bayes classifier [25]. Genotypic combinations are then classified into high-risk predisposing risk groups and low-risk predisposing risk groups, n 11 ,n 12 ,n 21 ,n 22 , in Table 1 where I{ · } denotes an indicator function. We propose to perform one of the following measures to assess the i th k-way GxG interaction (the subscript i is omitted for n's and e's): (a) the predisposing odds ratio (pOR), The significant GxG interactions used to define the epistasis enriched risk score are collectively referred to as an epistasis enriched network. See Figure 2A. The risk score will be aggregated (added) together over all k-way GxG interactions for each subject. Higher score indicates that a subject carries more high risk genotypic combinations. See Formula 4.
The AUC under ROC curve provides an overall evaluation of the epistasis enriched risk score's ability to predict disease risk. The final epistasis enriched risk score will be generated based on GxG interactions passing the p-value (or FDR) hurdle α that maximizes the AUC of ROC curves. See Non-significant GxG interactions will be removed from further analysis. Pvalues may be adjusted by multiplicity algorithms (i.e. FDR).
Finally, the epistasis enriched risk score, , is treated as a diagnostic test for the disease. See Figure 2B.
where e st ¼ n sþ n þt N is the expected number of subjects in predisposing risk stratum s (1 = high predisposing risk, 2 = low predisposing risk) and disease stratum t (1 = Case, 2 = Control) under a null hypothesis of no association between the predisposing risk factor and the disease. Details of F and F 0 , along with permutation tests and 95% confidence intervals for pOR, pRR and pChi are in Appendix I.

Aggregate high risk from significant GxG interactions into risk scores
Assume a study investigates a total of N subjects. For the n th (n = 1, 2, ⋯, N) subject, an aggregated k-way epistasis enriched risk score, R(k,n), is defined by In equation (4), we use the indicator variable, I Pvalue i <α f g , to remove the i th nonsignificant GxG interaction from further calculation of risk scores. For the remaining significant GxG interactions, the indicator function, I{n ∈ C ij }I{X ij /(X ij + Y ij ) > p i0 }, assigns 1 if the n th subject carries a high predisposing risk genotypic combination and 0 if the n th subject carries a low predisposing risk genotypic combination. More specifically, I{X ij /(X ij + Y ij ) > p i0 } indicates whether the j th genotypic combination has high predisposing risk and I{n ∈ C ij } checks whether the n th subject carries the j th genotypic combination. Each subject's risk scores are then summed over all 3 k genotypic combinations and all k-way GxG interactions to obtain an aggregated k-way epistasis enriched risk score, R(k,n), for the n th subject. Finally, the epistasis enriched risk score, R(k,n), is treated as a diagnostic test for the disease. One can consider adding up the predisposing scores with p < 0.05 from these interactions. Our experience has been that, in many cases, α=0.05 is an arbitrary cutoff for p-values and some GxG interactions with p-value < 0.05 might have low predictive ability. Therefore, a receiver operating characteristic (ROC) curve is constructed, and the area under the ROC curve (AUC) provides an overall evaluation of the epistasis enriched risk score's ability to predict disease susceptibility. Instead of using an arbitrary cutoff α=0.05, we propose to selectα that maximizes the AUC of ROC curves. Choosingα ¼ arg max 0≤α≤0:05 AUC α j g f forα≤0:05 may help focusing on a modest number of summands that are most conducive to correct classification (arising from the strongest GxG interactions), rather than diluting the epistasis enriched risk score R(k,n) by a large number of summands that are not as conducive to correct classification (arising from comparatively weaker GxG interactions).

Construct multiple GxG interactions into an epistasis enriched network
The significant GxG interactions used to define the epistasis enriched risk score are collectively referred to as an epistasis enriched network ( Figure 2A). Genes involved in one or more significant interactions appear as nodes in a radial graph. Pairs of genes sharing significant interactions are connected by lines. Each line is labeled with the corresponding pOR, and the line thickness may be chosen to accentuate the strongest pORs.
In summary, the A-MDR method has not only replaced the dichotomous predisposing risk factor with a continuous predictive variable, R(k,n), but has done so by integrating numerous significant GxG interactions into an epistasis enriched network that may more adequately explain the susceptibility to complex diseases. Epistasis enriched risk scores may also be accumulated over GxG interactions from multiple dimensions. For instance, we can accumulate both two-way and three-way GxG interactions. Or further consider accumulating one way main effects and up to k-way GxG interactions. The feasibility of these extensions needs to be assessed by future studies.

Empirical assessment
Extensive Monte Carlo simulations were performed to assess the performance of pOR, pRR and pChi and compare them to the original MDR in unrelated casecontrol studies. To avoid subjective selections of models in favor of our methods, we report the power and type I error simulation similar to the models previously assessed by [26]. In each model, we simulated five SNPs with common homozygote (AA), heterozygote (Aa) and rare homozygote (aa). The minor allele frequencies (MAF) in each model were set to be 0.5 and 0.25 respectively and the genotypes were generated according to proportional expectations under Hardy-Weinberg equilibrium and linkage equilibrium. Equal numbers of affected and unaffected subjects were generated based on penetrance functions given in Table 2. Let D =1 indicate the onset of disease and l1,l2.....,l5 stand for five loci, P l1×l2 and P l4×l5 be penetrance functions from loci 1x2 and loci 4x5 respectively as listed in Table 2. Our simulations comprised three major scenarios: A) Existence of only one two-way interaction in loci 1x2 associated with disease susceptibility while the remaining loci were unrelated to the outcome variable, i.e. P(D = 1|l1, l2) = p l1 × l2 . B) Genetic heterogeneity models where a proportion of affected subjects are linked with interactions between loci 1 and 2 and the rest of affected subjects are linked with interactions between loci 4 and 5, i.e.
C) Additive models with two pairs of loci jointly contributing to disease susceptibility. Let D l1×l2 =1 denote the onset of disease due to the penetrance P l1×l2 from loci 1*2 and D l4×l5 =1 due to the penetrance P l4×l5 from loci 4*5. The susceptibility function is given by To assess the power of the A-MDR method, we randomly generated 100 sets of data in the above described scenarios and performed the MDR and A-MDR tests for each random sample. The power is the percentage of rejection of null hypothesis for the loci with a GxG interaction. Type I error is the percentage of rejection of null hypothesis when the simulated loci have no GxG interaction. As shown in Table 3, the Type I errors of all tests were under the nominal rate of 5%. When only loci 1 and 2 had an interaction (Table 3A), all five measures had strong power to detect GxG interactions in most models except that the power of MDR dropped to 0.46 in model 3 with n=300, which could be enhanced by increasing sample size. We next simulated genetic heterogeneity (Table 3B) with 0.5/0.5, 0.7/0.3 proportions of subjects affected by a mixture model of epistasis from loci 1x2 and loci 4*5. We noticed that MDR lost power to detect interactions with weaker effects. The power was not recovered with increased sample sizes. We last examined the additive models (Table 3C) where susceptibility increases jointly through loci 1x2 and loci 4x5. MDR has low power to detect both pairs of GxG interactions. All this evidence suggests that the proposed A-MDR might be a better choice for detecting complex GxG interactions, especially when multiple GxG interactions are cumulatively contributing to a phenotype.

Application to genotyping data
Juvenile Idiopathic Arthritis (JIA) is one of the most common chronic diseases of childhood, affecting an estimated 300,000 children in the U.S. alone, and is an important cause of morbidity and disability in children. Although methotrexate (MTX) is the most commonly used second-line anti-inflammatory agent used to treat JIA worldwide, this antifolate prodrug has shown considerable inter-individual variability in clinical response and adverse reactions. Thus far, variables investigated as potential useful predictors of response and toxicity in patients taking MTX, which is used alone and as an "anchor drug" for many rheumatic conditions, have not been clearly associated with outcomes. The effect of individual genetic SNP variation within the folate pathway upon MTX response has been investigated in several studies in adult rheumatoid arthritis (RA) and a few studies in JIA with conflicting results. To elucidate the genetic architecture impacting the efficacy of MTX, 34 SNPs from 19 folate pathway genes were measured in 104 subjects. Response-defined as the absence of active arthritis (swelling not due to bony enlargement or, if no swelling was present, limitation of motion accompanied by either pain on motion or tenderness, not due to trauma or explained by prior joint damage [27]-was determined for these subjects. Information pertaining to the 34 SNPs is listed in Table 4; demographic and clinical characteristics of the subjects were described by Becker and colleagues [15]. After being on a stable dose and route of MTX for at least 3 months, 56.7% of the patients (59 out of 104) still had at least 1 active (i.e., swollen or tender) joint. The presence of active arthritis was the outcome variable, and represented an incomplete response to MTX. By definition, the absence of active arthritis-no joint involvement-was considered a positive response to MTX treatment. Standard logistic regression analysis did not identify significant main effects from SNPs or GxG interactions. This could primarily due to the complex and non-linear interactions among SNPs. For the rest article, the A-MDR and original MDR methods were applied to the MTX data to search for genetic predictors of response to MTX. Redundant SNPs and SNPs with no prediction of the phenotype were removed by the ReliefF algorithm [7]. A complete set of 34 SNPs and 7 filtered SNPs were analyzed respectively.
The original MDR analysis method was applied to obtain the one-locus, two-locus, and three-locus models with the highest validation accuracy in the original MDR. Twolocus interactions between genes ATIC and MTHFD2 were significant in testing accuracy but not in CVC. The prediction accuracy from the optimal MDR model was 75%.
We then utilized pOR, pRR, and pChi to identify and characterize GxG interactions in the A-MDR analyses. Numerous two-locus GxG interactions were significantly associated with efficacy of MTX based on pOR, pRR, and pChi. We found that pChi flagged the most GxG interactions as significantly associated with efficacy of MTX. After we used the ReliefF algorithm to narrow down the list of candidate SNPs to 7, 15 pairs GxG interactions were flagged as significant after FDR correction. Table 5 lists pOR, pRR, and pChi along with 95% confidence intervals, unadjusted p-values, and FDRadjusted p-values for two-locus GxG interactions among 7 filtered SNPs. For the most part, the three measures of GxG interactions (pOR, pRR, and pChi) yielded unadjusted p-values that were in qualitative agreement. The epistasis-enriched network based on the 15 significant two-locus GxG interactions from seven SNPs in five genes appears in Figure 2A.
Another goal of our A-MDR analysis was to integrate numerous significant GxG interactions into a continuous epistasis enriched risk score for the prediction of which patients would have active arthritis despite MTX treatment. A higher epistasis enriched risk score would indicate that a patient carried more high-risk genotypic combinations in loci with significant GxG interactions, and vice versa. To compare prediction accuracies based on the number of candidate SNPs as well as the presence or absence of adjustment for multiple comparisons, we generated epistasis enriched risk scores from 82 significant GxG interactions from 34 SNPs ( Figure 2B).
Subjects with persistent active arthritis had significantly higher mean and median epistasis enriched risk scores compared to subjects without active arthritis (p < 0.0001). When 82 GxG interactions from 34 SNPs with unadjusted p-values < 0.0167 were used to generate epistasis enriched risk scores ( Figure 2B boxplot inset), these scores ranged from 0 to 44. A higher risk score suggests that a subject is less likely to respond favorably to MTX treatment. The ROC curve assessing the overall ability of the epistasis enriched risk score to distinguish between subjects with active joints and subjects without active joints had 85% area under the curve (p < 0.0001). (The 0.0167 cutoff for unadjusted p-values was chosen to maximize this area.) We correctly classify 82% of the subjects if we predict that those with epistasis enriched risk scores above 11.5 have active joints and that those with epistasis enriched risk scores below 11.5 do not have persistent joint involvement.
Examination of the five genes in the 15-interaction model presented in Figure 2A reveals a testable hypothesis for future studies. All genes fall within a pathway leading to purine biosynthesis and adenosine formation: SLC25A32 transports folates from the cytoplasm to mitochondria; MTHFD2 is a component of the mitochondrial folate pathway that produces one-carbon donors in the form of formate (10-formyl-tetrahydrofolate) exclusively to support de novo purine biosynthesis; and ITPA, ATIC, and GART are involved in purine biosynthesis. Thus, all genes map to a core pathway associated  with adenosine accumulation, which is considered to be a mechanism of action of MTX that contributes to response in JIA and Rheumatoid Arthritis.

Discussion
In this work, we have proposed an Aggregated Multifactor Dimensionality Reduction (A-MDR) model to elucidate complex and non-linear genetic associations contributing to disease risk and variability in response to treatment. The proposed method is innovative in three important ways: 1) a continuous GxG enriched risk score is generated to replace the dichotomous risk factor in prediction of susceptibility to disorders; 2) new measures of gene-gene interaction using pOR, pRR, and pChi along with p-values and confidence intervals are proposed to detect and characterize multiple gene-gene interactions; and, 3) a radial network is generated to depict patterns of epistasis. This approach allows for prediction on not just a single interactive model, which is important given the growing appreciation in human genetics for the accumulative impact of a large number of variants with low effect size [28]. By pooling moderate and inter-related genetic contributors together, the A-MDR model becomes robust and predictive of complex traits. In addition to GxG interactions, the A-MDR can also be applied to model gene-environment interactions where environmental risk factors such as smoking, alcohol consumption, exercise, and diet can be incorporated into multi-factorial models. The original MDR model selects an optimal multi-factorial (SNP) combination for each two-way, three-way or higher order interaction. When multiple genes function together in a pathway, the original MDR is prone to overlook genes with weaker signals and lose power for selecting one optimal GxG interaction in cross-validation. For the MTX data, the optimal two-locus interaction detected by the original MDR among 7 candidate SNPs was ATIC (rs4673990) + MTHFD2 (rs12196) with testing accuracy of 0.73 (p=0.0005). However, there exist other pairs of interactions with comparable accuracy. As a result, CVC, which measures the percentage of times that an optimal GxG interaction is selected when splitting the training and validation sets randomly, was not significant (CVC=8/10, p=0.2700). Our A-MDR analysis in Table 5 identified 15 pairs of two-locus interactions. When multiple GxG interactions with bio-equivalent effects are involved in epistasis, the original MDR will select an optimal model, by chance and lose some of the real pathwaybased signals. The recent extended MDR methods, including OR-MDR [17], LM-MDR [18] and G-MDR [19], adopt the same strategy of selecting one optimal GxG interaction as does the original MDR, which means they have the same limitations.
A continuous GxG enriched risk score is another major distinction between A-MDR and all the majority of existing MDR models, in which a binary risk factor is utilized to predict the outcome variable. For M-way interactions, the existing MDR models classify~3 M genotypic combinations as either high-risk or low-risk. A-MDR evolves from the traditional MDR outputs to the predisposing risk scores and epistasis based network as shown in Figure 2.
Another important result of the simulation experiments is the potential of A-MDR to detect models that include genetic heterogeneity. Previous work with the original MDR has shown that heterogeneity is disastrous when using MDR to detect interactions [26] [29]. Because of the use of the continuous enrichment score, A-MDR is less impacted by heterogeneity in the enclosed simulations. Further evaluation of this initial result with expanded simulations and real data applications will be an important next step.
We explore a radial network (Figure 2A) to depict patterns of epistasis. From the systems biology perspective, genetic variants might jointly impact the disease susceptibility and response to treatment. The gene-gene interaction network reveals intriguing information when interpreted in the context of what we know about the folate pathway and the effect that MTX has upon the disruption of this pathway as it relates to arthritis. ATIC and MTHFD2 were the two genes with the strongest interaction, and it is of interest to note that the genes included in the model (Figure 2A) include a transporter involved in folate uptake into mitochondria, SLC25A32, and the bifunctional methylenetetrahydrofolate dehydrogenase-cyclohydrolase MTHFD2, a key constituent of the mitochondrial folate pathway. The mitochondrial folate pathway is responsible for the generation of formate (in the form of 10-formylTHF) specifically to support purine biosynthesis, represented by ATIC, GART, and ITPA. The anti-inflammatory effect of low-dose MTX used to treat JIA and RA is thought to be due the anti-inflammatory effects of adenosine, formed as a consequence of the inhibitory effects of MTX on amino-imidazole carboxamide ribonucleotide (AICAR) transformylase (gene name, ATIC), which promotes the accumulation of AICAR ribotide, inhibiting adenosine deaminase and leading to a build up of adenosine, a potent anti-inflammatory agent [30]. A disruption of this process may result in a decreased anti-inflammatory effect of the drug. Therefore, the combined effect of SNPs in ATIC and MTHFD2 may indeed yield a more clinically apparent result by altering the anti-inflammatory effects of methotrexate. There is a potential to apply the proposed method to GWAS study by dissecting SNPs into pathways in order to detect GxG interactions in GWAS pathways. The major computational challenges from the proposed A-MDR and other approach in MDR framework are in the generation of p-values for MDR. MDR permutation computing time is largely dependent on the dimension of data sets. In other words, the computing time increases as the number of SNPs and/or the number of subjects increases. Several works have been devoted to improve the efficiency and shorten the computing time in MDR analysis in high-throughput data [5,31,32]. We will defer interested readers to the corresponding citations for computing issues in highthroughput MDR analysis. These computational limitations make our strategy appropriate in large scale candidate gene studies, but may be limited in application to genome-wide association studies until further improvements in computing speed are realized or very large-scale computing resources are available.
In summary, bioinformatics challenges remain in detecting and modeling epistasis in complex biological traits. We have developed a new A-MDR framework to interpret complex genetic variation and have proposed predicting an outcome using a continuous risk factor. Several other extensions and modifications of the original MDR have been proposed in the literature. Incorporation of valuable features from other MDR extension models into the A-MDR framework is worth further investigation. Prospective studies and validation in independent samples are needed to assess reliability of the A-MDR model's predictive ability. Tools for statistical inference, including asymptotic distributions of the proposed test statistics, need to be developed to save computing time and improve reliability.
Since the predisposing risk factor (Table 1) is conditioned on the naïve Bayes classifier, standard inference procedures based on normal or chi-square asymptotic distributions with 1 degree of freedom do not apply to the numerators in (1)- (3), which are the unadjusted odds ratio (OR), relative risk (RR) and Chi-square statistics (Chi). As a result, 95% confidence intervals of OR and RR are often greater than 1 under H 0 . To address this issue, we propose pOR, pRR and pChi by taking the null distribution of unadjusted statistics into account. Let x=pOR,pRR, or pChi, F(x) be the cumulative distribution function of the corresponding statistic under the alterative hypothesis (GxG interaction present) F 0 (x),be the cumulative distribution function of the corresponding statistic under the null hypothesis (GxG interaction absent) and F 0 − 1 (x) be the inverse function of F 0 (x). The corrected pOR, pRR and pChi are then defined by . Under H 0 , F(x)=F 0 (x), so pOR, pRR and pChi should equal This adjustment will ensure the insignificant GxG interactions to have 95% confidence interactions cross 1 under Under H 0 . In this work, we evaluated pOR, pRR and pChi using a full data set while these methods can also be evaluated under the cross validation scheme typically used in MDR.
The functions F(x) and F 0 (x) can be estimated by the corresponding empirical distribution function. Permutation is applied to estimate F 0 (x) by reshuffling the relationship between SNPs and a phenotype, where SNPs for each individual in a system are maintained as a vector to preserve their correlation structure. For each permutation, we generated odds ratio (OR), relative risk (RR) and chi-square test statistic (Chi). Jackknife re-sampling was applied to estimate F(x) by generating random subsets of data, where 80% to 90% of subjects were randomly selected. SNPs and the phenotype from each subject are maintained as a vector to preserve the association between SNPs and the phenotype. Denote the OR, RR and Chi statistic from permutation or re-sampling as x 1 ,x 2 ,. . .x B where B is the number of permutation or resampling. The null distribution function F 0 (x) and F(x) can be estimated by B À1 X B i¼1 I x i ≤x f g . The 95% confidence interval pOR, pRR and pChi can be obtained by resampling. Denote the pOR, pRR and pChi statistics from resampling or permutation as z 1 ,z 2 . . .,z B then the 95% confidence interval pOR, pRR and pChi is the interval from 2.5 to 97.5 percentile of z 1 ,z 2 . . .,z B from resampling. The p-value for pOR, pRR and pChi for the i th GxG interaction, denoted by Pvalue i , will be calculated by the permutation testing, i.e. Pvalue i ¼ B À1 X B i¼1 I z<z i f g where z 1 ,z 2 . . .,z B are calculated from permutation samples and z is the pOR, pRR and pChi statistic calculated from the current data.