 SOFTWARE ARTICLE
 Open Access
 Open Peer Review
gammaMAXT: a fast multipletesting correction algorithm
 François Van Lishout^{1, 2}Email author,
 Francesco Gadaleta^{1, 2},
 Jason H. Moore^{3},
 Louis Wehenkel^{1, 2} and
 Kristel Van Steen^{1, 2}
 Received: 7 June 2015
 Accepted: 8 November 2015
 Published: 20 November 2015
Abstract
Background
The purpose of the MaxT algorithm is to provide a significance test algorithm that controls the familywise error rate (FWER) during simultaneous hypothesis testing. However, the requirements in terms of computing time and memory of this procedure are proportional to the number of investigated hypotheses. The memory issue has been solved in 2013 by Van Lishout’s implementation of MaxT, which makes the memory usage independent from the size of the dataset. This algorithm is implemented in MBMDR3.0.3, a software that is able to identify genetic interactions, for a variety of SNPSNP based epistasis models effectively. On the other hand, that implementation turned out to be less suitable for genomewide interaction analysis studies, due to the prohibitive computational burden.
Results
In this work we introduce gammaMAXT, a novel implementation of the maxT algorithm for multiple testing correction. The algorithm was implemented in software MBMDR4.2.2, as part of the MBMDR framework to screen for SNPSNP, SNPenvironment or SNPSNPenvironment interactions at a genomewide level. We show that, in the absence of interaction effects, teststatistics produced by the MBMDR methodology follow a mixture distribution with a point mass at zero and a shifted gamma distribution for the top 10 % of the strictly positive values. We show that the gammaMAXT algorithm has a power comparable to MaxT and maintains FWER, but requires less computational resources and time. We analyze a dataset composed of 10^{6} SNPs and 1000 individuals within one day on a 256core computer cluster. The same analysis would take about 10^{4} times longer with MBMDR3.0.3.
Conclusions
These results are promising for future GWAIs. However, the proposed gammaMAXT algorithm offers a general significance assessment and multiple testing approach, applicable to any context that requires performing hundreds of thousands of tests. It offers new perspectives for fast and efficient permutationbased significance assessment in largescale (integrated) omics studies.
Keywords
 Multiple testing
 Genomewide interaction studies
 MaxT
 Gamma distribution
 SNPenvironment interactions
 3order interactions
 Algorithmic
Background
 (1)
Minimize the amount of false discoveries.
 (2)
Achieve sufficient statistical power to detect relevant pairs.
 (3)
Reduce the computational burden implied by the high number of tests for interactions.
 (4)
Provide a versatile software package that accommodates different study designs and study features, including flexibility in trait measurement types and the possibility to adjust for important predictor variables and confounders.
Available software
Among the numerous software designed for pairwise or higherorder SNPSNP interactions, we recall BOOST [10], BiForce [11], epiGPU [12], EpiBlaster [13], GLIDE [14], Multifactor Dimensionality Reduction (MDR) [15, 16] and ModelBased Multifactor Dimensionality Reduction (MBMDR) [17, 18]. The following comparison of these approaches is mainly inspired from [19] who review and discuss several practical aspects GWAIs typically involve. BOOST is a software based on fast Boolean operations, to quickly search for epistasis associated with a binary outcome. Its main drawbacks are its inability to accommodate missing data and its necessity to perform a multiple testing correction outside the software package. BiForce is a regressionbased tool handling binary and continuous outcomes, that can take account of missing genotypes and has a builtin multiple testing correction algorithm. Although, the latter is based on a fast Bonferroni correction implementation, it leads to reduced power for GWAIs, as further discussed in Multipletesting correction Section. EpiBlaster, epiGPU and GLIDE are all GPUbased approaches. An obvious drawback of GPUdependent software is that it is tuned for a particular GPUinfrastructure. Therefore, users are advocated to acquire the exact same infrastructure and only experts can adapt the code to specific needs. Note that users willing to work on dedicated hardware to speed up the computations can even turn to fieldprogrammable gate array (FPGa) [20]. MDR is a nonparametric alternative to traditional regressionbased methods that converts two or more variables into a single lowerdimensional attribute. The end goal is to identify a representation that facilitates the detection of nonlinear or nonadditive interactions. Overfitting issues in MDR are solved via crossvalidation and permutations. Since the design of MDR, several adaptations have been made [21]. MBMDR breaks with the tradition of crossvalidation and invests computing time in permutationbased multiple multilocus significance assessments and the implementation of the most appropriate association test for the data at hand. It is able to correct for important main effects. Its main asset compared to the other methods is its versatility. MBMDR can for instance be used to highlight geneenvironment or genegeneenvironment interactions in relation to a trait of interest, while efficiently controlling type I error rates. The trait can either be expressed on a binary or continuous scale, or as a censored trait. MDMDR3.0.3 is a C++ software tool based on the MBMDR methodology, achieving good results regarding objectives (1), (2) and (4) [22, 23]. However, concerns about computational efficiency remain when scaling up to exhaustive genomewide interaction contexts. In this work we introduce a new version of the software, MDMDR4.2.2, based on a novel multipletesting correction algorithm, with the purpose of improving the performances along objective (3), with the same benefits as before regarding the other three ones.
Multipletesting correction
In GWAIs, the most global null hypothesis is that none of the SNPs pairs, nor their main effects, are associated with the trait. Testing each pair independently at level α does not control the overall FWER at level α; an adjustment is needed for the fact that multiple tests are performed. One such adjustment can be realized via a Bonferroni correction [24]. This is a so called singlestep procedure for strong FWER control. Singlestep methods tend to be conservative though and improvements in power can be achieved by so called stepdown procedures [25]. Among these we recall step down minP adjusted pvalues (minP) and step down maxT adjusted pvalues (maxT). These methods guarantee strong control of the FWER under the subset pivotality assumption and weak control under all conditions [26]. Both procedures are available in MDMDR3.0.3, the adjusted pvalues being estimated by permutation. Since a high number of pairs of SNPs are tested, minP tends to be more conservative than maxT [25]. Furthermore, minP requires more computations than maxT. For these reasons, maxT is the default choice in MDMDR3.0.3. Note that the drawback of maxT compared to minP, is that when the test statistics are not identically distributed unbalanced adjustments can be observed because not all tests contribute equally to the computed adjusted pvalues.
Van Lishout’s MaxT
(1) Compute the teststatistics for all pairs, but only store the n highest tests values. The result is a Real data 
vector where T _{0,1}≥T _{0,2}≥…≥T _{0,n }. 
(2) Initialise a vector p of size n with 1’s. 
(3) Perform the following operations for i=1,…,B: 
(a) Generate a random permutation of the trait column. 
(b) Compute T _{ i,1},…,T _{ i,n } and store them in a Permutation _{ i } vector. 
(c) Compute the maximum M _{ i } of the teststatistics values T _{ i,n+1},…,T _{ i,m }. 
(d) Replace T _{ i,n } by M _{ i } if T _{ i,n }<M _{ i }. 
(e) Force the monotonicity of the Permutation _{ i } vector: for j=n−1,…,1 replace T _{ i,j } by T _{ i,j+1} if T _{ i,j }<T _{ i,j+1}. 
(f) For each j=1,…,n, if T _{ i,j }≥T _{0,j } increment p _{ j } by one. 
(4) Divide all values of vector p by B+1 to obtain the pvalues vector. Force monotonicity: for j=1,…,n−1, 
replace p _{ j+1} by p _{ j } if p _{ j+1}<p _{ j }. 
Bottlenecks of Van Lishout’s maxT
Analysis of the computing times of the different steps of Van Lishout’s implementation of MaxT on a dataset containing 1 million SNPs
Theoretical value  Numerical value  

Step 1  O(m)  O(10^{11}) 
Step 2  O(n)  O(10^{3}) 
Step 3 (a)  O(B)  O(10^{3}) 
Step 3 (b)  O(B n)  O(10^{6}) 
Step 3 (c)  O(B m)  O(10^{14}) 
Step 3 (d)  O(B)  O(10^{3}) 
Step 3 (e)  O(B n)  O(10^{6}) 
Step 3 (f)  O(B n)  O(10^{6}) 
Step 4  O(n)  O(10^{3}) 
Table 2 reflects that in step 1 of Van Lishout’s maxT, as many elementary computations are carried out as there are SNP pairs to test. Although significance assessment can be based on fewer SNP pairs, this first step of computing test values and ordering them cannot be avoided nor simplified. However, the most computationally intensive part of the significance assessment procedure is step 3(c). With 10^{6} inputted SNPs, the number of elementary computations required is proportional to 10^{14}. Therefore, any improvement at this stage will lead to better overall performances. In “Methods” section, we introduce a novel algorithm for multiple testing, based on Van Lishout’s implementation of maxT. It is implemented in the software MBMDR4.2.2 and resolves remaining concerns about maxT’s computation time in genomewide screens for genetic interactions using the MBMDR framework.
Methods
In MBMDR4.2.2 the value of M _{ i } from Fig. 1 will be estimated from a sample from [ T _{ i,n+1},…,T _{ i,m }] rather than calculated exactly. A detailed explanation of how we perform such an improvement is provided in the next section.
Distribution of MBMDR statistics
We have indicated before that MBMDR offers a flexible framework to test for SNPSNP interactions. The software in which the framework is implemented has a modular builtup that allows a flexible choice of association test, depending on the input data. For instance, for quantitative traits, ttests or nonparametric equivalents can be carried out. For binary traits, chisquared test of independence can be chosen. The association test that best reflects the data at hand is used in both stage 1 and stage 2 of the MBMDR framework [27]. After the data manipulation of combining cells using trait information, MBMDR’s final test statistic no longer follows the theoretical sample distribution of the initially chosen test statistic. In fact, earlier work has shown that such sequential pooling may lead to permutationbased distributions of within MBMDR test statistics that depend on the number of multilocus genotype cells pooled [28] or on the minor allele frequencies (MAFs) of the SNP pair under consideration [29]. Rather than looking at the null distribution of the test statistic linked to a SNPpair, we are now interested in the distribution of a number of test values over several SNPpairs, from which to derive the maximum value M _{ i }. We hypothesize that test values in [ T _{ i,n+1},…,T _{ i,m }], with i>0, follow a mixture distribution of a shifted gamma distribution [30] and a point mass at zero. Note that zero test values are induced by scenarios for which the MBMDR test statistic cannot be computed. In MBMDR4.2.2, whenever a group of subjects (e.g., in a 2SNP interaction study, those subjects having two copies of the minor allele at each locus) is compared to the remaining subjects with respect to the trait under study and by using an appropriate association test statistic, this group can either be associated to a higher “risk” (“H” category), a lower “risk” (“L” category) or undecisive “risk” (nor “H”, nor “L”; “O” category) for the trait. Here, “risk” is used loosely. For instance for continuous traits, the “risk” categories above may rather refer to increased (“H” category), decreased (“L” category) mean trait values. Also, in the MBMDR methodology, risk scales can be refined to incorporate multiple risk categories. It is important to realize that if all subjects are assigned the same label (in this scenario, most probably the “O” label), then MBMDR will return an exact zero. It is not surprising that lack of power of GWAIs (which depends on sample size but also true effect size) will induce such technical zeros for a significant proportion of the tested SNP pairs. In order to take this important amount of zeros into account, we use the approach described in [31]. We assign a discrete probability mass to the exact zero value. Hence, if \({\mathcal {X}}_{i}\) is a random variable returning a random value from [ T _{ i,n+1},…,T _{ i,m }], with i>0, we can define the probabilities \(\pi = P({{\mathcal {X}}_{i}} > 0)\) and \(1\pi = P({{\mathcal {X}}_{i}} = 0)\). Therefore, the distribution of \({\mathcal {X}}_{i}\) is semicontinuous with a discontinuity at zero. This implies that the probability density function is \( {f}_{{\mathcal{X}}_i}(x)=\left(1\pi \right)\delta (x)+\pi {g}_{{\mathcal{X}}_i}(x){1}_{\left(x>0\right)} \), where δ(x) is a point probability mass at x=0, \(g_{{\mathcal {X}}_{i}}(x)\) is the distribution of the strictly positive values and \(\mathbbm {1}_{(x>0)}\) is an indicator function taking the value 1 if x>0 and 0 otherwise. The parameter π depends on the data at hand and can be estimated with the Maximum Likelihood Estimation (MLE) method [32] from the observed frequency in a sample from [ T _{ i,n+1},…,T _{ i,m }]. Due to the fact that our main goal consists in predicting a maximum, we are not particularly interested in fitting the distribution of \(g_{{\mathcal {X}}_{i}}(x)\) on the entire set of strictly positive values. As a matter of fact, fitting the tail of \(g_{{\mathcal {X}}_{i}}(x)\) should suffice. We show in the next section that focusing on the top 10 % strictly positive values is an acceptable practical choice. We consider this a good tradeoff between fitting on a large and a smaller range of positive values. The former might lead to a poor fit of the tail, because many samples might not belong to that range. The latter might lead to a poor fit of the tail due to an insufficient number of samples. The amount of values belonging to the top 10 % strictly positive values in [ T _{ i,n+1},…,T _{ i,m }] is given by \(q=\frac {(mn)\pi }{10}\).
Assumption 1
We assume that the shifted gamma distribution is a good fit to the tail of \(g_{{\mathcal {X}}_{i}}(x)\). Hence, if \({\mathcal {Y}}_{i}\) is a random variable returning a value from the aforementioned top 10 % of strictly positive values, we postulate that its cumulative distribution function (CDF) is given by \(F_{{\mathcal {Y}}_{i}}(\,y) = \frac {\gamma \left (k,\frac {yy_{0}}{\theta }\right)}{\Gamma (k)}\), where γ is the lower incomplete gamma function, y _{0} is the location parameter, k is the shape parameter and θ the scale parameter. Some authors discourage the use of the gamma distribution for model fitting due to the difficulty of parameter estimation [33]. However, in the specific case of fitting the tail of the distribution of the MBMDR statistics, we believe that simpler models would be consistently inaccurate. Moreover, the lack of knowledge regarding the shape of a plausible distribution and the diversity of the data we are performing our computations on, make a versatile distribution function like the gamma, a reasonable assumption. Note that the choice of shifting the gamma distribution comes naturally due to the fact that the smallest strictly positive value should not be in the neighborhood of zero. Indeed, a small value would represent a lowsignificant association between the interaction of the two loci and the phenotype. As previously mentioned, this would lead to the “O” category for all subjects and an exact zero. The CDF of the random variable \({\mathcal {Z}}_{i}\) returning the maximum of the q values belonging to the top 10 % strictly positive values in [ T _{ i,n+1},…,T _{ i,m }] is given by \(F_{{\mathcal {Z}}_{i}}(z) = \left [\frac {\gamma \left (k,\frac {zy_{0}}{\theta }\right)}{\Gamma (k)}\right ]^{q}\). Indeed, if we take q independent and identically distributed (i.i.d.) values y _{1},y _{2},…,y _{ q }, then \(P[\!(\,y_{1} \le y_{t})\wedge (\,y_{2} \le y_{t}) \wedge \ldots \wedge (\,y_{q} \le y_{t})] = [\!F_{{\mathcal {Y}}_{i}}(y_{t})]^{q} = F_{{\mathcal {Z}}_{i}}(z)\).
Assumption 2
Mean and variance of the fitted parameters for datasets D _{1}−D _{4}
D _{1}  D _{1}  D _{2}  D _{2}  D _{3}  D _{3}  D _{4}  D _{4}  

Mean  Var  Mean  Var  Mean  Var  Mean  Var  
π  0.337  1.247×10^{−6}  0.335  3.815×10^{−6}  0.137  4.948×10^{−7}  0.366  9.356×10^{−7} 
y _{0}  7.742  5.566×10^{−4}  7.825  8.778×10^{−4}  6.189  6.472×10^{−4}  7.788  3.805×10^{−4} 
k  1.017  2.612×10^{−4}  1.012  2.534×10^{−4}  0.990  3.580×10^{−4}  1.017  1.725×10^{−4} 
θ  1.917  1.462×10^{−3}  1.974  1.532×10^{−3}  1.694  1.829×10^{−3}  1.917  9.695×10^{−4} 
Estimating the parameters of the shifted gamma distribution
Step 3(c) of gammaMAXT
(1) If (i modulo 20 = 1) estimate π,y _{0},k and θ: 
(a) Set z=0. Create vector v of size S. 
(b) Randomly select integer r in [n+1,m]. 
(c) If T _{ i,r }=0, z=z+1, else store T _{ i,r } in v. 
(d) Repeat steps (b) and (c) until v is full. 
(e) Sort v. Remove the 90 % lowest values. The new size of v is \(N=\frac {S}{10}\). 
(f) Estimate \(\pi = \frac {S}{z + S}\). 
(g) Estimate y _{0} by the minimum of v. 
(h) Estimate k: see below. 
(i) Estimate \(\theta = \frac {1}{kN}\sum \limits _{i=1}^{N} (v[i]y_{0})\). 
(2) If (i modulo 20 ≠ 1), use the latest estimated values of π,y _{0},k and θ. 
(3) Sample M _{ i } from the distribution of the maximum, whose CDF is \(F_{{\mathcal {Z}_{i}}}(z) = \left [\frac {\gamma \left (k,\frac {zy_{0}}{\theta }\right)}{\Gamma (k)}\right ]^{\frac {(mn)\pi }{10}}\). 
Sample M _{ i } when CDF is \(F_{{\mathcal {Z}}_{i}}(z)\)
(a) Take a too high initial guess of M _{ i } (default: 1000). Initialize variable b to half of this value. 
(a) Randomly select a real number r _{ n }∈[ 0,1]. 
(c) If \(F_{{\mathcal {Z}}_{i}}(M_{i})\) is lower than r _{ n }, M _{ i }=M _{ i }+b, else M _{ i }=M _{ i }−b. Divide b by 2. 
(d) Repeat step (c) until b is below the desired precision (default: 0.000001). 
Parallel workflow
gammaMAXT parallel workflow
(1) Each cluster node \(c=1\dots C\) performs an equitable fraction of the computations of the T _{0,1},…,T _{0,m } 
values from Fig. 1. The n highest values (and corresponding SNP pair indexes) from each node are saved 
into file top_c.txt. 
(2) Upon termination of all computations at the previous step, a cluster node aggregates all top_c.txt files and 
retrieves the overall n highest values (and corresponding SNP pair indexes). Results are saved into topfile.txt. 
(3) Each cluster node reads topfile.txt, initialize a vector V of size n with 0’s and performs an equitable fraction 
of the B permutations of Fig. 1. For each permutation i attributed to node c: 
(a) Generate a random permutation of the trait column. 
(b) Compute T _{ i,1},…,T _{ i,n } and store them in a Permutation _{ i } vector. 
(c) Execute step (3)(c) of the gammaMAXT algorithm to estimate M _{ i }. 
(d) Replace T _{ i,n } by M _{ i } if T _{ i,n }<M _{ i }. 
(e) Force the monotonicity of the Permutation _{ i } vector: for j=n−1,…,1 replace T _{ i,j } by T _{ i,j+1} if T _{ i,j }<T _{ i,j+1}. 
(f) For each j=1,…,n, if T _{ i,j }≥T _{0,j } increment V _{ j } by one. 
Upon completion of all computations on node c, save V into file permut_c.txt. 
(4) A cluster node sums all vectors from the permut_c.txt files to obtain a vector p. All elements of p are 
incremented by 1 and divided by B+1. The monotonicity is forced: for j=1,…,n−1, replace p _{ j+1} by p _{ j } 
if p _{ j+1}<p _{ j }. 
Results and discussion
In this section, we first show results supporting the two assumptions on which the novel algorithm is based. Then, we analyse the performances in terms of computingtime, power and control of the FWER.
Results supporting assumption 1

A simulated dataset D _{1} expressed on a binary scale, composed of 1000 SNPs and 1000 individuals. Table 7 states the twolocus penetrance table used to generate it. A balanced number of cases and controls is sampled. The minor allele frequencies of the functional SNPs are fixed at 0.5 and those of the nonfunctional SNPs are randomly generated from a uniform distribution on [0.05, 0.5]. This corresponds to the first of six purely epistatic models discussed in [15]. Furthermore, any value in the dataset had a 5 % chance to be missing.Table 7
Twolocus penetrance table used to create the simulated datasets D _{1}, D _{2} and D _{3}
b/b
b/B
B/B
a/a
0
0.1
0
a/A
0.1
0
0.1
A/A
0
0.1
0

A simulated dataset D _{2}, with the same properties as D _{1}, except that the trait is expressed on a continuous scale.

A simulated dataset D _{3}, with the same properties as D _{1}, except that the MAF’s are on average lower, i.e. the nonfunctional SNPs were randomly generated from a uniform distribution on [0.05, 0.1].

A reallife dataset D _{4} on Crohn’s disease, for which the trait is expressed on a binary scale [37, 38], reduced to 12471 SNPs and 1687 subjects as in [23].
Results supporting assumption 2
In this section, we show results supporting the hypothesis that parameters π, y _{0}, k and θ are stable across permutations. We perform MBMDR4.2.2 analyses on datasets D _{1} to D _{4}, using the default settings. For this experiment, we modified the gammaMAXT algorithm such that it fits new parameters for each of the 999 permutations (not only once every 20 as previously mentioned) and saves these into a file. We report their means and variances in Table 3. We observe that the variance is very low across all scenarios.
Computingtime of the gammaMAXT algorithm
Execution times of MBMDR4.2.2. The parallel workflow was tested on a 256core computer cluster (Intel L5420 2.5 GHz). The sequential executions were performed on a single core of this cluster
SNPs  MBMDR4.2.2  MBMDR4.2.2  MBMDR4.2.2  MBMDR4.2.2 

Binary trait  Binary trait  Continuous trait  Continuous trait  
sequential execution  parallel workflow  sequential execution  parallel workflow  
10^{3}  13 min 33 sec  20 sec  13 min 18 sec  18 sec 
10^{4}  52 min 15 sec  1 min 05 sec  56 min 14 sec  53 sec 
10^{5}  64 h 35 min  22 min 15 sec  70 h 03 min  20 min 28 sec 
10^{6}  ≈ 270 days  25 h 12 min  ≈ 290 days  24 h 06 min 
FWER of the gammaMAXT algorithm

A set S _{1} of 1000 datasets, each composed of 1000 SNPs and 1000 individuals, containing null data generated randomly from a uniform distribution on [0.05, 0.5]. A balanced number of cases and controls is sampled.

A set S _{2} with the same properties as S _{1}, except that the trait is expressed on a continuous scale.

A set S _{3} of 200 datasets, each composed of 10^{4} SNPs and 1000 individuals, constructed in the same way as S _{1}.

A set S _{4} with the same properties as S _{3}, except that the trait is expressed on a continuous scale.
Observed FWER of MBMDR4.2.2
Set  Amount datasets  Observed FWER 

S _{1}  1000  4.5 % 
S _{2}  1000  6.2 % 
S _{3}  200  7 % 
S _{4}  200  6.5 % 
Power of the gammaMAXT algorithm
Power comparison between the gammaMAXT and the MaxT algorithms
Heritability  gammaMAXT  MaxT 

0.0100  3.7 %  4.2 % 
0.0125  17.9 %  19.4 % 
0.0150  50.3 %  51.5 % 
0.0175  67.0 %  68.7 % 
0.0200  86.6 %  87.9 % 
0.0225  94.3 %  94.7 % 
0.0250  97.5 %  97.8 % 
0.0275  99.2 %  99.3 % 
0.0300  99.6 %  99.6 % 
Conclusion
In this work we introduced gammaMAXT, a novel implementation of the maxT algorithm for multiple testing correction. The algorithm was implemented in software MBMDR4.2.2, as part of the MBMDR framework to screen for SNPSNP, SNPenvironment or SNPSNPenvironment interactions at a genomewide level. In this context, we analyzed a dataset composed of 10^{6} SNPs and 1000 individuals within one day on a 256core computer cluster. The same analysis would take about 10^{4} times longer with Van Lishout’s implementation of maxT, which was already an improvement of the classic Westfall and Young implementation [26]. These results are promising for future GWAIs. However, the proposed gammaMAXT algorithm offers a general significance assessment and multiple testing approach, applicable to any context that requires performing hundreds of thousands of tests. It offers new perspectives for fast and efficient permutationbased significance assessment in largescale (integrated) omics studies.
Availability
MBMDR4.2.2 can be downloaded for free at http://www.statgen.ulg.ac.be.
Declarations
Acknowledgements
This research was in part funded by the Fonds de la Recherche Scientifique (F.N.R.S.), in particular “Integrated complex traits epistasis kit” (Convention 2.4609.11) [FVL, KVS]. We also acknowledge research opportunities offered by F.N.R.S., “Foresting in Integromics Inference” (Convention T.0180.13) [FG, KVS]. In addition, this paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control, and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office [FVL, FG, LW, KVS]. JHM was funded by National Institutes of Health (USA) grant LM009012. The scientific responsibility rests with the authors.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Shastry BS. Pharmacogenetics and the concept of individualized medicine. Pharmacogenomics J. 2006; 6(1):16–21.PubMedView ArticleGoogle Scholar
 van’t Veer LJ, Bernards R. Enabling personalized cancer medicine through analysis of geneexpression patterns. Nature. 2008; 452(7187):564–70.PubMedView ArticleGoogle Scholar
 Galas DJ, Hood L. Systems biology and emerging technologies will catalyze the transition from reactive medicine to predictive, personalized, preventive and participatory (p4) medicine. Interdisc Bio Central. 2009; 1:1–4.View ArticleGoogle Scholar
 Beevers CG, McGeary JE. Therapygenetics: moving towards personalized psychotherapy treatment. Trends Cogn Sci. 2012; 16(1):11–12.PubMedView ArticleGoogle Scholar
 Lester KJ, Eley TC. Therapygenetics: Using genetic markers to predict response to psychological treatment for mood and anxiety disorders. Biology of mood and anxiety disorders. 2013; 3(1):1–16.View ArticleGoogle Scholar
 Slatkin M. Epigenetic inheritance and the missing heritability problem. Genetics. 2009; 182(3):845–50.PubMedPubMed CentralView ArticleGoogle Scholar
 Eichler EE, Flint J, Gibson G, Kong A, Lean S, Moore JH, et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010; 11(6):446–50.PubMedPubMed CentralView ArticleGoogle Scholar
 Lee SH, Wray NR, Goddard ME, Visscher PM. Estimating missing heritability for disease from genomewide association studies. Am J Hum Genet. 2011; 88(3):294.PubMedPubMed CentralView ArticleGoogle Scholar
 Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc Natl Acad Sci. 2012; 109(4):1193–98.PubMedPubMed CentralView ArticleGoogle Scholar
 Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NL, et al. Boost: A fast approach to detecting genegene interactions in genomewide casecontrol studies. Am J Hum Genet. 2010; 87:325–40.PubMedPubMed CentralView ArticleGoogle Scholar
 Gyenesei A, Moody J, Semple CA, Haley CS, Wei WH. Highthroughput analysis of epistasis in genomewide association studies with biforce. Bioinformatics. 2012; 19:376–82.Google Scholar
 Hemani G, Theocharidis A, Wei W, Haley C. epigpu: exhaustive pairwise epistasis scans parallelized on consumer level graphics cards. Bioinformatics. 2011; 27:1462–1465.PubMedView ArticleGoogle Scholar
 KamThong T, Czamara D, Tsuda K, Borgwardt K, Lewis C, ErhardtLehmann A, et al. epiblasterfast exhaustive twolocus epistasis detection strategy using graphical pro cessing units. Eur J Hum Genet. 2011; 19:465–71.PubMedView ArticleGoogle Scholar
 KamThong T, Azencott C, Cayton L, Putz B, Altmann A, Karbalai N, et al. Glide: Gpubased linear regression for detection of epistasis. Hum Hered. 2012; 73:220–36.PubMedView ArticleGoogle Scholar
 Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting genegene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemil. 2003; 24(2):150–7.View ArticleGoogle Scholar
 Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene–gene and gene–environment interactions. Bioinformatics. 2002; 19(3):376–82.View ArticleGoogle Scholar
 Calle ML, Urrea V, Vellalta G, Malats N, Van Steen K. Improving strategies for detecting genetic patterns of disease susceptibility in association studies. Stat Med. 2008; 27:6532–546.PubMedView ArticleGoogle Scholar
 Cattaert T, Calle ML, Dudek SM, Mahachie John JM, Van Lishout F, Urrea V, et al. Modelbased multifactor dimensionality reduction for detecting epistasis in casecontrol data in the presence of noise. Ann Hum Genet. 2011; 75:78–89.PubMedView ArticleGoogle Scholar
 Gusareva E, Van Steen K. Practical aspects of genomewide association interaction analysis. Hum Genet. 2014; 133(11):1343–58.PubMedView ArticleGoogle Scholar
 Wienbrandt L, Kässens JC, GonzalezDominguez J, Schmidt B, Ellinghaus D, Schimmler M. FPGAbased Acceleration of Detecting Statistical Epistasis in GWAS In: Science PC, editor. 14th International Conference on Computational Science. Elsevier  Procedia Computer Science, vol 29;2014. p. 220–30. http://www.sciencedirect.com/science/article/pii/S1877050914001975.
 Van Steen K. Traveling the world of genegene interactions. Brief Bioinform. 2011; 13(1):1–19.View ArticleGoogle Scholar
 Mahachie John JM, Cattaert T, Van Lishout F, Gusareva E, Van Steen K. Lowerorder effects adjustment in quantitative traits modelbased multifactor dimensionality reduction. PLoS ONE. 2012; 7(1):29594–1013710029594.View ArticleGoogle Scholar
 Van Lishout F, Mahachie John JM, Gusareva ES, Urrea V, Cleynen I, Théâtre E, et al. An efficient algorithm to perform multiple testing in epistasis screening. BMC Bioinforma. 2013;14(138). http://www.biomedcentral.com/14712105/14/138.
 Dunn OJ. Multiple comparisons among means. J Am Stat Assoc. 1961; 56(293):52–64.View ArticleGoogle Scholar
 Ge Y, Dudoit S, Speed TP. Resamplingbased multiple testing for microarray data analysis. Technical Report 633. Berkley: Department of Statistics, University of California; 2003.Google Scholar
 Westfall PH, Young SS. Resamplingbase Multiple Testing. New York: Wiley; 1993.Google Scholar
 Mahachie John JM, Van Lishout F, Van Steen K. Modelbased multifactor dimensionality reduction to detect epistasis for quantitative traits in the presence of errorfree and noisy data. Eur J Hum Genet. 2011; 19(6):696–703.PubMedPubMed CentralView ArticleGoogle Scholar
 Calle ML, Urrea V, Malats N, Van Steen K. Mbmdr: modelbased multifactor dimensionality reduction for detecting interactions in highdimensional genomic data. Technical Report 24. 2008.Google Scholar
 Mahachie John JM. Genomic association screening methodology for highdimensional and complex data structures: Detecting norder interactions. 2012. http://orbi.ulg.ac.be/handle/2268/136086.
 Kotz S, Balakrishnan N, Johnson N. Continuous Multivariate Distributions, Models and Applications: Wiley; 2000.Google Scholar
 Hautsch N, Malec P, Schienle M. Capturing the zero: A new class of zero augmented distributions and multiplicative error processes. J Financ Econ. 2013; 12(1):89.Google Scholar
 Bickel P, Doksum K. Mathematical Statistics, Basic Ideas and Selected Topics: PrenticeHall, Inc; 1977.Google Scholar
 Allenby GM, Leone RP, Jen LC. A dynamic model of purchase timing with application to direct marketing. J Am Stat Assoc. 1999; 94:365–74.View ArticleGoogle Scholar
 Pattin KA, White BC, Barney N, Gui J, Nelson HH, Kelsey KT, et al. A computationally efficient hypothesis testing method for epistasis analysis using multifactor dimensionality reduction. Genet Epidemiol. 2009; 33(1):87–94.PubMedPubMed CentralView ArticleGoogle Scholar
 Minka TP. Estimating a gamma distribution. 2002. http://research.microsoft.com/enus/um/people/minka/papers/minkagamma.pdf.
 Choi SC, Wette R. Maximum likelihood estimation of the parameters of the gamma distribution and their bias. Technometrics. 1969; 11(4):683–90.View ArticleGoogle Scholar
 Libioulle C, Louis E, Hansoul S, Sandor C, Farnir F, Franchimont D, et al. Novel crohn disease locus identified by genomewide association maps to a gene desert on 5p13.1 and modulates expression of ptger4. Plos Genetics. 2007; 3(4):58.View ArticleGoogle Scholar
 Barett JC, Hansoul S, Nicolae DL, Cho JH, Duerr RH, Rioux JD, et al. Genomewide association defines more than 30 distinct susceptibility loci for crohn’s disease. Nat Genet. 2008; 40(8):955–62.View ArticleGoogle Scholar
 Urbanowicz RJ, Kiralis J, SinnottArmstrong NA, Heberling T, Fisher JM, Moore JH. Gametes: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData Mining. 2012; 5(1):16. http://www.ncbi.nlm.nih.gov/pubmed/23025260.PubMedPubMed CentralView ArticleGoogle Scholar
 Bradley J. Robustness?Br J Math Stat Psychol. 1978; 31:144–52.View ArticleGoogle Scholar