Proposed data mining methods for epistasis detection are seldom thoroughly discussed in terms of their underlying (model) assumptions and the effects of these assumptions on overall power or type I error control. For instance, another well-known data dimensionality reduction method for quantitative traits (generalized MDR, GMDR) [35] is based on score statistics to define differential multilocus genotype groups related to the trait of interest. Although the GMDR method is not necessarily likelihood-based (least-squares regression or other statistical methods for non-normal continuous traits can, in theory, be employed as well), continuous phenotypes have only been investigated under a normal model, and the software implementation for continuous traits relies on the classical linear regression paradigm to build the score statistics. The authors did not explicitly investigate the power of their method when non-normal continuous data are involved in the context of epistasis screening. Previously, we commented on the advantages and disadvantages of GMDR-like methods compared to MB-MDR (e.g., [5, 33]). Based on these comments, here we focused on MB-MDR while investigating the effects of model violations on the performance of 2-locus multifactor dimensionality reduction methods for quantitative traits.

For every pair of loci (for 2 bi-allelic SNPs, there are theoretically 9 multilocus genotype combinations), MB-MDR with association *t*-tests successively creates two groups, where one group refers to a single multilocus genotype and the other to the remaining multilocus genotype combinations. Internally, 2-group comparison tests are performed so as to assign a “label” to each multilocus genotype. This procedure naturally creates highly imbalanced groups, with potentially extreme cases of heteroscedasticity. Although Welch’s test is designed to give a valid *t*-test in the presence of different population variances, Welch’s *t*-test combined with MB-MDR shows no improved power over Student’s *t*-test for scenarios with unequal variances, even for normally distributed traits (cf. Table 3). This can be explained by the fact that the degrees of freedom of Welch’s test become smaller for strongly unequal groups, resulting in a highly conservative test in the event of extremely unbalanced data (see, e.g., [36] and Figure 1). This motivates our choice to continue working with MB-MDR analyses based on Student’s *t*-testing to identify groups of multilocus genotypes with differential trait values, regardless of the underlying trait distribution.
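The shrinkage of the Welch degrees of freedom can be illustrated with a small simulation (hypothetical data; the Welch-Satterthwaite formula below is the standard one underlying Welch's test):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: one rare multilocus genotype (n1 = 10)
# versus the pooled remainder (n2 = 990), as arises inside MB-MDR.
x1 = rng.normal(0.0, 3.0, size=10)    # small group, larger variance
x2 = rng.normal(0.0, 1.0, size=990)   # large pooled group

def welch_df(a, b):
    """Welch-Satterthwaite approximate degrees of freedom."""
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    return (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))

df_welch = welch_df(x1, x2)
df_student = len(x1) + len(x2) - 2    # pooled-variance Student df

print(f"Student df: {df_student}, Welch df: {df_welch:.1f}")
# The Welch df collapses toward n1 - 1 when the small group dominates the
# variance of the mean difference, making the test conservative.
```

With 10 versus 990 observations, the Welch degrees of freedom drop to roughly the size of the small group, whereas the Student statistic keeps 998 degrees of freedom.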

It is well known that parametric methods have improved statistical power over non-parametric methods when all parametric model assumptions are valid [37, 38]. When an analysis of residuals detects violations of the ANOVA assumptions of normality or homogeneity of error variances across groups, remedial measures such as log-transforming the dependent variable, or adopting an ANOVA model that allows unequal variances, may work well. However, in screening settings involving many factors at a time, it is usually impractical to find a single transformation that is universally optimal for all factors. When study data do not meet the distributional assumptions of parametric methods, even after transformation, or when data involve non-interval scale measurements, a non-parametric context is more appropriate. Such a context usually implies testing based on ranks, or applying a rank transformation to the data prior to the application of a parametric test.

Strong power increases were observed when data were rank-transformed prior to MB-MDR testing with Student’s *t* association testing. This can be understood by noting that the ranks, which are computed over the pooled set of all available quantitative trait measurements, in general reduce the influence of skewness and of deviations from normality in the population distribution [39, 40]. The same is achieved by a percentile transformation (Rtn), which at the same time preserves the relative magnitude of scores between groups as well as within groups. Only for normally distributed data with equal variances, the ideal scenario for a *t*-test on original traits, was a small power loss observed. Goh and Yap [40] also concluded that rank-based transformations tend to improve power regardless of the distribution. In general, as with traditional two-group *t*-testing, deviations from normality seem to be more influential on the power of an MB-MDR analysis with Student’s *t* than deviations from homoscedasticity (Table 3). This is also in line with the observation that power estimates generally improve for scenarios in which data are transformed to normality prior to MB-MDR analysis, compared to scenarios in which they are not. The identical results obtained for untransformed traits and standardized traits (not shown) are not surprising either: standardization linearly transforms the original trait values using the overall trait mean and overall standard deviation, and such a transformation does not affect the two-group *t*-tests within MB-MDR.
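As a minimal sketch (simulated, hypothetical data), pooling all trait values and replacing them by their ranks removes the skewness of the original distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trait = rng.chisquare(df=2, size=1000)   # strongly right-skewed trait

# Ranks computed over the pooled set of all trait measurements
ranks = stats.rankdata(trait)
# Percentile transformation (Rtn-style): rescales ranks to (0, 1)
pct = (ranks - 0.5) / len(trait)

print(f"skewness of raw trait: {stats.skew(trait):.2f}")
print(f"skewness of ranks:     {stats.skew(ranks):.2f}")  # essentially 0
```

Because the ranks of distinct values are simply a permutation of 1, …, n, their distribution is symmetric whatever the shape of the original trait distribution.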

Although data transformations are valuable tools with several benefits, care has to be taken when interpreting results based on transformed data. The inference of epistasis depends on the scale of measurement, in the sense that interaction effects can be reduced or eliminated by non-linear monotonic transformations of the dependent variable [41], and hence also by some rank-based transformations. However, for our simulation scenarios, we have seen no evidence of such a reduction, nor of an increase, in interaction signals when using rank-transformed data prior to MB-MDR analysis (Tables 1, 2 and 3, Rank). Application of any epistasis screening tool to real-life data will face the challenge of matching observed statistical significance with biological relevance [1].
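A toy numerical example (our own illustration, not part of the simulation study) of how a monotone transformation can eliminate an interaction: for purely multiplicative cell means in a 2x2 factor layout, the interaction contrast is non-zero on the original scale but exactly zero after a log transform:

```python
import numpy as np

# Multiplicative cell means mu[i, j] = a_i * b_j (a = (1, 3), b = (1, 2)):
# no interaction on the log scale, by construction.
mu = np.array([[1.0, 2.0],
               [3.0, 6.0]])

# Interaction contrast (difference of differences); non-zero => interaction
inter_raw = mu[0, 0] - mu[0, 1] - mu[1, 0] + mu[1, 1]   # 1 - 2 - 3 + 6 = 2
log_mu = np.log(mu)
inter_log = log_mu[0, 0] - log_mu[0, 1] - log_mu[1, 0] + log_mu[1, 1]

print(inter_raw, inter_log)  # 2.0 and ~0.0
```

The same contrast computed on the original and the log scale thus leads to opposite conclusions about the presence of epistasis, which is exactly the interpretation caveat raised above.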

Clearly, sample size matters. The smaller the sample size, the more likely it is to obtain extremely sparse multilocus genotype combinations. By design, highly inflated type I errors are expected for the internal group comparison tests within MB-MDR, each of which contributes to the final MB-MDR results (Figures 3, 4 and Additional file 2: Figure S2). Despite these internal inflations, there is no evidence for a cumulative or combined effect on MB-MDR’s final results (Tables 1 and 2), irrespective of the assumed model violation (in terms of deviations from normality or homoscedasticity). This can be explained by the permutation-based step-down maxT approach, which is currently adopted by MB-MDR to correct for multiple testing of SNP pairs.
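A minimal single-step sketch of the maxT idea (MB-MDR implements the step-down variant; data and dimensions here are hypothetical): the null distribution of the maximum test statistic over all tests is obtained by permuting trait values, which automatically accounts for the correlation between tests:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_tests, n_perm = 200, 50, 199

genotypes = rng.integers(0, 3, size=(n, n_tests))  # toy SNP codings
trait = rng.normal(size=n)                         # trait simulated under the null

def max_abs_t(y):
    """Largest absolute two-group t statistic over all toy tests."""
    out = 0.0
    for j in range(n_tests):
        carrier = genotypes[:, j] > 0
        t, _ = stats.ttest_ind(y[carrier], y[~carrier], equal_var=True)
        out = max(out, abs(t))
    return out

observed = max_abs_t(trait)
null_max = np.array([max_abs_t(rng.permutation(trait)) for _ in range(n_perm)])

# maxT-adjusted p-value for the best-scoring test
p_adj = (1 + np.sum(null_max >= observed)) / (1 + n_perm)
print(f"maxT-adjusted p-value: {p_adj:.3f}")
```

Because the adjustment compares the observed statistic against the permutation distribution of the *maximum*, inflation in any individual internal test does not automatically propagate to the FWER of the final results.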

In many of our practical applications, though, we observed a tendency toward increased numbers of significant epistasis results with MB-MDR applied to quantitative traits, even after SNP pruning (r² below 75%) to avoid potential false positives (or redundant interactions) due to highly correlated SNPs. No such observation was previously made for dichotomous traits, for which MB-MDR uses a score test, in particular Pearson’s chi-squared test. This test is known to be affected by unbalanced or sparse data, as is the case for rare variants [42]. However, these data artifacts, which call into question the use of large-sample distributions for test statistics, are minimized by requiring a threshold sample size for a multilocus genotype combination. An explanation for the discrepancies observed between theoretical results and practical applications may be found in the way the null distribution for multiple testing is derived. It is often forgotten that permutation-based multiple testing corrective procedures also make assumptions. For instance, for the step-down maxT approach as implemented in MB-MDR, the Family-Wise Error Rate (FWER) is strongly controlled provided the assumption of subset pivotality holds [32]. The subset pivotality assumption is needed to ensure that control under a data-generating distribution satisfying the complete null gives the desired control under the true data-generating distribution, which may harbor any number of true nulls [43].
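To illustrate the sparse-data concern for the chi-squared score test (toy table, our own numbers): a rare multilocus genotype yields expected cell counts well below the usual rule-of-thumb threshold of 5, casting doubt on the large-sample approximation:

```python
import numpy as np
from scipy import stats

# Rare multilocus genotype (4 individuals) vs the pooled remainder,
# cross-classified against a dichotomous trait
table = np.array([[3, 1],
                  [500, 496]])

chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
print(f"smallest expected count: {expected.min():.2f}")  # < 5: approximation suspect
```

Requiring a minimum sample size per multilocus genotype combination, as MB-MDR does, keeps such cells out of the test.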

In real-life applications, we do not know *a priori* which nulls are true and which are false. In addition, preliminary results on the effect of linkage disequilibrium on MB-MDR error control, as well as on the effect of highly variable minor allele frequencies (and thus highly variable available sample sizes for multilocus genotypes), show that subset pivotality is likely to be violated in real-life settings, giving rise to inflated error rates in the presence of multiple epistasis signals (work in progress). Note that the standard bootstrap method provides the asymptotically correct null distribution for multiple testing and does not require the subset pivotality condition given in Westfall and Young [32]. Resampling-based multiple testing with asymptotic strong control of the type I error therefore warrants further investigation as a corrective method for multiple testing in MB-MDR.

Scale transformations are quite common as remedial strategies to meet statistical testing assumptions. However, since the optimal scale transformation is often chosen on theoretical grounds or for statistical convenience, it frequently leads to new constructs that are hard to interpret or biologically meaningless. Another concern is that non-additive signals may be removed as a direct consequence of applying such transformations prior to analysis [44].

Our results confirmed that rank-based transformations are generally most powerful when quantitative traits are non-normally distributed. Rank transformations serve as a bridge between non-parametrics and parametrics [19]. They naturally eliminate any problem of skewness (e.g., for a chi-squared distributed trait). Ranking also minimizes the impact of outliers: regardless of how extreme the most extreme observation is, it receives the same rank. A particular type of rank transformation uses percentile ranks, mapped through the standard normal quantile function, and is referred to as rank transformation to normality. In this context, a percentile rank is defined as the proportion of quantitative trait outcomes in a distribution that a specific trait value is greater than or equal to. When the number of ties is negligible, this transformation leads to a near-perfect normal distribution, irrespective of the original trait’s distribution, which is usually a highly desirable property.
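A sketch of the rank transformation to normality on simulated skewed data, mapping percentile ranks through the standard normal quantile function (the particular offset `ranks - 0.5` is one common convention):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
trait = rng.exponential(size=1000)      # skewed original trait, no ties

ranks = stats.rankdata(trait)           # ties would receive average ranks
pct = (ranks - 0.5) / len(trait)        # percentile ranks in (0, 1)
z = stats.norm.ppf(pct)                 # near-perfectly normal scores

# Shapiro-Wilk rejects normality for the raw trait but not for the transform
_, p_raw = stats.shapiro(trait)
_, p_z = stats.shapiro(z)
print(f"Shapiro-Wilk p-value, raw: {p_raw:.1e}, transformed: {p_z:.2f}")
```

Since the transformed values are deterministic normal quantiles, the result is essentially normal whatever the shape of the original trait distribution, as long as ties are rare.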