A classification and characterization of twolocus, pure, strict, epistatic models for simulation and detection
 Ryan J Urbanowicz^{1},
 Ambrose LS GranizoMackenzie^{1},
 Jeff Kiralis^{1} and
 Jason H Moore^{1}Email author
DOI: 10.1186/1756038178
© Urbanowicz et al.; licensee BioMed Central Ltd. 2014
Received: 5 November 2013
Accepted: 23 May 2014
Published: 9 June 2014
Abstract
Background
The statistical genetics phenomenon of epistasis is widely acknowledged to confound disease etiology. In order to evaluate strategies for detecting these complex multilocus disease associations, simulation studies are required. The development of the GAMETES software for the generation of complex genetic models, has provided the means to randomly generate an architecturally diverse population of epistatic models that are both pure and strict, i.e. all n loci, but no fewer, are predictive of phenotype. Previous theoretical work characterizing complex genetic models has yet to examine pure, strict, epistasis which should be the most challenging to detect. This study addresses three goals: (1) Classify and characterize pure, strict, twolocus epistatic models, (2) Investigate the effect of model ‘architecture’ on detection difficulty, and (3) Explore how adjusting GAMETES constraints influences diversity in the generated models.
Results
In this study we utilized a geometric approach to classify pure, strict, twolocus epistatic models by “shape”. In total, 33 unique shape symmetry classes were identified. Using a detection difficulty metric, we found that model shape was consistently a significant predictor of model detection difficulty. Additionally, after categorizing shape classes by the number of edges in their shape projections, we found that this edge number was also significantly predictive of detection difficulty. Analysis of constraints within GAMETES indicated that increasing model population size can expand model class coverage but does little to change the range of observed difficulty metric scores. A variable population prevalence significantly increased the range of observed difficulty metric scores and, for certain constraints, also improved model class coverage.
Conclusions
These analyses further our theoretical understanding of epistatic relationships and uncover guidelines for the effective generation of complex models using GAMETES. Specifically, (1) we have characterized 33 shape classes by edge number, detection difficulty, and observed frequency (2) our results support the claim that model architecture directly influences detection difficulty, and (3) we found that GAMETES will generate a maximally diverse set of models with a variable population prevalence and a larger model population size. However, a model population size as small as 1,000 is likely to be sufficient.
Keywords
Epistasis Models Simulation Genetics GAMETES Computational geometry Convex hullBackground
The phenomenon of epistasis, or genegene interaction, confounds the statistical search for main effects, i.e. single locus associations with phenotype [1]. The term epistasis was coined to describe a genetic ‘masking’ effect viewed as a multilocus extension of the dominance phenomenon, where a variant at one locus prevents the variant at another locus from manifesting its effect [2]. In the context of statistical genetics, epistasis is traditionally defined as a deviation from additivity in a mathematical model summarizing the relationship between multilocus genotypes and phenotypic variation in a population [3]. Alternate definitions and further discussion of epistasis is given in [1, 4–9].
Limited by time and technology, and drawn by the appeal of “low hanging fruit”, it has been typical for genetic studies to focus on single locus associations (i.e. main effects). Unfortunately, for those common diseases typically regarded as complex (i.e. involving more than a single loci in the determination of phenotype) this approach has yielded limited success [10, 11]. The last decade has seen a gradual acknowledgment of disease complexity and greater focus on strategies for the detection of complex disease associations within clinical data [1, 12–14]. Beyond the detection of complex multilocus genetic models, theoretical investigations have also pursued their enumeration, generation, and classification. These theoretical works seek to lay the foundation for the identification and interpretation of multilocus associations as they may appear in genetic studies.
A natural stepping stone towards understanding complex multilocus effects is the examination of twolocus models. Early on, Neuman and Rice [15] considered epistatic twolocus disease models for the explanation of complex illness inheritance, highlighting the importance of looking beyond a single locus. Li and Reich [16] classified all 512 fully penetrant twolocus models, in which genotype disease probabilities (i.e. penetrances) were restricted to zero and one. This work emphasized diversity of complex models beyond the typical twolocus models previously considered by linkage studies. Of these models, only a couple exhibit what was later referred to as “purely” epistatic interactions. Pure refers to epistasis between n loci that do not display any main effects [13, 17–20]. Alternatively, impure epistasis implies that one or more of the interacting loci have a main effect contributing to disease status [19, 20]. Hallgrimsdottir and Yuster [21] later expanded this twolocus characterization to include models with continuous penetrance values. Within a population of randomly generated twolocus models, they characterized 69 “shapebased” classes of impure epistatic models. In addition, they observed that the “shape” of a model (1) reveals information about the type of gene interaction present, and (2) impacts the power (i.e. frequency of success) in detecting the underlying epistasis.
Taking aim at pure epistasis, Culverhouse et. al.[18] described the generation of two to fourlocus purely epistatic models and explored the limits of their detection. Working with a precisely defined class of models such as pure epistasis offered a more mathematically tractable set for generation and investigation. The value of their work was not to suggest that purely epistatic models necessarily reflect real genetic interactions, but rather the ability extrapolate their findings to more likely epistatic models possessing small main effects.
Similar to these earlier works, the present study focus on statistical epistasis, which is the phenomenon as it would be observed in casecontrol association studies, quantitative trait loci (QTL) mapping, or linkage analysis. Exclusively, we focus on a precise subclass of epistasis which we refer to as pure and strict. Strict, conceptually alluded to in [18], refers to epistasis where n loci are predictive of phenotype but no proper multilocus subset of them are [19, 20]. Of note, all twolocus purely epistatic models are strict by default since no other subsets are possible with only twoloci. The loci in pure, strict models could be viewed as “fully masked” in that no predictive information is gained until all n loci are considered in concert. Therefore these models may be considered “worst case” in terms of detection difficulty. While this exact, extreme class of models is unlikely to be pervasive within real biological associations, they offer a gold standard for evaluating and comparing strategies for the detection and modeling of multiple predictive loci.
A handful of studies have introduced methods for generating epistatic models [18, 22–24] including our own Genetic Architecture Model Emulator for Testing and Evaluating Software (GAMETES) [19] designed to randomly generate an architecturally diverse population of pure, strict, epistatic models. Architecture references the unique composition of a model (e.g. the particular penetrance values and arrangement of those values across genotypes). Additionally, in [20] an Ease of Detection Measure (EDM) was introduced and incorporated into GAMETES offering a predictor of model detection difficulty calculated directly from the penetrance values and genotype frequencies of a given genetic model. Previously we demonstrated that a 2locus model’s EDM was more strongly and significantly correlated with the detection power than heritability or any other metric considered. Detection power was determined separately using three very different, cutting edge data search algorithms in order to establish EDM calculation as a simple alternative to completing model detection power analyses.
In the present study we refine the characterization of twolocus models described in [16] and [21] to a more specific subset of models defined as having pure, strict, epistasis. We generate these models using GAMETES [19, 20] and apply the geometric approached used in [21] to similarly identify shape model classes. Next, we examine whether model EDM scores (a surrogate measure of detection difficulty) differs between these shape groups as well as between groups with the same number of edges in their projected shapes. Then, we evaluate the impact of GAMETES model population as well as the effect of fixing population prevalence (K) or allowing it to vary randomly on observed model shape coverage and EDM score range. This study expands our theoretical understanding of a particularly challenging class of multilocus models and suggests novel insight into the effective generation of complex models with GAMETES.
Methods
In this section, we describe (1) the modeling of epistasis with GAMETES, (2) the triangulation of model shape (3) our experimental evaluation.
Modeling 2Locus pure strict epistasis
Single nucleotide polymorphisms or (SNPs) are loci in the DNA sequence which can serve as markers of phenotypic variation. The term genotype has been used to refer both to the allele states of a single SNP, as well as the combined allele states of multiple SNPs. Herein, we will refer to the latter as a multilocus genotype (MLG) whenever necessary.
Penetrance functions represent one approach to modeling the relationship between genetic variation and a dichotomous trait. Penetrance is the probability of disease, given a particular genotype or MLG. Our models assume Hardy Weinberg equilibrium such that, the allele frequencies for a SNP may be used to calculate it’s genotype frequencies as follows; freq(AA) = p ^{2}, freq(Aa) = 2pq, and freq(aa) = q ^{2}, where p is the frequency of the major (more common) allele ‘A’, q is the minor allele frequency (MAF) where ‘a’ is the minor allele, and p + q = 1. Penetrance functions may be constructed to describe nlocus interactions between n predictive loci using a penetrance function comprised of 3^{ n } penetrance values corresponding to each of the 3^{ n } MLGs.
A 2locus purely epistatic penetrance function
SNP 2  Marginal  

Genotype  BB (.25)  Bb (.5)  bb (.25)  Penetrance  
AA (.36)  .266  .764  .664  .614  
SNP 1  Aa (.48)  .928  .398  .733  .614 
aa (.16)  .456  .927  .147  .614  
Marginal  .614  .614  .614  K =.614  
Penetrance 
Classic, fully penetrant 2locus model of pure epistasis
SNP 2  Marginal  

Genotype  BB (.25)  Bb (.5)  bb (.25)  Penetrance  
AA (.25)  0  1  0  .5  
SNP 1  Aa (.5)  1  0  1  .5 
aa (.25)  0  1  0  .5  
Marginal  .5  .5  .5  K =.5  
Penetrance 
The GAMETES strategy for generating random, nlocus, pure, strict epistatic models is briefly reviewed here. Each nlocus model is generated deterministically, based on a set of pseudo random parameters, a randomly selected direction, and specified values of heritability, MAFs, and population prevalence (K). The GAMETES algorithm first (1) generates 2^{ n } random parameters and a random unit vector in ${\mathbb{R}}^{{2}^{n}}\phantom{\rule{0.3em}{0ex}}$, then (2) generates a random prepenetrance function by seeding these parameters using the unit vector, and then (3) uses a scaling function to scale the entries of this random prepenetrance function to generate a random penetrance function. To obtain a random penetrance function having a specified heritability, or heritability and K, it further (4) scales the entries of this penetrance function to achieve, if possible, these values. If steps (1) or (4) are not successful the algorithm starts over, attempting to generate models until either the desired model population size or the iteration limit is reached. For a detailed explanation of this strategy see [19].
EDM is utilized by GAMETES to select model architectures that span the range of predicted difficulties [20]. This allows for the design of a simulation study which diversifies model architecture based on detection difficulty. First we generated a population of pure, strict, epistatic models of random architecture sharing commonly specified genetic constraints (i.e. number of loci, heritability, MAFs, and K). GAMETES allows the user to specify a population size of models from which some will be selected to generate simulated genetic datasets. Certain constraint combinations may yield few or no viable models [19]. Therefore, GAMETES runs until either the desired population size or a maximum attempt limit is reached. Once one of the aforementioned stopping criteria is met, all models (each with the same constraints) were ordered by their EDM. At this point, GAMETES select some number of models to represent the range of observed EDMs. By default, GAMETES selects two models from this distribution, representing the highest and lowest EDM scores. A higher EDM indicates that a given model will be easier to detect than a model with a lower EDM. For the purposes of this study, we directed GAMETES to instead report the entire population of models generated by GAMETES.
Shapes of twolocus models
The triangulation, or shape, of a model is used here to generalize it’s architecture and offer a classification of the type of interaction present. This geometric classification of epistasis was first applied to haploid models in [25], and extended to diploid twolocus QTL models in [21]. Overall, our approach was similar to [21], except that we used Qhull [26] as opposed to TOPCOM [27] to compute triangulations of the models.
As in [16] and [21] we take symmetry into account when defining shape classes. Symmetry is determined by (1) interchanging locus 1 and locus 2, or (2) interchanging two alleles at one or both loci. In [21], shape classes were further characterized by circuits (i.e. linear combinations of penetrance values) which decompose the main and epistatic effects of a model. In the present study, all models being classified are purely epistatic, having no main effects to decompose. Circuits could be used to decompose the types of interaction effects characterizing a model however this is beyond the scope of the present study.
Experimental evaluation
We use GAMETES to generate differently sized populations of pure, strict, twolocus epistatic models possessing different constraint combinations. Specifically, populations were generated for heritabilities of 0.005, 0.01, 0.025, 0.05, 0.1, or 0.2, MAFs of 0.2 or 0.4 and with population prevalence (K) either fixed to 0.3 or allowed to vary to any value between 0 and 1. Thus, a total of 24 constraint combinations were considered (6 heritabilities, ∗ 2 MAFs ∗ 2 prevalence settings). Heritability and MAF constraints were seletect to be consistent with previous work using GAMETES [19, 20], and the K value of 0.3 was selected based on the limits described in [19] to ensure that the specified combinations of heritability and MAF would yield models. We explore a variable K since a specific population prevalence rarely of interest in simulation studies, and previous findings in [19] indicated that a variable K facilitated viable model discovery in GAMETES.
For each constraint combination above, GAMETES was used to generate a population of models of sizes 1,000, 10,000, and 100,000 yielding a total of 72 different populations of models. All together, 2,664,000 models were generated which is similar in magnitude to the 1,000,000 random models examined in [21]. Within each of these 72 populations we characterize all model shapes as previously described. Additionally, we further generalize model shape by categorizing models by the number of edges as well as the number of polygons (triangles) existing within it’s shape class.
Observations in [21] suggested that the power to detect randomly generated, impure epistatic models was correlated with model shape. Extending these findings, we examine how model detection difficulty differs between shape classes observed in populations of pure, strict, epistatic models. We utilize EDM as a surrogate for detection difficulty or power, where power is used to describe the frequency of successful detection of a model. EDM is calculated directly from the penetrance function circumventing the need to generate simulated datasets and perform a secondary evaluation of power. The nonparametric KruskalWallis test [28] was used to evaluate whether model EDM significantly differed within separate shape classes, as well as between groups defined by the number of edges in the model projection. MannWhitney pairwise comparisons were subsequently utilized to look for EDM differences between models with a specific number of projection edges.
Results and discussion
Edge numbers in shape classes
Number of edges  Associated class ID’s 

1  4 
2  6,9 
3  1,2,3,14 
4  5,10,11,13,16,18,21,25 
5  7,8,12,15,17,19,20,22,23,24 
6  26,27,28,29,30,31,32,33 
Similar figures for populations of all three sizes and a variable K (instead of a fixed K) are given in the Additional file 1 (Figures S2, S3 and, S4). As for the fixed K populations, within the variable K populations KruskalWallis testing confirmed that the EDMs of models found in different shape groups were significantly different (P <<0.001). However we observed one key difference when allowing K to vary. Specifically, the models generated were more evenly distributed across different shape classes, which simultaneously improved class coverage such that more shape classes were represented within each of the 12 constraint combination populations with variable K than when compared to fixed K. This finding is intuitive since allowing K to vary is less mathematically restrictive for penetrance function generation, and thus we would expect diversity.
Conclusions
This study pursued three goals: (1) Classify and characterize pure, strict 2locus epistatic models, (2) explore the relationship between model architecture and detection difficulty in the generalized context of model shape, and (3) explore the maintenance of model architecture diversity in GAMETESgenerated model populations to establish guidelines for effective complex model generation. Our focus on such a precise, challenging class of epistatic models lends itself to both simulation studies, in which a gold standard for algorithmic evaluation is desirable, and to real world model detection where our characterization of a more mathematically tractable class of epistasis may facilitate the characterization of interaction in an observed biological model.
Our geometric classification of pure, strict, twolocus epistasis model shapes revealed 33 unique shape symmetry classes having 1 to 6 edges. This is in contrast to the 69 symmetry classes having 1 to 8 edges identified in randomly generated impure epistatic models [21]. The shape of a twolocus model may be used to classify and offer a visual representation of the type of gene interaction present. This classification of model shape might be applied to epistatic models identified in real world analyses in order to determine associated shape classes and EDM scores. It would be interesting to explore whether the model shapes of real world interactions tended to correspond with shapes that GAMETES generates more frequently by chance.
Having previously demonstrated the ability of EDM to predict detection power in [20], the present evaluation of model EDM transitively indicates that the shape of pure, strict epistatic models significantly influences detection difficulty. This is also in line with the claim made by [21] that an epistatic model’s shape impacts the power to detect it. Additionally, this finding highlights the importance of taking model architecture into consideration when generating models and datasets for the evaluation of algorithms. Further characterization of shape classes, grouped by number of edges in the shape projection, reveals that the number of edges alone is also predictive of relative detection difficulty.
The experimental variation of population size and K revealed the overall consistency of the GAMETES model generation strategy. GAMETES was designed to generate a random population of genetic models with diverse model architectures. We gauge this diversity on (1) the range of model EDMs (maximum and minimum observed EDM values), and (2) the number of shape classes observed in the generated population of models. Maintaining model diversity is ideal when constructing a simulation study which efficiently covers the space in which real biological disease associations could appear. Our results suggest that when generating models for a simulation study GAMETES effectively maintains diversity even at population size of only 1,000. However for maximal architectural diversity, a population size of 100,000 combined with a variable K is optimal. Combining GAMETES with geometric model classification could be used to select models for a simulation study which are both representative of each model class and representative of maximum and minimum EDM values. By developing better simulation study designs we hope to encourage the development of better complex model detection algorithms.
As suggested by [21], the geometric classification scheme applied in this paper could be expanded to three or more loci, as well as to QTL studies (by scaling penetrance values outside the range 01, such that values at MLGs become expected phenotype magnitudes). It is crucial not only to develop techniques for the detection of complex multilocus interactions, but to develop a theoretical understanding of epistasis. This work will promote effective simulation studies and facilitate the characterization of observed realworld interactions by better defining the charteristics of one group of complex epistatic models to which others may be compared.
Abbreviations
 SNP:

Single nucleotide polymorphism
 GAMETES:

Genetic architecture model emulator for testing and evaluating software
 MLG:

Multilocus genotype
 MAF:

Minor allele frequency
 K:

Population prevalence
 EDM:

Ease of detection measure.
Declarations
Acknowledgements
This work was supported by NIH grants AI59694, LM009012, LM010098, EY022300, LM011360, CA134286, and GM103534.
Authors’ Affiliations
References
 Cordell H:Epistasis: what it means, what it doesn’t mean, and statistical methods to detect it in humans. Hum Mol Genet. 2002, 11 (20): 2463View ArticlePubMedGoogle Scholar
 Bateson W, Mendel G:Mendel’s principles of heredity. Putnam’s. 1909,Google Scholar
 Fisher R:The correlation between relatives on the supposition of mendelian inheritance. Trans R Soc Edinburgh. 1918, 52: 399433.View ArticleGoogle Scholar
 Cheverud J, Routman E:Epistasis and its contribution to genetic variance components. Genetics. 1995, 139 (3): 1455PubMedPubMed CentralGoogle Scholar
 Frankel W, Schork N:Who’s afraid of epistasis?. Nat Genet. 1996, 14 (4): 371373.View ArticlePubMedGoogle Scholar
 Phillips P:The language of gene interaction. Genetics. 1998, 149 (3): 1167PubMedPubMed CentralGoogle Scholar
 Wade M, Winther R, Agrawal A, Goodnight C:Alternative definitions of epistasis: dependence and interaction. Trends Ecol & Evol. 2001, 16 (9): 498504.View ArticleGoogle Scholar
 Moore J, Williams S:Traversing the conceptual divide between biological and statistical epistasis: systems biology and a more modern synthesis. Bioessays. 2005, 27 (6): 637646.View ArticlePubMedGoogle Scholar
 Moore J, Williams S:Epistasis and its implications for personal genetics. Am J Hum Genet. 2009, 85 (3): 309320.View ArticlePubMedPubMed CentralGoogle Scholar
 Shriner D, Vaughan L, Padilla M, Tiwari H:Problems with genomewide association studies. Science. 2007, 316 (5833): 1840cView ArticleGoogle Scholar
 Eichler E, Flint J, Gibson G, Kong A, Leal S, Moore J, Nadeau J:Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010, 11 (6): 446450.View ArticlePubMedPubMed CentralGoogle Scholar
 McKinney B, Reif D, Ritchie M, Moore J:Machine learning for detecting genegene interactions: a review. Appl Bioinform. 2006, 5 (2): 7788.View ArticleGoogle Scholar
 Cordell H:Detecting gene–gene interactions that underlie human diseases. Nat Rev Genet. 2009, 10 (6): 392404.View ArticlePubMedPubMed CentralGoogle Scholar
 Moore J, Asselbergs F, Williams S:Bioinformatics challenges for genomewide association studies. Bioinformatics. 2010, 26 (4): 445View ArticlePubMedPubMed CentralGoogle Scholar
 Neuman R, Rice J:Twolocus models of disease. Genet Epidemiol. 1992, 9 (5): 347365.View ArticlePubMedGoogle Scholar
 Li W, Reich J:A complete enumeration and classification of twolocus disease models. Hum Hered. 2000, 50 (6): 334349.View ArticlePubMedGoogle Scholar
 Brodie III E:Why evolutionary genetics does not always add up.Epistasis Evol Process. 2000, New York: Oxford University Press, 319.Google Scholar
 Culverhouse R, Suarez B, Lin J, Reich T:A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet. 2002, 70 (2): 461471.View ArticlePubMedPubMed CentralGoogle Scholar
 Urbanowicz RJ, Kiralis J, SinnottArmstrong NA, Heberling T, Fisher JM, Moore JH:GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData Min. 2012, 5: 114.View ArticleGoogle Scholar
 Urbanowicz RJ, Kiralis J, Fisher JM, Moore JH:Predicting the difficulty of pure, strict, epistatic models: metrics for simulated model selection. BioData Min. 2012, 5: 113.View ArticleGoogle Scholar
 Hallgrímsdóttir IB, Yuster DS:A complete classification of epistatic twolocus models. BMC Genetics. 2008, 9: 17View ArticlePubMedPubMed CentralGoogle Scholar
 Moore J, Hahn L, Ritchie M, Thornton T, White B:Application of genetic algorithms to the discovery of complex models for simulation studies in human genetics. Proceedings of the Genetic and Evolutionary Computation Conference. 2002, NIH Public Access, 11551155.Google Scholar
 Moore J, Hahn L, Ritchie M, Thornton T, White B:Routine discovery of complex genetic models using genetic algorithms. Appl Soft Comput. 2004, 4: 7986.View ArticlePubMedPubMed CentralGoogle Scholar
 Greene C, Himmelstein D, Moore J:A model free method to generate human genetics datasets with complex genedisease relationships. Evol Comput Mach Learn Data Min Bioinformatics. 2010, 6023: 7485.View ArticleGoogle Scholar
 Beerenwinkel N, Pachter L, Sturmfels B:Epistasis and shapes of fitness landscapes. Stat Sinica. 2007, 17: 13171342.Google Scholar
 Barber C, Huhdanpaa H:Qhull, Softwarepackage. 1995,Google Scholar
 Rambau J:TOPCOM: Triangulations of point configurations and oriented matroids. Mathematical software: proceedings of the first International Congress of Mathematical Software: Beijing, China, 1719 August 2002. 2002, Imperial College Pr, 330340.Google Scholar
 Kruskal W, Wallis W:Use of ranks in onecriterion variance analysis. J Am Stat Assoc. 1952, 47 (260): 583621.View ArticleGoogle Scholar
 Hahn LW, Ritchie MD, Moore JH:Multifactor dimensionality reduction software for detecting gene–gene and gene–environment interactions. Bioinformatics. 2003, 19 (3): 376382.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.