This article has Open Peer Review reports available.
Evolutionary triangulation: informing genetic association studies with evolutionary evidence
© Huang et al. 2016
Received: 18 June 2015
Accepted: 29 March 2016
Published: 2 April 2016
Genetic studies of human diseases have identified many variants associated with pathogenesis and severity. However, most studies have used only statistical association to assess putative relationships to disease, and ignored other factors for evaluation. For example, evolution is a factor that has shaped disease risk, changing allele frequencies as human populations migrated into and inhabited new environments. Since many common variants differ among populations in frequency, as does disease prevalence, we hypothesized that patterns of disease and population structure, taken together, will inform association studies. Thus, the population distributions of allelic risk variants should reflect the distributions of their associated diseases. Evolutionary Triangulation (ET) exploits this evolutionary differentiation by comparing population structure among three populations with variable patterns of disease prevalence. By selecting populations based on patterns where two have similar rates of disease that differ substantially from a third, we performed a proof of principle analysis for this method. We examined three disease phenotypes, lactase persistence, melanoma, and Type 2 diabetes mellitus. We show that for lactase persistence, a phenotype with a simple genetic architecture, ET identifies the key gene, lactase. For melanoma, ET identifies several genes associated with this disease and/or phenotypes related to it, such as skin color genes. ET was less obviously successful for Type 2 diabetes mellitus, perhaps because of the small effect sizes in known risk loci and recent environmental changes that have altered disease risk. Alternatively, ET may have revealed new genes involved in conferring disease risk for diabetes that did not meet nominal GWAS significance thresholds. We also compared ET to another method used to filter for phenotype associated genes, population branch statistic (PBS), and show that ET performs better in identifying genes known to associate with diseases appropriately distributed among populations. Our results indicate that ET can filter association results to improve our ability to discover disease loci.
As humans moved out of Africa, population patterns of genetic variation changed dramatically, due to both random (e.g., genetic drift that can cause serial founder events) and non-random (environment-specific selection) processes. The histories of the different populations have therefore resulted in substantial differences among populations in allele frequencies at loci throughout the genome . Many, if not most of these differences, will have limited effect on phenotypic variation, but some allelic substitutions may have implications for differences in phenotypic variation among populations that may help inform us in the search for genes of biomedical significance. Specifically, it could be hypothesized that the alleles that affect disease risk in multiple populations should be distributed in ways that are consistent with differences in population prevalence, such that risk alleles should be more frequent in those populations where a given disease is more prevalent and less frequent where the disease is relatively rare.
In this manuscript, we present a heuristic approach that incorporates genetic and epidemiological information – namely, known disease distributions with pairwise allele frequency differences among three populations – to filter association results from other genetic epidemiological analyses, such as genome wide association studies (GWAS). We call this approach Evolutionary Triangulation (ET), as it represents comparisons of three genetically distinct populations, assayed simultaneously. To assess levels of differentiation, we use Wright’s FST , which is a metric that provides an estimate of the level of allele frequency differences among populations. To assess the performance and to determine the limits of our approach, we applied ET to a number of phenotypes that vary in presumed genetic architecture, ranging from an essentially Mendelian trait (lactase persistence), to diseases of increasing complexity, i.e., melanoma, representing an oligogenic disease, and fasting glucose/Type 2 diabetes mellitus as a highly complex disease with many genes of small effect.
Index phenotype and population selection
Genetic associations of the 85th/15th threshold ET genes with index diseases/traits (CEU-GIH-YRI)
Diabetes Mellitus, Type 2/Glucose Intolerance/Insulin Resistancec
Triangulation for ET SNPs and mapping to ET genes
We chose representative HapMap populations based on the prevalences of the index phenotypes (Additional file 1: Table S1). We obtained all SNP allele frequency data for unrelated individuals from the selected populations in the International HapMap Project Phase III , which included 113 Utah residents with Northern and Western European ethnicity (CEU), 84 Han Chinese from Beijing, China (CHB), 88 Tuscans from Italy (TSI), 88 Gujarati Indians from Houston (GIH), 113 Yoruba from Ibadan, Nigeria (YRI), 86 Japanese in Tokyo, Japan (JPT), and 50 Mexican Americans in Los Angeles, California (MEX). CEU was a proxy for Northern European populations, CHB and JPT for East Asian populations, TSI for Southern European populations, GIH for South Asian populations, MEX for Central American Hispanic populations, and YRI for West African populations.
Genetic association mining and identification of additional phenotypes
Genetic associations of ET genes with diseases were analyzed based on data retrieved from the HuGE Navigator  (http://www.cdc.gov/genomics/hugenet/hugenavigator.htm), a human genome epidemiology knowledgebase that incorporates information from PubMed abstracts as its core data source. We defined a “true” association as one for which a gene had: (1) five or more HuGE publications significantly associating it with a phenotype using the Genopedia function and p < 0.05 in any combination of studied populations, or (2) GWAS evidence with a P value of less than 5 × 10−8 reported in HuGE Navigator. Although negative GWAS results have been shown to correlate with geography and FST , we chose to be conservative, ignoring negative results as this is a proof of principle.
Additional phenotypes were noted if they too associated with ET genes in the HuGE Navigator database. Other associating diseases were then assessed for their global distributions to assess whether they are also similarly distributed to the index phenotypes (Additional file 1: Table S1).
Random resampling analysis
To determine whether ET significantly enriched for genes that associated with the index or other appropriately distributed phenotypes, we performed permutation testing. Specifically, for each ET threshold we sampled from the genome the same number of genes that were identified to be associated with appropriately distributed disease using ET. We then verified using the same criteria as for ET how many of these genes were associated with appropriately distributed disease based on the continental ancestry for a given comparison. We used the empirical distribution to determine the p- value of the ET analyses. The empirical distributions were determined by sampling the genome 10,000 times without replacement. We established the number of times that the same number or more such genes were found by this random process.
ET comparisons among populations were chosen to reflect the distribution of several phenotypes, ranging from genetically simple to increasingly complex, that differ in prevalence among HapMap populations. Estimated genetic complexity was based on the number of GWAS identified genes and their effects sizes, extracted from the NHGRI GWAS catalog (Additional file 2: Table S2). The index phenotypes we analyzed were, in order of presumed increasing genetic complexity, lactase persistence, melanoma, and Type 2 diabetes mellitus/fasting glucose. These phenotypes are likely related to evolutionary adaptations to different environments, making the integration of evolutionary data more promising.
Lactase persistence is the ability to digest lactose in adulthood and is the inverse of lactose intolerance . More than 90 % of Northern European individuals have lactase persistence. In contrast, about half of the population of Southern Europe, and most Asians and Africans, have lactase persistence, as milk has been rarely consumed in these regions (Additional file 1: Table S1). Evolutionary evidence indicates that there has been strong selective pressure in populations with animal domestication and adult milk consumption to maintain the ability to metabolize lactose after weaning .
The genetics of lactase persistence is relatively simple. The primary gene that affects the phenotype is lactase (LCT) in the LCT-MCM6 locus on chromosome 2 , and the identified variations are primarily cis-acting elements . Interestingly, the variants associated with lactase persistence in European and African pastoralist populations are different, even though they are within the same upstream region of LCT . In both cases, the evidence indicates that all variants are under strong selection corresponding to levels of animal milk consumption and pastoralist culture.
To assess the sensitivity of ET to varying levels of FST thresholds, we used multiple thresholds ranging from the 80th/20th to the 90th/10th percentile. The significant over-representation of the LCT-MCM6 signal still held under less stringent thresholds. Under the 90th/10th threshold, we obtained 29 ET SNPs and 14 genes (Additional file 3: Table S3b). Twenty of these SNPs were within ± 100 Kb of LCT-MCM6 region. Under the 85th/15th threshold, we generated 33 ET SNPs and 14 genes (Additional file 3: Table S3c). Again, twenty of these SNPs were within ± 100 Kb of LCT-MCM6 region. Even using the 80th percentile threshold, with FST(CEU_CHB) and FST(CEU_TSI), and the 20th percentile threshold, with FST(TSI_CHB), we generated 50 ET SNPs, of which 23 were within ± 100 Kb of LCT-MCM6 region. (Additional file 3: Table S3d). FST values for other thresholds and populations are presented in Additional file 5: Table S4.
Based on the prevalence of lactase persistence in other populations  (Additional file 1: Table S1), we also applied ET to CEU, YRI and GIH. Highly differentiated SNPs between CEU and GIH and between CEU and YRI were compared to highly similar SNPs between GIH and YRI. With the 95th/5th threshold, we identified 22 SNPs, of which two were within ± 100 Kb of the LCT-MCM6 region (Additional file 6: Table S5). Additionally, using the 95th/5th threshold, we were able to identify the LCT-MCM6 gene region in comparisons of CEU to CHB and LWK (HapMap East African), both of which do not have a large amount of lactase persistence (data not shown).
Two way comparison for lactase persistence
Melanoma and other diseases correlated with latitude
Melanoma was our second selected index phenotype. About 40 genes have been associated with melanoma, according to HuGE Navigator, making it substantially more complex than lactase persistence. The prevalence of melanoma varies with latitude , a relationship likely due to dark skin being favored in regions of high UV radiation, where it protects from skin damage, while still allowing adequate vitamin D to be synthesized. At the same time, dark skin limits UV degradation of folate in underlying skin layers, and low folate levels are associated with increased risk of neural tube defects in utero . In contrast, light skin is more common at high latitudes where its presence promotes vitamin D biosynthesis in environments with low ultraviolet radiation . Melanoma is relatively common in populations of European descent, but its prevalence is much lower in populations of South Asian and African descent (Additional file 1: Table S1). Thus, applying ET analysis to melanoma using the CEU, YRI and GIH populations (with CEU as the outlier population) fits the population prevalence requirements for ET filtering.
Under the stringent 95th/5th percentile threshold we obtained 22 ET SNPs that mapped to 33 ET genes (Additional file 6: Table S5a). One of these 33 genes, SLC45A2, has previously been shown to associate with melanoma . The ET SNP that identified this locus, rs28117, is bounded both proximally and distally by recombination hotspots, thereby providing additional resolution (Additional file 7: Figure S2). Under the 90th/10th percentile threshold, we generated 168 ET SNPs which mapped to 230 ET genes. Under the least stringent 85th/15th percentile threshold, we generated 733 ET SNPs that mapped to 971 ET genes, four of which are known to associate with melanoma (Table 1, Additional file 6: Table S5b, S5c).
This analysis also identified numerous other genes associated with diseases that are distributed appropriately among these three populations. One example is multiple sclerosis (MS), a complex disease with more than 80 genes that have been associated with susceptibility, but each of these variants confers only a very small change in risk. Collectively, these genes explain approximately 20 % of the genetics of MS [16, 17]. Similar to melanoma, the prevalence of MS is associated with latitude; the most prominent correlating factor is ultraviolet radiation/vitamin D . Recently, by showing positive selection on MS associated loci, researchers proposed that MS might have arisen due to pleiotropic effects of host resistance to pathogens over the course of human history, where there has been significant selective pressure acting to increase this resistance to pathogens . Under the 90th/10th percentile threshold, we were able to identify one gene, IL6, previously associated with MS. IL6 is of particular interest as it has been associated with 15 diseases that are similarly distributed, several of which are related to immune function, indicating strong pleiotropic effects (Additional file 8: Table S6).
The ET analysis for melanoma also identified genes associating with many other similarly distributed phenotypes (Additional file 8: Table S6). For example, ET detected LCT under the most stringent 95th/5th FST percentile threshold, consistent with the distribution of lactase persistence. Besides lactase persistence, ET genes under the least stringent 85th/15th threshold have also been associated with other Mendelian disorders having appropriate distributions among the three populations, such as oculocutaneous albinism, glucosephosphate dehydrogenase deficiency and Smith-Lemli-Opitz Syndrome. ET was able to capture the key genes that cause these diseases, namely OCA2, G6PD and DHCR7. Other disorders have also been associated with these ET genes, but the association is not as direct. For example, G6PD has been associated with favism caused by G6PD deficiency that can affect survival of the malaria parasite, Plasmodium falciparum . Malaria is much more common in South Asia and Africa than it is in northern Europe, highlighting another appropriately identified genetic association. ET was also able to capture genes that have been associated with complex diseases having appropriate distributions among the three populations, for example, Vitamin D deficiency, which is strongly correlated with skin color. DHCR7 and CYP24A1 have the largest reported effect sizes for Vitamin D deficiency and were found using ET . Interestingly, another ET gene, BNC2, has also been associated with hair and skin color in a GWAS study . As melanoma is strongly associated with skin color, we expect that genes affecting skin color would also appear. The evolutionary factors that drive these phenotypes are likely common, emphasizing the ability of ET to detect genes that are pleiotropic. ET genes have also been associated with other complex diseases, such as pulmonary tuberculosis, multiple cancers, and Alzheimer’s disease, all of which have relatively large effect sizes (Additional file 8: Table S6). ET comparisons also allowed us to identify key immune regulators, such as IFNG and IL6, which associate with many phenotypes.
Type 2 diabetes mellitus/fasting glucose
Fasting glucose, although measured as a continuous trait, is reflective of Type 2 diabetes mellitus prevalence. Because diabetes case criteria may differ among studies based on variable diagnostic standards we chose to use fasting glucose as our index phenotype, as it is a more objective but highy correlated trait representing the complex genetic architecture of Type 2 diabetes. In the Mexican population, the average fasting glucose level is higher than in Asian and Northern European populations. For this analysis, we used the populations MEX, JPT, and CEU for ET analyses based on current epidemiological data.
We identified no genes that have been associated with Type 2 diabetes mellitus, according to our criteria. Because of this inability to identify appropriate genes we expanded our search to include genes that were within 250 Kb of each ET SNP. Although we discovered no genes associated with Type 2 diabetes mellitus using the most stringent threshold, we did find one gene with the 90th/10th threshold, BHMT, that has been putatively associated with diabetes (β = 0.26; p value 5.9 × 10−3) .
Under the 95th/5th percentile threshold, we found only seven ET SNPs which mapped to four genes. None of these four genes had evidence of genetic association with any phenotype, to date. Using the 90th/10th percentile threshold, we identified 164 ET SNPs that mapped to 115 genes. Again, none of these genes were associated with any disease having the expected population distribution. Under the 85th/15th percentile threshold, we identified 407 ET SNPs that mapped to 233 genes. Of these genes, two had prior evidence of association with diseases with the appropriate population distributions. Both genes, XG and NLGN4X, are related to autism spectrum disorders that are also appropriately distributed among these populations (Additional file 1: Table S1). The p value of the most significant SNP in XG was 3.79 × 10−8 in a family based GWAS . In contrast, in NLGN4X the smallest p value was 0.024 in a candidate gene study . We also found a gene, SHROOM2, that is associated with an Alzheimer’s disease endophenotype, but the GWAS study describing this association only generated a p value of 0.0003 .
Enrichment of ET genes relative to randomly sampled genes
Significant Association of ET Genes (CEU, GIH, YRI) with diseases/traits
Number of ET SNPs
Number of ET genes
Number of diseases appropriately distributed among ET populations
Number of ET genes associated with appropriately distributed diseases
Permutation P value
Comparison to population branch statistic
An existing method, Population Branch Statistic (PBS) also integrates evolutionary comparisons among three populations to search for candidate genes and has been successful in identifying genes related to high altitude adaptation . PBS uses two populations with known differences in phenotype prevalence and a third random outlier population to identify genes likely to associate with a predefined trait. Variants that show high allele frequency differentiation in only one of the two pre-selected populations are assumed to be under selection, reflected by high PBS scores. Unlike our method, there is no phenotype prevalence restriction on the outlier population. We compared ET with PBS to determine if adding the phenotypic prevalence in the third population increases our ability to find disease associating genes. To do this, we first chose CEU and YRI as the two pre-selected populations, then ran PBS analysis using all other available HapMap III populations as the outlier. We selected the top 22 SNPs from PBS (based on the most stringent CEU-YRI-GIH comparison from ET) to assess the performance against ET. SNPs with the highest PBS scores were picked for each population and then were mapped to genes within ± 100 Kb (Additional file 9: Table S7). To determine association, we applied the same criteria as described above for our ET analyses. We found that, at most, three genes, SLC12A1, SLC45A2 and AMACR, are associated with phenotypes that are appropriately distributed between European and West African populations, two of which had been observed using ET. In other analyses using MEX, TSI, GIH, ASW, LWK and MKK as outlier populations, only one appropriate gene was identified, SLC12A1. Using CHD and JPT as outlier populations, we identified no genes associated with differentially distributed disease rates. In contrast, ET was able to identify four genes relevant to diseases (Table 2). It is likely that ET performed better in most cases because we explicitly added a third population with known disease prevalences and this additional information appeared to increase our resolving power.
ET’s performance is related to effect size
In this manuscript, we addressed whether population differences shaped by evolutionary histories can be used to identify or enrich for genes associated with phenotypes that are differentially distributed. We hypothesized that comparing three populations, using the ET algorithm, can enrich for variants or genes that can be compared to standard association results, helping define those of greater interest with respect to phenotype. This, in essence, is an additional filtering method that may allow us to relax standard association significance thresholds, such that we can identify putative disease associating variants that can be targeted for follow-up, even if they do not reach standard genome-wide association significance. In our analyses, we focused on three different phenotypes that are both differentially distributed and have putatively different genetic architectures. ET’s performance appeared to differ among these phenotypes. For the trait with the simplest genetic architecture, lactase persistence, ET was robust; ET defined the causative genetic region using several population comparisons that matched the distribution of lactase persistence [9, 28]. For likely oligogenic phenotypes such as melanoma, ET performed well. Assuming that the number of GWAS hits in each of these two diseases is correlated with the complexity of the genetic models (one for lactase persistence and 22 for melanoma), we can see that as the effect sizes decrease and the number of associating genes increases, which, presumably, is correlated with increasing complexity of the genetic architecture, ET was less effective in identifying previously known genes [29, 30]. It may also be possible that ET is best suited to detect loci that may not only be associating with diseases that are differentially distributed, but with loci that differentiated due to selection, as is the case with genes for lactase persistence and melanoma. Interestingly, standard GWAS found multiple signals for lactase persistence, but after adjusting for ancestry all except LCT, which was identified by ET, became non-significant, demonstrating that ET does not conflate ancestry with association .
Regardless of the FST threshold, ET clearly showed an enrichment for melanoma genes compared to the random resampling analysis, emphasizing its utility. It is also interesting that ET did significantly enrich for genes associated with several other diseases that shared the same prevalence pattern and are presumably related to melanoma. Multiple sclerosis, although sharing the same geographic distribution as melanoma, does not appear to be as good a candidate disease for ET analysis. However, MS appears to be more genetically complex based on the number and effect sizes of genes already identified as associating with it (Additional file 2: Table S2). Although we did identify some genes related to MS, we were not able to see an enrichment for them. MS also appears to be more affected by environmental variation, and risk can change with migration during early adolescence, thereby serving as a modifier of genetic risk . Another important finding from ET in the melanoma analyses is the identification of G6PD. This is of particular interest as G6PD deficiency is known to protect from malarial infection [20, 32–34], although it was not found in the GWAS for malaria . We do, however, note that the failure to identify G6PD in the original malaria GWAS is likely an artifact of poor coverage in this genic region as well as low frequency of the protective alleles in the studied populations. Nonetheless, this demonstrates the potential of ET, to not only refine, but in some cases, to identify important signals missed by standard association that are included in publically available databases but absent from genotyping arrays.
For phenotypes that are polygenic, with putative small effect sizes such as fasting glucose or Type 2 diabetes mellitus, ET’s utility appeared to be more limited. Specifically, the role of environment in confounding the ability of ET to identify risk genes may be seen in diabetes, as the prevalence of Type 2 diabetes is increasing, presumably because of the adoption of lifestyles that promote it. Therefore, even if there are strong diabetes genes in a few environments, their action may not be detectable in many modern human societies where the effect of the environment has become preeminent and time has been inadequate to differentiate genes affecting this phenotype. We may, therefore, conclude that as the environment’s recent role in disease increases, the ability of ET to identify key genes will decrease. This should correlate with the length to which a population has been exposed to a new environment. This is not a feature unique to ET, as this will be the case even with standard association approaches. Nonetheless, even though ET and standard association methods differ substantially in their underlying assumptions and metrics, they may still be used to support each other . In cases where there is significant locus heterogeneity among cohorts, ET may be extremely useful in helping to identify interesting associations in subsets of data that are often ignored in large GWAS meta-analyses. Therefore, despite that fact that ET did not enrich for known diabetes genes, it may have enriched for previously undetected diabetes genes, and can be used to provide additional evidence for follow-up. As with other approaches, ET is not as useful for detecting variants that are rare in all populations, as association offers low power in detecting differences. However, we emphasize that all current population-based designs suffer from this limitation and are unlikely to be practical for the statistical detection of rare variants that associate with disease. Family based (linkage) approaches to identifying rare disease associated genes are still the preferred method in these cases.
ET as an agnostic filter
GWAS have been used to examine almost every common complex disease, and although many SNPs have been dismissed because they do not meet standard multiple testing correction thresholds, it is probable that many of them truly associate with the disease. The detection of these SNPs represents a major challenge in our developing understanding of disease etiology because the current methods produce inflated Type II errors. Thus, although GWAS have been able to discover significant hits in certain diseases, its success is limited in most situations and probably represents the limitations of only using p values for biological data mining , making association result filtering based solely on p value problematic. Therefore, it is important to develop methods that can minimize this type of error while still controlling for the Type I error rate. One way to do this is to integrate knowledge from independent analyses . Integration of several data types has been successful in the analysis of diseases, such as MS . However, pathway-oriented filtering, such as for MS in Baranzini et al., requires prior knowledge of biological functions, which are still poorly understood in many cases. We present an alternative filtering metric based, not on prior biological knowledge, only on evolutionary history that shaped biology. ET also has an advantage over some standard association analyses because, if a variant that increases risk is fixed in a population that has a higher prevalence (i.e., is an etiological agent), it would not be detected in association analyses; however, ET may identify it. A key to the success of ET is having at least three populations, since comparing only two would produce an enormous number of false positives, while adding a third population can reduce this number.
ET only requires knowledge of allele frequencies and disease distributions, but the latter may be a limiting factor for many diseases, but this will change with better surveillance. However, limited surveillance is not the only factor that may affect the utility of ET. For example, variable diagnoses across countries may present an issue as will changing definitions of disease phenotypes over time. The latter will make comparisons over time difficult. The issue of ethnic variation in disease prevalence may, however, be partially ameliorated in coutries that have highly diverse populations, such as the US where data on many continental populations can be collected simultaneously and used to define appropriate ET populations, but this is still not as robust as would be ideal as many populations in such situations are highly admixed. As ET detects differences in local variation the method should still be powerful for the detection of genes that affect disease risk in all populations, i.e. universal genetic risk factors. However, in epistatic models of disease it may not be possible to detect all key genes when local ancestry varies elsewhere in the genome .
Combining FST with other metrics
In this study, we used only FST as a metric of population differentiation. FST is computationally straightforward and thus makes our model easily applied in any population comparisons. Since genetic differentiation could be due to both selection and random drift, and FST does not explicitly differentiate between these two processes, extensions of ET that include measures of selection may be useful for future analyses. An existing statistic that can be used to provide evidence for selection is iHS  (http://haplotter.uchicago.edu/). We investigated iHS patterns in the genomic regions identified in our study using the most stringent FST thresholds among CEU, YRI and GIH. In most regions the iHS scores were larger in CEU than in YRI and ASN (the East Asian HapMap population), indicating stronger selection signals in the CEU population (Additional file 10: Table S8). Clearly, integrating the role of selection into the ET analyses in the future will be an important extension of the method.
Selection may be an important factor in shaping the distribution of disease risk alleles; however, it is not always the mechanism through which important differentiation, which affects phenotypic variation, takes place. One strong example of this is the LCT locus, which has been shown by multiple metrics to be under selection in multiple populations and was identified by ET in our study [9, 41]. Another example of a selected locus affecting disease risk that was identified by ET is APOL1. This gene has been shown to be under selection in African populations, due to protection against human African trypanosomiasis . Different metrics may be able to pick up different signals indicating selection (e.g., Cross Population Extended Haplotype Homozygosity (XP-EHH) , Tajima’s D , Fay and Wu’s H ). In contrast to the phenotypes with comparatively simple genetic architecture in which selection appears to be important in shaping population structure, the role of selection in Type 2 diabetes mellitus is less clear. Recent studies of Type 2 diabetes mellitus genes have indicated that they are not under selective pressure among multiple human populations [46, 47]. However, another paper studying the distribution of Type 2 diabetes mellitus risk alleles indicates that random drift alone has not shaped diabetes’ genetic risk differences from Africa to East Asia , making it difficult to draw strong conclusions at this point. Understanding the role of non-random processes in shaping disease risk will be an important area of research, especially as it affects disease disparity among continental populations in changing environments.
Possible pleiotropy in ET genes
In our analyses of ET genes and their associated diseases, it became obvious that several of the genes identified are risk factors for multiple diseases. Among the genes from the least stringent ET threshold among CEU, YRI and GIH, we discovered that several genes were associated with multiple diseases and traits (Tables 1, Additional file 8: Table S6). For example, IL6 has been associated with 15 phenotypes that are appropriately distributed and IFNG with seven. Both genes are related to immune response, among other processes. It is possible that selection has been important in changing allele frequency distributions, while at the same time having multiple pleiotropic effects that may or may not have been directly driven by selection. Another example is the genes related to melanoma, as they are also associated with skin color, eye color and albinism (SLC45A2 [49, 50]), preterm birth (DHCR7 ), and Vitamin D levels (NADSYN1 ). Interestingly, an ET gene from the most stringent threshold, BNC2, has also been associated with hair and skin color [22, 52, 53]. In this specific case, it seems that skin color is a confounder of other genetic effects. As humans migrated to higher latitudes with insufficient UV radiation, lighter skin color facilitated more vitamin D synthesis in the skin. However, this increased the risk of melanoma. Although having less than five publications indicating associations, BNC2 has also been associated with ovarian cancer in two GWAS, both reaching genome wide significance [54, 55]. We do note that our calculations of the number of associating genes, with appropriately distributed phenotypes, were conservative; each gene was counted only once, even if it associated with multiple pleiotropic phenotypes.
Comparisons to other related methods
Models developed in different studies employ similar evolutionary mindsets to what we have discussed. PBS does not require a specific phenotype prevalence pattern on the third population in the three population comparison, which may make it less powerful than ET in identifying disease associating genes. On the other hand, Hancock et al.  reported that correlations between allele frequencies and climate variables can be applied to detect SNPs associating with pigmentation and autoimmune diseases. In this case, they were looking for local selection due to environmental factors chosen a priori, whereas ET is not limited to specific factors that give rise to the disease prevalence distribution pattern.
Another method was developed to identify loci covarying with the African Pygmy phenotype based on how much additional information is provided by phenotype to infer the geographic origin compared to information provided without genotypes . However, this method requires the phenotype, which is height, to be known for every individual. In our method, we only need to compare disease risk at the population level.
Evolutionary thinking can provide important insights for biomedical research; when combined with current approaches commonly employed in human disease studies, it can increase our ability to find key genes or pathways that affect etiology. By taking advantage of both epidemiological differences and population structure, we demonstrated that many genes associating with diseases can be found. This paper presents a proof of principle for this approach, as well as some of its limitations. Clearly, for phenotypes with simple genetic architecture, ET is an extremely powerful approach, but this becomes less practical for traits of increasingly complex architecture. Nonetheless, for several traits, we were able to identify genes with known effects. It is also possible that many more of the ET genes are truly associating, but have not been reported as such, as they do not meet current thresholds for significance. Therefore, we propose that ET can be a useful filter with which to interrogate existing and new association studies for consistent patterns that might lead to the identification of additional genetic risk factors.
This work was supported by the March of Dimes Prematurity Research Center Ohio Collaborative and grants from the National Institutes of Health (P20 GM103534, 2R01LM010098, and 5R01EY022300-03).
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Shriver MD, Kennedy GC, Parra EJ, Lawson HA, Sonpar V, Huang J, Akey JM, Jones KW. The genomic distribution of population substructure in four populations using 8,525 autosomal SNPs. Human genomics. 2004;1(4):274–86.Google Scholar
- Elhaik E. Empirical distributions of F(ST) from large-scale human polymorphism data. PLoS One. 2012;7(11):e49837.View ArticlePubMedPubMed CentralGoogle Scholar
- Wright S. Evolution and the genetics of populations; a treatise. Chicago: University of Chicago Press; 1968.Google Scholar
- International HapMap C, Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, Peltonen L, Dermitzakis E, Schaffner SF, Yu F, et al. Integrating common and rare genetic variation in diverse human populations. Nature. 2010;467(7311):52–8.Google Scholar
- Weir BS, Cockerham C. Estimating F-statistics for the analysis of population structure. Evolution. 1984;38(6):1358–70.PubMedGoogle Scholar
- Yu W, Gwinn M, Clyne M, Yesupriya A, Khoury MJ. A navigator for human genome epidemiology. Nat Genet. 2008;40(2):124–5.View ArticlePubMedGoogle Scholar
- Marigorta UM, Lao O, Casals F, Calafell F, Morcillo-Suarez C, Faria R, Bosch E, Serra F, Bertranpetit J, Dopazo H et al. Recent human evolution has shaped geographical differences in susceptibility to disease. BMC Genomics. 2011;12:55.Google Scholar
- Swallow DM. Genetics of lactase persistence and lactose intolerance. Annu Rev Genet. 2003;37:197–219.View ArticlePubMedGoogle Scholar
- Tishkoff SA, Reed FA, Ranciaro A, Voight BF, Babbitt CC, Silverman JS, et al. Convergent adaptation of human lactase persistence in Africa and Europe. Nat Genet. 2007;39(1):31–40.View ArticlePubMedGoogle Scholar
- Wang Y, Harvey CB, Pratt WS, Sams VR, Sarner M, Rossi M, Powell K, Mortensen HM, Hirbo JB, Osman M,et al. The lactase persistence/non-persistence polymorphism is controlled by a cis-acting element. Hum Mol Genet. 1995;4(4):657–62.Google Scholar
- Itan Y, Jones BL, Ingram CJ, Swallow DM, Thomas MG. A worldwide correlation of lactase persistence phenotype and genotypes. BMC Evol Biol. 2010;10:36.View ArticlePubMedPubMed CentralGoogle Scholar
- Crombie IK. Racial differences in melanoma incidence. Br J Cancer. 1979;40(2):185–93.View ArticlePubMedPubMed CentralGoogle Scholar
- Jablonski NG, Chaplin G. Colloquium paper: human skin pigmentation as an adaptation to UV radiation. Proc Natl Acad Sci U S A. 2010;107 Suppl 2:8962–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Loomis WF. Skin-pigment regulation of vitamin-D biosynthesis in man. Science. 1967;157(3788):501–6.View ArticlePubMedGoogle Scholar
- Barrett JH, Iles MM, Harland M, Taylor JC, Aitken JF, Andresen PA, Akslen LA, Armstrong BK, Avril MF, Azizi E, et al. Genome-wide association study identifies three new melanoma susceptibility loci. Nat Genet. 2011;43(11):1108–13.Google Scholar
- International Multiple Sclerosis Genetics C, Wellcome Trust Case Control C, Sawcer S, Hellenthal G, Pirinen M, Spencer CC, Patsopoulos NA, Moutsianas L, Dilthey A, Su Z, et al. Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis. Nature. 2011;476(7359):214–9.Google Scholar
- International Multiple Sclerosis Genetics C, Beecham AH, Patsopoulos NA, Xifara DK, Davis MF, Kemppinen A, Cotsapas C, Shah TS, Spencer C, Booth D, et al. Analysis of immune-related loci identifies 48 new susceptibility variants for multiple sclerosis. Nat Genet. 2013;45(11):1353–60.Google Scholar
- Simpson Jr S, Blizzard L, Otahal P, Van der Mei I, Taylor B. Latitude is significantly associated with the prevalence of multiple sclerosis: a meta-analysis. J Neurol Neurosurg Psychiatry. 2011;82(10):1132–41.View ArticlePubMedGoogle Scholar
- Raj T, Kuchroo M, Replogle JM, Raychaudhuri S, Stranger BE, De Jager PL. Common risk alleles for inflammatory diseases are targets of recent positive selection. Am J Hum Genet. 2013;92(4):517–29.View ArticlePubMedPubMed CentralGoogle Scholar
- Luzzatto L. G6PD deficiency and malaria selection. Heredity. 2012;108(4):456.View ArticlePubMedGoogle Scholar
- Ahn J, Yu K, Stolzenberg-Solomon R, Simon KC, McCullough ML, Gallicchio L, Jacobs EJ, Ascherio A, Helzlsouer K, Jacobs KB, et al. Genome-wide association study of circulating vitamin D levels. Hum Mol Genet. 2010;19(13):2739–45.Google Scholar
- Eriksson N, Macpherson JM, Tung JY, Hon LS, Naughton B, Saxonov S, Avey L, Wojcicki A, Pe'er I, Mountain J. Web-based, participant-driven studies yield novel genetic associations for common traits. PLoS Genet. 2010;6(6):e1000993.Google Scholar
- Xie W, Wood AR, Lyssenko V, Weedon MN, Knowles JW, Alkayyali S, Assimes TL, Quertermous T, Abbasi F, Paananen J, et al. Genetic variants associated with glycine metabolism and their role in insulin sensitivity and type 2 diabetes. Diabetes. 2013;62(6):2141–50.Google Scholar
- Chang SC, Pauls DL, Lange C, Sasanfar R, Santangelo SL. Sex-specific association of a common variant of the XG gene with autism spectrum disorders. American journal of medical genetics Part B, Neuropsychiatric genetics : the official publication of the International Society of Psychiatric Genetics. 2013;162B(7):742–50.View ArticleGoogle Scholar
- Chakrabarti B, Dudbridge F, Kent L, Wheelwright S, Hill-Cawthorne G, Allison C, Banerjee-Basu S, Baron-Cohen S. Genes related to sex steroids, neural growth, and social-emotional behavior are associated with autistic traits, empathy, and Asperger syndrome. Autism research : official journal of the International Society for Autism Research. 2009;2(3):157–77.Google Scholar
- Meda SA, Narayanan B, Liu J, Perrone-Bizzozero NI, Stevens MC, Calhoun VD, Glahn DC, Shen L, Risacher SL, Saykin AJ, et al. A large scale multivariate parallel ICA method reveals novel imaging-genetic relationships for Alzheimer’s disease in the ADNI cohort. NeuroImage. 2012;60(3):1608–21.Google Scholar
- Yi X, Liang Y, Huerta-Sanchez E, Jin X, Cuo ZX, Pool JE, Xu X, Jiang H, Vinckenbosch N, Korneliussen TS, et al. Sequencing of 50 human exomes reveals adaptation to high altitude. Science. 2010;329(5987):75–8.Google Scholar
- Enattah NS, Sahi T, Savilahti E, Terwilliger JD, Peltonen L, Jarvela I. Identification of a variant associated with adult-type hypolactasia. Nat Genet. 2002;30(2):233–7.View ArticlePubMedGoogle Scholar
- Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38(8):904–9.View ArticlePubMedGoogle Scholar
- Gerstenblith MR, Shi J, Landi MT. Genome-wide association studies of pigmentation and skin cancer: a review and meta-analysis. Pigment Cell Melanoma Res. 2010;23(5):587–606.View ArticlePubMedPubMed CentralGoogle Scholar
- Ebers GC. Environmental factors and multiple sclerosis. Lancet Neurol. 2008;7(3):268–77.View ArticlePubMedGoogle Scholar
- Howes RE, Piel FB, Patil AP, Nyangiri OA, Gething PW, Dewi M, Hogg MM, Battle KE, Padilla CD, Baird JK, et al. G6PD deficiency prevalence and estimates of affected populations in malaria endemic countries: a geostatistical model-based map. PLoS Med. 2012;9(11):e1001339.Google Scholar
- Tishkoff SA, Varkonyi R, Cahinhinan N, Abbes S, Argyropoulos G, Destro-Bisol G, Drousiotou A, Dangerfield B, Lefranc G, Loiselet J, et al. Haplotype diversity and linkage disequilibrium at human G6PD: recent origin of alleles that confer malarial resistance. Science. 2001;293(5529):455–62.Google Scholar
- Bienzle U, Ayeni O, Lucas AO, Luzzatto L. Glucose-6-phosphate dehydrogenase and malaria. Greater resistance of females heterozygous for enzyme deficiency and of males with non-deficient variant. Lancet. 1972;1(7742):107–10.View ArticlePubMedGoogle Scholar
- Jallow M, Teo YY, Small KS, Rockett KA, Deloukas P, Clark TG, Kivinen K, Bojang KA, Conway DJ, Pinder M, et al. Genome-wide and fine-resolution association analysis of malaria in West Africa. Nat Genet. 2009;41(6):657–65.Google Scholar
- Ciesielski T, Pendergrass S, White M, Kodaman N, Sobota R, Huang M, Bartlett J, Li J, Pan Q, Gui J, et al. Diverse convergent evidence in the genetic analysis of complex disease: coordinating omic, informatic, and experimental evidence to better identify and validate risk factors. BioData mining. 2014;7(1):10.Google Scholar
- Malley JD, Dasgupta A, Moore JH. The limits of p-values for biological data mining. BioData mining. 2013;6(1):10.View ArticlePubMedPubMed CentralGoogle Scholar
- Baranzini SE, Galwey NW, Wang J, Khankhanian P, Lindberg R, Pelletier D, Wu W, Uitdehaag BM, Kappos L, Gene MSAC et al. Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum Mol Genet. 2009;18(11):2078–90.Google Scholar
- Greene CS, Penrod NM, Williams SM, Moore JH. Failure to replicate a genetic association may provide important clues about genetic architecture. PLoS One. 2009;4(6):e5639.View ArticlePubMedPubMed CentralGoogle Scholar
- Voight BF, Kudaravalli S, Wen X, Pritchard JK. A map of recent positive selection in the human genome. PLoS Biol. 2006;4(3):e72.View ArticlePubMedPubMed CentralGoogle Scholar
- Wagh K, Bhatia A, Alexe G, Reddy A, Ravikumar V, Seiler M, Boemo M, Yao M, Cronk L, Naqvi A, et al. Lactase persistence and lipid pathway selection in the Maasai. PLoS One. 2012;7(9):e44751.Google Scholar
- Ko WY, Rajan P, Gomez F, Scheinfeldt L, An P, Winkler CA, Froment A, Nyambo TB, Omar SA, Wambebe C, et al. Identifying Darwinian selection acting on different human APOL1 variants among diverse African populations. Am J Hum Genet. 2013;93(1):54–66.Google Scholar
- Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie X, Byrne EH, McCarroll SA, Gaudet R, et al. Genome-wide detection and characterization of positive selection in human populations. Nature. 2007;449(7164):913–8.Google Scholar
- Tajima F. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics. 1989;123(3):585–95.PubMedPubMed CentralGoogle Scholar
- Fay JC, Wu CI. Hitchhiking under positive Darwinian selection. Genetics. 2000;155(3):1405–13.PubMedPubMed CentralGoogle Scholar
- Southam L, Soranzo N, Montgomery SB, Frayling TM, McCarthy MI, Barroso I, et al. Is the thrifty genotype hypothesis supported by evidence based on confirmed type 2 diabetes- and obesity-susceptibility variants? Diabetologia. 2009;52(9):1846–51.View ArticlePubMedPubMed CentralGoogle Scholar
- Ayub Q, Moutsianas L, Chen Y, Panoutsopoulou K, Colonna V, Pagani L, Prokopenko I, Ritchie GR, Tyler-Smith C, McCarthy MI, et al. Revisiting the thrifty gene hypothesis via 65 loci associated with susceptibility to type 2 diabetes. Am J Hum Genet. 2014;94(2):176–85.Google Scholar
- Corona E, Chen R, Sikora M, Morgan AA, Patel CJ, Ramesh A, Bustamante CD, Butte AJ. Analysis of the genetic basis of disease in the context of worldwide human relationships and migration. PLoS Genet. 2013;9(5):e1003447.Google Scholar
- Beleza S, Johnson NA, Candille SI, Absher DM, Coram MA, Lopes J, Campos J, Araujo, II, Anderson TM, Vilhjalmsson BJ, et al. Genetic architecture of skin and eye color in an African-European admixed population. PLoS Genet. 2013;9(3):e1003372.Google Scholar
- Stokowski RP, Pant PVK, Dadd T, Fereday A, Hinds DA, Jarman C, Filsell W, Ginger RS, Green MR, van der Ouderaa FJ, et al. A genomewide association study of skin pigmentation in a South Asian population. Am J Hum Genet. 2007;81(6):1119–32.Google Scholar
- Bream EN, Leppellere CR, Cooper ME, Dagle JM, Merrill DC, Christensen K, Simhan HN, Fong CT, Hallman M, Muglia LJ, et al. Candidate gene linkage approach to identify DNA variants that predispose to preterm birth. Pediatr Res. 2013;73(2):135–41.Google Scholar
- Visser M, Palstra RJ, Kayser M. Human skin color is influenced by an intergenic DNA polymorphism regulating transcription of the nearby BNC2 pigmentation gene. Hum Mol Genet. 2014;23:5750.View ArticlePubMedGoogle Scholar
- Jacobs LC, Wollstein A, Lao O, Hofman A, Klaver CC, Uitterlinden AG, Nijsten T, Kayser M, Liu F. Comprehensive candidate gene study highlights UGT1A and BNC2 as new genes determining continuous skin color variation in Europeans. Hum Genet. 2013;132(2):147–58.Google Scholar
- Goode EL, Chenevix-Trench G, Song H, Ramus SJ, Notaridou M, Lawrenson K, Widschwendter M, Vierkant RA, Larson MC, Kjaer SK, et al. A genome-wide association study identifies susceptibility loci for ovarian cancer at 2q31 and 8q24. Nat Genet. 2010;42(10):874–9.Google Scholar
- Song H, Ramus SJ, Tyrer J, Bolton KL, Gentry-Maharaj A, Wozniak E, Anton-Culver H, Chang-Claude J, Cramer DW, DiCioccio R, et al. A genome-wide association study identifies a new ovarian cancer susceptibility locus on 9p22.2. Nat Genet. 2009;41(9):996–1000.Google Scholar
- Hancock AM, Witonsky DB, Alkorta-Aranburu G, Beall CM, Gebremedhin A, Sukernik R, Utermann G, Pritchard JK, Coop G, Di Rienzo A. Adaptations to climate-mediated selective pressures in humans. PLoS Genet. 2011;7(4):e1001375.Google Scholar
- Mendizabal I, Marigorta UM, Lao O, Comas D. Adaptive evolution of loci covarying with the human African Pygmy phenotype. Hum Genet. 2012;131(8):1305–17.View ArticlePubMedPubMed CentralGoogle Scholar
- Yoshizawa J, Abe Y, Oiso N, Fukai K, Hozumi Y, Nakamura T, Narita T, Motokawa T, Wakamatsu K, Ito S, et al. Variants in melanogenesis-related genes associate with skin cancer risk among Japanese populations. J Dermatol. 2014;41(4):296–302.Google Scholar
- Gudbjartsson DF, Sulem P, Stacey SN, Goldstein AM, Rafnar T, Sigurgeirsson B, Benediktsdottir KR, Thorisdottir K, Ragnarsson R, Sveinsdottir SG, et al. ASIP and TYR pigmentation variants associate with cutaneous melanoma and basal cell carcinoma. Nat Genet. 2008;40(7):886–91.Google Scholar
- Nelson HH, Kelsey KT, Mott LA, Karagas MR. The XRCC1 Arg399Gln polymorphism, sunburn, and non-melanoma skin cancer: evidence of gene-environment interaction. Cancer Res. 2002;62(1):152–5.PubMedGoogle Scholar
- Cauchi S, Ezzidi I, El Achhab Y, Mtiraoui N, Chaieb L, Salah D, Nejjari C, Labrune Y, Yengo L, Beury D, et al. European genetic variants associated with type 2 diabetes in North African Arabs. Diab Metab. 2012;38(4):316–23.Google Scholar
- Fontaine-Bisson B, Renstrom F, Rolandsson O, Magic, Payne F, Hallmans G, Barroso I, Franks PW. Evaluating the discriminative power of multi-trait genetic risk scores for type 2 diabetes in a northern Swedish population. Diabetologia. 2010;53(10):2155–62.Google Scholar
- Qian Y, Liu S, Lu F, Li H, Dong M, Lin Y, Gong J, Jin G, et al. Genetic variant in fat mass and obesity-associated gene associated with type 2 diabetes risk in Han Chinese. BMC Genet. 2013;14:86.Google Scholar
- Lam VK, Ma RC, Lee HM, Hu C, Park KS, Furuta H, Wang Y, Tam CH, Sim X, Ng DP, et al. Genetic associations of type 2 diabetes with islet amyloid polypeptide processing and degrading pathways in asian populations. PLoS One. 2013;8(6):e62378.Google Scholar
- Zhang SM, Xiao JZ, Ren Q, Han XY, Tang Y, Yang WY, Ji LN. Replication of association study between type 2 diabetes mellitus and IGF2BP2 in Han Chinese population. Chin Med J. 2013;126(21):4013–8.Google Scholar
- Yin YW, Sun QQ, Zhang BB, Hu AM, Liu HL, Wang Q, Zeng YH, Xu RJ, Zhang ZD, Zhang ZG. Association between the interleukin-6 gene −572 C/G polymorphism and the risk of type 2 diabetes mellitus: a meta-analysis of 11,681 subjects. Ann Hum Genet. 2013;77(2):106–14.Google Scholar
- Sandholt CH, Vestmar MA, Bille DS, Borglykke A, Almind K, Hansen L, Sandbaek A, Lauritzen T, Witte D, Jorgensen T, et al. Studies of metabolic phenotypic correlates of 15 obesity associated gene variants. PLoS One. 2011;6(9):e23531.Google Scholar
- Grarup N, Stender-Petersen KL, Andersson EA, Jorgensen T, Borch-Johnsen K, Sandbaek A, Lauritzen T, Schmitz O, Hansen T, Pedersen O. Association of variants in the sterol regulatory element-binding factor 1 (SREBF1) gene with type 2 diabetes, glycemia, and insulin resistance: a study of 15,734 Danish subjects. Diabetes. 2008;57(4):1136–42.Google Scholar