Database mining for selection of SNP markers useful in admixture mapping



New technologies make it possible for the first time to genotype hundreds of thousands of SNPs simultaneously. A wealth of genomic information in the form of publicly available databases is underutilized as a potential resource for uncovering functionally relevant markers underlying complex human traits. Given the huge amount of SNP data available from the annotation of human genetic variation, data mining is a reasonable approach to investigating the number of SNPs that are informative for ancestry information.


The distribution and density of SNPs across the genome of African and European populations were extensively investigated by using the HapMap, Affymetrix, and Illumina SNP databases. We exploited these resources by mining the data available from each of these databases to prioritize potential candidate SNPs useful for admixture mapping in complex human diseases and traits. Over 4 million SNPs were compared between Africans and Europeans on the basis of a pre-specified recommended allele frequency difference (delta) value of ≥ 0.3.


The method identified 15% of HapMap, 11% of Affymetrix, and 14% of Illumina SNP sets as candidate SNPs, termed ancestry informative markers (AIMs). The AIM panels, with assigned rs numbers, allele frequencies in each ethnic group, delta values, and map positions, are posted on our website. All marker information in this data set is freely and publicly available without restriction.


The selected SNP sets represent valuable resources for admixture mapping studies. The overlap between selected AIMs by this single measure of marker informativeness in the different platforms is discussed.


The chromosomes of an individual from a recently admixed population, such as the African-American population, are mosaics of large chromosomal segments [1], each derived from European or African ancestors, that have not had sufficient time to be broken up by recombination. Hence, allelic associations in these populations may extend over distances as large as 20–30 cM [2, 3]. Methods to map genes that rely on admixture may therefore require fewer markers to screen the genome than would other approaches for mapping complex disease genes [4, 5].

Theoretically, any marker that has an allele frequency difference between ancestral populations, known as an ancestry informative marker (AIM), can be used for admixture mapping [6–10]. Such markers can also be used to control for population confounding by variations in background ancestry via structured association testing (SAT) [11]. An ideal AIM has an allele that is fixed in one population (p = 1.0) and absent from the other [12]. However, most alleles are shared among populations [13–15]. Hence, it is important to identify and choose informative AIMs across populations [16].

Several single nucleotide polymorphism (SNP) panels have been reported over the past few years [7, 8, 16–19] with a focus on identifying markers suitable for admixture studies. Smith et al. [9] screened 744 microsatellite markers for AIMs in 4 different populations and identified a genome spanning set of 315 markers (average spacing 10 cM, frequency difference > 0.3) for mapping in African-Americans and 214 markers (average spacing of 16 cM, frequency difference > 0.25) for mapping in Hispanics. Ninety-seven AIMs were identified for mapping in African-American populations that show limited variation within Africa [10].

Recently 3011 SNP AIMs were reported for studying African-American populations [19], who have an average of 80% African and 20% European ancestry, after screening 450,000 SNPs for which allele frequencies were available. This panel is considered the gold standard for admixture mapping in this population. However, the SNPs used to develop these AIMs came mostly from African-American (98.6%, over 443,916 SNPs) populations, and the ancestral West African frequencies were inferred or estimated by using the expectation-maximization (EM) algorithm [20] rather than by being directly measured.

To date, only a limited amount of information characterizing SNPs across the human genome [21, 22] for the majority of ethnic groups is found in the literature [23]. Consequently, mining of SNP frequencies from HapMap and other genomic data sets including Affymetrix 500 K and Illumina 100 K SNPs with an ethnic-dependent background across the genome is an economical, rapid, and practical strategy for developing a more comprehensive and informative panel of AIMs [19, 24]. This may result in a uniform resource that describes nucleotide diversity with sufficient power to infer ancestry for admixed populations [25], beyond the currently available lists of AIMs. The objectives of the present study were to mine databases and develop AIM panels useful in admixture mapping and compare the selected set of AIMs with the widely used AIM panels.



SNP markers deposited by the HapMap project, the Affymetrix 500 K and Illumina 100 K panels, and the recently published 3011 AIM SNP panel, for all autosomal and sex chromosomes, were used to determine AIMs. The distribution of SNPs in each chromosome and database is shown in Table 1.

Table 1 Number of SNPs investigated for data-mining of AIMs for each chromosome for both Yorubans and European populations listed by genotypic platform or source.

Data mining, processing, and description

We downloaded the HapMap allele frequency data (March 13, 2007 release). The HapMap project contains genotypes from 60 unrelated individuals (parents) from the Yoruba in Ibadan, Nigeria [YRI] and 60 unrelated individuals (parents) from the United States with northern and western European ancestry [CEU]. There are ~5.8 million SNPs in the combined phase 1 and phase 2 HapMap projects [26, 27].

The Affymetrix 500 K allele frequency data were also downloaded. The Affymetrix 500 K array sets contain "quasi-random" or anonymous SNPs that are spread evenly across the genome, are selected on the basis of information content, and may lie between genes. These SNPs were developed for genome-wide association and fine mapping studies. The allele frequency data for the 500 K SNP array were generated from 48 samples: 13 trios (5 HapMap CEU trios, 5 HapMap Yoruban trios, and 3 other non-HapMap trios) and 9 unrelated HapMap Asian samples. In total, 39 of the 48 samples are among the samples used in the HapMap project. About 365,000, or 73%, of the Affymetrix 500 K SNPs have also been typed by the HapMap Project.

The Illumina 100 K allele frequency data were also downloaded. This panel is a gene-centric collection of SNPs (70% of which are located in exons or within 10 kb of transcripts) developed for genome-wide association studies. The Illumina 100 K allele frequency data were generated from 32 CEU and 28 Yoruban samples. Close to 57,000 of the Illumina 100 K SNPs are in the HapMap project. Allele frequencies for the recently published panel of 3011 AIMs [19] were also obtained. A total of 66 European Americans and 64 African Americans genotyped by different platforms were used to generate the 3011 AIMs from the total of 450,000 SNPs screened.

Each of these data sets, which differ in the way the SNPs were selected [28], has characteristics that make it useful for the current investigation. The HapMap offers an extensive collection of SNPs across ancestral population genomes; the Affymetrix 500 K is a comprehensive widely used chip; the Illumina 100 K has a gene-centric focus; and the AIM panel is the current gold standard SNP panel used in admixture mapping.

Data analysis

A Python program was written to export and pre-process the SNP information from the HapMap databases (the code is available upon request). A SAS [29] program was used to analyze the data. We used 3 criteria to select the markers to be considered in our analysis: (1) the SNP should be shared between the 2 ancestral populations; (2) a marker is retained if it has a delta value (i.e., the allele frequency difference between the 2 parental populations) of 0.3 or higher, a cutoff that has been suggested for AIMs [10]; and (3) the genetic map distance between consecutive selected SNPs must be at least 0.3 cM, to reduce the chance of choosing 2 redundant SNPs that are in strong LD [30, 31]. Markers that are sufficiently spaced throughout the genome are expected to offer independent information about genetic background or ancestry. In each 0.3 cM bin, the AIM with the highest delta value was selected to maximize the information content for ancestry.
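The three selection criteria can be sketched in Python as follows. This is a minimal illustration, not the authors' program: the `Snp` record, the `select_aims` function, and the example markers are hypothetical, and the sketch assumes that each input SNP already has a frequency in both populations (criterion 1) and a genetic map position in cM.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Snp:
    rs_id: str
    chrom: str
    cm: float        # genetic map position in centimorgans (assumed available)
    freq_yri: float  # reference-allele frequency in the Yoruban sample
    freq_ceu: float  # reference-allele frequency in the European sample

    @property
    def delta(self) -> float:
        """Absolute allele frequency difference between the two populations."""
        return abs(self.freq_yri - self.freq_ceu)


def select_aims(snps: Iterable[Snp], min_delta: float = 0.3,
                bin_cm: float = 0.3) -> List[Snp]:
    """Keep, per chromosome and per bin of width bin_cm, the SNP with the largest delta."""
    best: Dict[Tuple[str, int], Snp] = {}
    for snp in snps:
        if snp.delta < min_delta:  # criterion 2: delta cutoff
            continue
        key = (snp.chrom, int(snp.cm // bin_cm))  # criterion 3: one marker per bin
        if key not in best or snp.delta > best[key].delta:
            best[key] = snp
    return sorted(best.values(), key=lambda s: (s.chrom, s.cm))


# Hypothetical markers on chromosome 1.
example = [
    Snp("rsA", "1", 0.10, 0.90, 0.20),  # delta 0.70, first bin
    Snp("rsB", "1", 0.25, 0.60, 0.20),  # delta 0.40, first bin, loses to rsA
    Snp("rsC", "1", 0.40, 0.50, 0.45),  # delta 0.05, below the cutoff
    Snp("rsD", "1", 1.00, 0.10, 0.55),  # delta 0.45, a later bin
]
selected = [s.rs_id for s in select_aims(example)]
```

Note that fixed bins do not by themselves guarantee a spacing of at least 0.3 cM, because the best markers of two adjacent bins can lie close to the shared bin boundary; a stricter implementation would add a check on the distance between consecutive selected markers.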

Several methods for measuring marker informativeness for ancestry have been developed and discussed by Rosenberg et al. [12] and others [19, 32]. However, the absolute allele frequency difference (delta) is the most commonly used measure of informativeness for ancestry between 2 parental populations [12]. Marker informativeness for ancestry can be ascertained through the absolute value of the difference in the frequency of a particular allele observed for 2 ancestral populations. If we let $p_{11}$ represent the frequency of a reference allele in the first parental population and $p_{21}$ the frequency of the same allele in the second parental population, then the delta value is given by $\delta = |p_{11} - p_{21}|$. A marker with a delta value of 1 provides perfect information regarding its ancestry, whereas a marker with a delta value of 0 carries no information for ancestry.
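As a minimal illustration of the delta measure, the following Python sketch computes the absolute allele frequency difference for one reference allele. The function name `delta` and the example frequencies are hypothetical and are not taken from the study data.

```python
def delta(p_pop1: float, p_pop2: float) -> float:
    """Absolute difference in reference-allele frequency between two populations."""
    for p in (p_pop1, p_pop2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    return abs(p_pop1 - p_pop2)


# Hypothetical frequencies of the same reference allele in two parental populations.
fixed_difference = delta(1.0, 0.0)  # perfectly informative marker
no_difference = delta(0.25, 0.25)   # uninformative marker
candidate = delta(0.75, 0.25)       # exceeds the 0.3 cutoff used in this study
```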


1. SNP allele frequency characterization, racial variation, and databases

Of the total HapMap SNPs for which both Yoruban and CEU allele frequencies were available, we extracted all the monomorphic SNPs and SNPs with various levels of polymorphism, including 100% informative SNPs between the ancestral populations. Table 2 compares the allele frequency distributions under each scenario of the different databases and shows that there is a slight increase in the proportion of rare variation in the Affymetrix and Illumina groups. From the characterized HapMap, Affymetrix, and Illumina SNP databases, 17.3%, 2.6%, and 1.3%, respectively, were 100% noninformative for ancestry.

Table 2 Distribution of allele frequency differences (Yoruba vs. European) across SNP marker databases

A summary of the interpopulation differences in the HapMap database shows that only 30 of the interpopulation marker comparisons had the largest possible frequency difference, that is, were 100% informative for ancestry (delta = 1), between the 2 ancestral groups (Table 2). The small number of 100% informative SNPs is consistent with prior studies [33, 34] showing that most DNA variation is shared among human populations.

Using a prespecified recommended allele frequency difference (delta) value of ≥ 0.3, on average across the databases and genome, 15% of HapMap, 11% of Affymetrix, and 14% of Illumina SNP sets were AIMs (Table 3). However, only 15507 (0.42%) HapMap SNPs had an allele frequency difference of 0.7 or above. As with CEU, there were large discrepancies in allele frequencies between SNP data for Yoruban populations from the different databases. For example, the reported allele frequencies of the rs55543 SNP from the HapMap, Affymetrix, and Illumina databases were 0.34, 0.31, and 0.42, generated from sample sizes of 120, 48, and 60 samples, respectively. We suspect that the differences in SNP allele frequency data among the databases were due to small sample sizes and the correspondingly large sampling errors of the estimates, as suggested by Dvornyk et al. [23]. The SNP AIM characteristics, with assigned rs numbers, allele frequencies in each ethnic group, delta values, and map positions, are posted on our website. All marker information in this data set is freely and publicly available without restriction.
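The sampling-error explanation can be made concrete with the binomial standard error of an allele frequency estimate, sqrt(p(1 − p)/(2N)) for N unrelated diploid individuals. The sketch below assumes simple random sampling of unrelated individuals; the function name `allele_freq_se`, the frequency of 0.35, and the per-population sample sizes are illustrative assumptions, not values from the study.

```python
import math


def allele_freq_se(p: float, n_individuals: int) -> float:
    """Binomial standard error of an allele frequency from n diploid individuals."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    # Each diploid individual contributes 2 sampled chromosomes.
    return math.sqrt(p * (1.0 - p) / (2 * n_individuals))


# Illustrative only: a true frequency of 0.35 sampled in 60, 30, or 10 individuals.
for n in (60, 30, 10):
    se = allele_freq_se(0.35, n)
    print(f"n={n:3d}  SE={se:.3f}  approx. 95% interval: +/-{1.96 * se:.3f}")
```

Under these assumptions, a sample of a few tens of individuals gives an approximate 95% interval on the order of ±0.1, which is comparable to the differences among platforms reported above.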

Table 3 Number of AIMs and percentage with delta ≥ 0.3 (in parentheses) for HapMap, Affymetrix, Illumina and AIM databases.

2. Number of overlapping AIMs selected from different platforms

We compared the selected AIM lists among the different databases. Even though 57,000 Illumina 100 K SNPs are in the HapMap data set, no SNPs were selected as AIMs in both when we used a delta threshold of ≥ 0.3. Interestingly, the recent 3011 AIM panel [19], developed from databases such as Applied Biosystems, Applera, SeattleSNPs, and dbSNP, is well represented in HapMap (a total of 1479 SNPs were in common with HapMap as AIMs). Affymetrix 500 K and HapMap have about 365,000 SNPs in common. However, among AIMs with a delta value of ≥ 0.3, only 26,388 SNPs overlapped between the 2 databases. As AIMs, few SNPs appeared in both Affymetrix and the recently developed AIM panel; the same was true for HapMap, Affymetrix, and the recently developed AIM panel.

However, there was no overlap in the selected AIMs among Affymetrix, Illumina, HapMap, and the recently developed AIM panel (Table 4). This is not surprising because the SNP selection criteria for each platform differed. For example, Affymetrix SNPs are based on proximity to a restriction site and even distribution across the genome, whereas the Illumina platform SNPs are selected in gene-rich regions and thus are not evenly distributed across the genome [28]. Combining nonoverlapping SNPs from different platforms seems a viable approach to increase power and detect signals across the genome.

Table 4 Number of overlapping SNP AIMs selected by different platforms (HapMap, Affymetrix, Illumina, and AIMs).

However, most SNPs are not fixed in either ancestral population, so we cannot rule out the possibility that the delta measure of informativeness picks different markers on the different platforms. Moreover, the number of individuals (DNA samples) from each of the 2 populations used to estimate allele frequencies, and the laboratory procedures used, vary between platforms. For instance, HapMap data were based on 120 samples, Affymetrix on 48 samples, and Illumina on 60 samples. Hence, we believe that the selected SNPs that are present in at least 2 platforms could be considered the best candidates for admixture mapping.
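The idea of prioritizing AIMs that were selected on at least 2 platforms can be sketched with simple set operations. This is an illustrative sketch only: the function `aims_on_multiple_platforms` and the rs identifiers are hypothetical placeholders, and each panel is assumed to be available as a set of rs numbers that already passed the delta cutoff.

```python
from collections import Counter
from typing import Dict, Set


def aims_on_multiple_platforms(panels: Dict[str, Set[str]],
                               min_platforms: int = 2) -> Set[str]:
    """Return the rs IDs selected as AIMs on at least min_platforms panels."""
    counts = Counter(rs for rs_ids in panels.values() for rs in rs_ids)
    return {rs for rs, n in counts.items() if n >= min_platforms}


# Hypothetical AIM lists keyed by platform name.
panels = {
    "HapMap": {"rs1", "rs2", "rs3"},
    "Affymetrix": {"rs2", "rs3", "rs4"},
    "Illumina": {"rs5"},
}
shared = aims_on_multiple_platforms(panels)
pairwise = len(panels["HapMap"] & panels["Affymetrix"])  # pairwise overlap count
```

Pairwise intersections such as `pairwise` correspond to the kind of counts summarized in an overlap table, whereas `shared` collects every marker that recurs on 2 or more platforms.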

3. Private SNP data set

We observed significant differences in allele frequencies for a few SNPs in the present study. SNPs with significant variation in allele frequencies among populations of different ethnicity may be appropriate for studying the genetic basis of between-ethnic differences in the rates of complex diseases. Although the small sample sizes in this study preclude any definite conclusion regarding the complete absence of a particular allele in any given population, we observed 30 HapMap SNPs (0.001%) that were segregating in only one population sample ("private SNPs"). Most of these private SNPs (77%) were segregating in the African sample, although private SNPs were also observed for European populations. This may be because African populations harbor more unique polymorphic alleles than non-African populations [35]. Follow-up studies of the highly differentiated regions might provide significant insight into phenotypic diversity, selection, and local adaptation between populations. No private SNPs were observed in the Affymetrix and Illumina data sets.


The SNP databases are important resources for performing genetic linkage, association, and admixture studies. Both academic and commercial groups are developing large numbers of genome-wide SNP datasets. These databases now contain over 12.6 million SNPs. However, only a small fraction of these SNPs are well characterized and validated [21]. Users of these data sets have several common questions regarding the existing databases, including the following: What is the frequency spectrum of the SNPs in these databases? What is the distribution picture of these SNPs across different ethnic and geographic populations? What fraction of the total number of SNPs is already captured by these databases?

We mined and compared the HapMap SNP database against the Affymetrix 500 K and the gene-centric Illumina 100 K SNP chips. This comparison suggests that a relatively large fraction (> 80%) of SNPs in these databases do not meet the cutoff for acceptable markers as AIMs [10], which means that they are either of very low frequency or not ancestry informative between the 2 ancestral populations. As a result, we developed and present the AIM panels for each database individually. Our analyses showed that the SNP databases in their current state may have some limitations for studies of complex disorders, especially in different ethnic groups, as a result of incomplete or uneven representation of SNPs along the genome [23]. As indicated above, the different databases have different sets of SNPs. Because the SNP allele frequencies were determined by different genotyping labs that used different sample sizes and genotyping methods (see Methods), it would be difficult to perform several tests to assess data quality and identify sources of experimental variation. In critically evaluating our results, it is important to note that our analyses, and hence interpretations, are subject to several limitations. First, many of our analyses relied on data derived from available databases with contents that are, and will continue to be for some time, in a state of change. Moreover, the allele frequencies across the platforms were based on different sets of DNA samples. Therefore, our results represent a snapshot based on currently available data, and ultimately, when the human genome annotation becomes more stable, it will be important to verify these results. Second, the SNP allele frequencies were determined by using relatively small sample sizes (see Methods), and stochastic variation could affect the robustness of our conclusions.

Several studies have discussed the similarities between human populations in terms of genetic constituents, and hence a large sample size may enable the detection of small differences in rare outcomes. Although we observed a strong correlation in allele frequencies between SNPs from different platforms (data not shown), confirming these allele frequency estimates in a larger sample will be important. Also debatable are the analytical caveats associated with each database, such as how well the Yoruban and CEU samples serve as surrogates for each ancestral population and how much of the data (for example, in HapMap) is transferable to the diverse populations of Africa, where there is extreme adaptive variation among countries.

Most studies consider Europe as a relatively homogeneous population. Consequently, it has been argued that European population stratification does not represent a substantial source of bias in epidemiologic studies [36]. However, recent autosomal SNP studies have highlighted significant patterns of structure within Europe along a north-south axis [37] and also the presence of several significant axes of stratification within Europe, most prominently in a northern-southeastern trend, but also along an east-west axis. That study emphasized the importance of considering population stratification in studies using European and European-American individuals, and the need to develop EuroAIMs (European ancestry informative markers) for ancestry estimation and correction [38]. Moreover, the fundamental premise underpinning HapMap is the common disease/common variant (CD/CV) hypothesis [39]. How much information we can capture from rare variants is not clear [40].


We presented AIM databases for all SNPs that show promise in distinguishing ancestral populations and thus that will be useful in admixture mapping for finding loci influencing complex phenotypes. These databases will also be useful for controlling stratification (or confounding factors) when the variation in admixture levels among individuals causes false-positive associations in genetic association studies. This investment will result in a unique genetic resource of high quality and global importance for genetic studies in admixed populations. Its size and complexity will allow systematic research into the genetics of many complex disorders in admixed populations and thus, by serving a wide variety of disciplines, will feed research in this promising area for many years to come.


  1. Reich D, Patterson N: Will admixture mapping work to find disease genes?. Philos Trans R Soc Lond B Biol Sci. 2005, 360 (1460): 1605-1607. 10.1098/rstb.2005.1691.

  2. Stephens JC, Briscoe D, O'Brien SJ: Mapping by admixture linkage disequilibrium in human populations: limits and guidelines. Am J Hum Genet. 1994, 55 (4): 809-824.

  3. McKeigue PM: Mapping genes underlying ethnic differences in disease risk by linkage disequilibrium in recently admixed populations. Am J Hum Genet. 1997, 60 (1): 188-196.

  4. Hirschhorn JN, Daly MJ: Genome-wide association studies for common diseases and complex traits. Nat Rev Genet. 2005, 6 (2): 95-108. 10.1038/nrg1521.

  5. Smith MW, O'Brien SJ: Mapping by admixture linkage disequilibrium: advances, limitations and guidelines. Nat Rev Genet. 2005, 6 (8): 623-632. 10.1038/nrg1657.

  6. Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, Tyler-Smith C, Carter N, Scherer SW, Tavare S, Deloukas P, Hurles ME, Dermitzakis ET: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007, 315 (5813): 848-853. 10.1126/science.1136678.

  7. Shriver MD, Smith MW, Jin L, Marcini A, Akey JM, Deka R, Ferrell RE: Ethnic-affiliation estimation by use of population-specific DNA markers. Am J Hum Genet. 1997, 60 (4): 957-964.

  8. Parra EJ, Marcini A, Akey J, Martinson J, Batzer MA, Cooper R, Forrester T, Allison DB, Deka R, Ferrell RE, Shriver MD: Estimating African American admixture proportions by use of population-specific alleles. Am J Hum Genet. 1998, 63 (6): 1839-1851. 10.1086/302148.

  9. Smith MW, Lautenberger JA, Shin HD, Chretien JP, Shrestha S, Gilbert DA, O'Brien SJ: Markers for mapping by admixture linkage disequilibrium in African American and Hispanic populations. Am J Hum Genet. 2001, 69 (5): 1080-1094. 10.1086/323922.

  10. Collins-Schramm HE, Phillips CM, Operario DJ, Lee JS, Weber JL, Hanson RL, Knowler WC, Cooper R, Li H, Seldin MF: Ethnic-difference markers for use in mapping by admixture linkage disequilibrium. Am J Hum Genet. 2002, 70 (3): 737-750. 10.1086/339368.

  11. Redden DT, Divers J, Vaughan LK, Tiwari HK, Beasley TM, Fernandez JR, Kimberly RP, Feng R, Padilla MA, Liu N, Miller MB, Allison DB: Regional admixture mapping and structured association testing: conceptual unification and an extensible general linear model. PLoS Genet. 2006, 2 (8): e137-10.1371/journal.pgen.0020137.

  12. Rosenberg NA, Li LM, Ward R, Pritchard JK: Informativeness of genetic markers for inference of ancestry. Am J Hum Genet. 2003, 73 (6): 1402-1422. 10.1086/380416.

  13. Cavalli-Sforza LL, Menozzi P, Piazza A: The history and geography of human genes. 1994, Princeton University Press, Princeton, NJ

  14. Deka R, Shriver MD, Yu LM, Ferrell RE, Chakraborty R: Intra- and inter-population diversity at short tandem repeat loci in diverse populations of the world. Electrophoresis. 1995, 16 (9): 1659-1664. 10.1002/elps.11501601275.

  15. Edwards AW: Human genetic diversity: Lewontin's fallacy. Bioessays. 2003, 25 (8): 798-801. 10.1002/bies.10315.

  16. Shriver MD, Parra EJ, Dios S, Bonilla C, Norton H, Jovel C, Pfaff C, Jones C, Massac A, Cameron N, Baron A, Jackson T, Argyropoulos G, Jin L, Hoggart CJ, McKeigue PM, Kittles RA: Skin pigmentation, biogeographical ancestry and admixture mapping. Hum Genet. 2003, 112 (4): 387-399.

  17. Bonilla C, Parra EJ, Pfaff CL, Dios S, Marshall JA, Hamman RF, Ferrell RE, Hoggart CL, McKeigue PM, Shriver MD: Admixture in the Hispanics of the San Luis Valley, Colorado, and its implications for complex trait gene mapping. Ann Hum Genet. 2004, 68 (Pt 2): 139-153. 10.1046/j.1529-8817.2003.00084.x.

  18. Akey JM, Zhang G, Zhang K, Jin L, Shriver MD: Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 2002, 12 (12): 1805-1814. 10.1101/gr.631202.

  19. Smith MW, Patterson N, Lautenberger JA, Truelove AL, McDonald GJ, Waliszewska A, Kessing BD, Malasky MJ, Scafe C, Le E, De Jager PL, Mignault AA, Yi Z, De The G, Essex M, Sankale JL, Moore JH, Poku K, Phair JP, Goedert JJ, Vlahov D, Williams SM, Tishkoff SA, Winkler CA, De La Vega FM, Woodage T, Sninsky JJ, Hafler DA, Altshuler D, Gilbert DA, O'Brien SJ, Reich D: A high-density admixture map for disease gene discovery in african americans. Am J Hum Genet. 2004, 74 (5): 1001-1013. 10.1086/420856.

  20. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B. 1977, 39: 1-38.

  21. Jiang R, Duan J, Windemuth A, Stephens JC, Judson R, Xu C: Genome-wide evaluation of the public SNP databases. Pharmacogenomics. 2003, 4 (6): 779-789. 10.1517/phgs.4.6.779.22821.

  22. Miller RD, Phillips MS, Jo I, Donaldson MA, Studebaker JF, Addleman N, Alfisi SV, Ankener WM, Bhatti HA, Callahan CE, Carey BJ, Conley CL, Cyr JM, Derohannessian V, Donaldson RA, Elosua C, Ford SE, Forman AM, Gelfand CA, Grecco NM, Gutendorf SM, Hock CR, Hozza MJ, Hur S, In SM, Jackson DL, Jo SA, Jung SC, Kim S, Kimm K, Kloss EF, Koboldt DC, Kuebler JM, Kuo FS, Lathrop JA, Lee JK, Leis KL, Livingston SA, Lovins EG, Lundy ML, Maggan S, Minton M, Mockler MA, Morris DW, Nachtman EP, Oh B, Park C, Park CW, Pavelka N, Perkins AB, Restine SL, Sachidanandam R, Reinhart AJ, Scott KE, Shah GJ, Tate JM, Varde SA, Walters A, White JR, Yoo YK, Lee JE, Boyce-Jacino MT, Kwok PY: High-density single-nucleotide polymorphism maps of the human genome. Genomics. 2005, 86 (2): 117-126. 10.1016/j.ygeno.2005.04.012.

  23. Dvornyk V, Long JR, Xiong DH, Liu PY, Zhao LJ, Shen H, Zhang YY, Liu YJ, Rocha-Sanchez S, Xiao P, Recker RR, Deng HW: Current limitations of SNP data from the public domain for studies of complex disorders: a test for ten candidate genes for obesity and osteoporosis. BMC Genet. 2004, 5: 4-10.1186/1471-2156-5-4.

  24. Kim SK, Borevitz J: Mining the HapMap to dissect complex traits. Genome Biol. 2006, 7 (3): 310-10.1186/gb-2006-7-3-310.

  25. Salas A, Phillips C, Carracedo A: Ancestry vs physical traits: the search for ancestry informative markers (AIMs). Int J Legal Med. 2006, 120 (3): 188-189. 10.1007/s00414-005-0032-3. author reply 190.

  26. Altshuler D, Brooks LD, Chakravarti A: A haplotype map of the human genome. Nature. 2005, 437 (7063): 1299-1320. 10.1038/nature04226.

  27. Thorisson GA, Smith AV, Krishnan L, Stein LD: The International HapMap Project Web site. Genome Res. 2005, 15 (11): 1592-1593. 10.1101/gr.4413105.

  28. Motsinger AA, Ritchie MD, Dobrin SE: Clinical applications of whole-genome association studies: future applications at the bedside. Expert Rev Mol Diagn. 2006, 6 (4): 551-565. 10.1586/14737159.6.4.551.

  29. SAS I: SAS ver. 9.1.3. 2002, Cary, NC: SAS Institute Inc

  30. Price AL, Patterson N, Yu F, Cox DR, Waliszewska A, McDonald GJ, Tandon A, Schirmer C, Neubauer J, Bedoya G, Duque C, Villegas A, Bortolini MC, Salzano FM, Gallo C, Mazzotti G, Tello-Ruiz M, Riba L, Aguilar-Salinas CA, Canizales-Quinteros S, Menjivar M, Klitz W, Henderson B, Haiman CA, Winkler C, Tusie-Luna T, Ruiz-Linares A, Reich D: A genomewide admixture map for Latino populations. Am J Hum Genet. 2007, 80 (6): 1024-1036. 10.1086/518313.

  31. Myers S, Bottolo L, Freeman C, McVean G, Donnelly P: A fine-scale map of recombination rates and hotspots across the human genome. Science. 2005, 310 (5746): 321-324. 10.1126/science.1117196.

  32. McKeigue PM: Mapping genes that underlie ethnic differences in disease risk: methods for detecting linkage in admixed populations, by conditioning on parental admixture. Am J Hum Genet. 1998, 63 (1): 241-251. 10.1086/301908.

  33. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW: Genetic structure of human populations. Science. 2002, 298 (5602): 2381-2385. 10.1126/science.1078311.

  34. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet. 2004, 74 (1): 106-120. 10.1086/381000.

  35. Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science. 2002, 296 (5576): 2225-2229. 10.1126/science.1069424.

  36. Wacholder S, Rothman N, Caporaso N: Counterpoint: bias from population stratification is not a major threat to the validity of conclusions from epidemiological studies of common polymorphisms and cancer. Cancer Epidemiol Biomarkers Prev. 2002, 11 (6): 513-520.

  37. Seldin MF, Shigeta R, Villoslada P, Selmi C, Tuomilehto J, Silva G, Belmont JW, Klareskog L, Gregersen PK: European population substructure: clustering of northern and southern populations. PLoS Genet. 2006, 2 (9): e143-10.1371/journal.pgen.0020143.

  38. Bauchet M, McEvoy B, Pearson LN, Quillen EE, Sarkisian T, Hovhannesyan K, Deka R, Bradley DG, Shriver MD: Measuring European population stratification with microarray genotype data. Am J Hum Genet. 2007, 80 (5): 948-956. 10.1086/513477.

  39. Reich DE, Lander ES: On the allelic spectrum of human disease. Trends Genet. 2001, 17 (9): 502-510. 10.1016/S0168-9525(01)02410-6.

  40. Barnes MR: Navigating the HapMap. Brief Bioinform. 2006, 7 (3): 211-224. 10.1093/bib/bbl021.


This work was supported by National Institutes of Health grants R01NS045934, R01MH066181, P30DK056336, R01GM077490, R01DK52431, and R21LM008791. The opinions expressed are those of the authors and not necessarily those of the NIH or any other organization with which the authors are affiliated.

Electronic Database Information

URL for lists of AIM panels extracted from HapMap, Affymetrix, and Illumina are given:

Author information


Corresponding author

Correspondence to Tesfaye M Baye.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

TMB and RCG conceived the study, and TMB carried out the data mining approaches and drafted the manuscript. HKT and DBA critically commented on the manuscript. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Baye, T.M., Tiwari, H.K., Allison, D.B. et al. Database mining for selection of SNP markers useful in admixture mapping. BioData Mining 2, 1 (2009).
