Database mining for selection of SNP markers useful in admixture mapping

Background New technologies make it possible for the first time to genotype hundreds of thousands of SNPs simultaneously. A wealth of genomic information in the form of publicly available databases is underutilized as a potential resource for uncovering functionally relevant markers underlying complex human traits. Given the huge amount of SNP data available from the annotation of human genetic variation, data mining is a reasonable approach to investigating the number of SNPs that are informative for ancestry. Methods The distribution and density of SNPs across the genome of African and European populations were extensively investigated by using the HapMap, Affymetrix, and Illumina SNP databases. We exploited these resources by mining the data available from each of these databases to prioritize potential candidate SNPs useful for admixture mapping in complex human diseases and traits. Over 4 million SNPs were compared between Africans and Europeans on the basis of a pre-specified recommended allele frequency difference (delta) value of ≥ 0.3. Results The method identified 15% of HapMap, 11% of Affymetrix, and 14% of Illumina SNP sets as candidate SNPs, termed ancestry informative markers (AIMs). These AIM panels with assigned rs numbers, allele frequencies in each ethnic group, delta value, and map positions are all posted on our website. All marker information in this data set is freely and publicly available without restriction. Conclusion The selected SNP sets represent valuable resources for admixture mapping studies. The overlap between selected AIMs by this single measure of marker informativeness in the different platforms is discussed.


Background
The chromosomes of an individual from a recently admixed population, such as the African-American population, resemble mosaics of large chromosomal segments [1], each derived from European or African ancestry, that have not had sufficient time to break up as a result of recombination. Hence, allelic associations in these populations may extend over distances as large as 20-30 cM [2,3]. Methods to map genes that rely on admixture may therefore require fewer markers to screen the genome than would other approaches for mapping complex disease genes [4,5].
Theoretically, any marker [6-10] that shows an allele frequency difference between ancestral populations, known as an ancestry informative marker (AIM), can be used for admixture mapping. Such markers can also be used to control for population confounding by variations in background ancestry via structured association testing (SAT) [11]. The ideal AIM has one allele that is fixed in one population (p = 1.0) and absent from the other [12]. However, most alleles are shared among populations [13-15]. Hence, it is important to identify and choose informative AIMs across populations [16].
Several single nucleotide polymorphism (SNP) panels have been reported over the past few years [7,8,16-19] with a focus on identifying markers suitable for admixture studies. Smith et al. [9] screened 744 microsatellite markers for AIMs in 4 different populations and identified a genome-spanning set of 315 markers (average spacing of 10 cM, frequency difference > 0.3) for mapping in African-Americans and 214 markers (average spacing of 16 cM, frequency difference > 0.25) for mapping in Hispanics. Ninety-seven AIMs that show limited variation within Africa were identified for mapping in African-American populations [10].
Recently 3011 SNP AIMs were reported for studying African-American populations [19], who have an average of 80% African and 20% European ancestry, after screening 450,000 SNPs for which allele frequencies were available. This panel is considered the gold standard for admixture mapping in this population. However, the SNPs used to develop these AIMs came mostly from African-American (98.6%, over 443,916 SNPs) populations, and the ancestral West African frequencies were inferred or estimated by using the expectation-maximization (EM) algorithm [20] rather than by being directly measured.
To date, only a limited amount of information characterizing SNPs across the human genome [21,22] for the majority of ethnic groups is found in the literature [23]. Consequently, mining of SNP frequencies from HapMap and other genomic data sets including Affymetrix 500 K and Illumina 100 K SNPs with an ethnic-dependent background across the genome is an economical, rapid, and practical strategy for developing a more comprehensive and informative panel of AIMs [19,24]. This may result in a uniform resource that describes nucleotide diversity with sufficient power to infer ancestry for admixed populations [25], beyond the currently available lists of AIMs. The objectives of the present study were to mine databases and develop AIM panels useful in admixture mapping and compare the selected set of AIMs with the widely used AIM panels.

Materials
SNP markers deposited by the HapMap project, the Affymetrix 500 K and Illumina 100 K sets, and the recently published 3011 AIM SNP panel for all autosomal and sex chromosomes were used to determine AIMs. The distribution of SNPs in each chromosome and database is shown in Table 1.
The Affymetrix 500 K set was downloaded from http://www.affymetrix.com. The Affymetrix 500 K array sets contain "quasi-random" or anonymous SNPs that are spread evenly across the genome, are selected on the basis of information content, and may lie between genes. These SNPs were developed for genome-wide association and fine mapping studies. Each of these data sets, which differ in the way the SNPs were selected [28], has characteristics that make it useful for the current investigation. The HapMap offers an extensive collection of SNPs across ancestral population genomes; the Affymetrix 500 K is a comprehensive, widely used chip; the Illumina 100 K has a gene-centric focus; and the AIM panel is the current gold standard SNP panel used in admixture mapping.

Data analysis
A computer program written in Python (http://www.python.org) was used to export and pre-process the SNP information from the HapMap databases (the code is available upon request). A SAS [29] program was used to analyze the data. We used 3 criteria to select the markers to be considered in our analysis: (1) the SNP should be shared between the 2 ancestral populations; (2) a specific marker is retained if it has a delta value (i.e., the allele frequency difference between the 2 parental populations) of 0.3 or higher, a cutoff that has been suggested for AIMs [10]; and (3) the distance between consecutive selected SNPs must be at least 0.3 cM, to reduce the chance of choosing 2 redundant SNPs that are in strong LD [30,31]. It is expected that markers that are sufficiently spaced throughout the genome will offer independent information about genetic background or ancestry. In each 0.3 cM bin, the AIM with the highest delta value was selected to maximize the information content for ancestry.
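The three selection criteria can be illustrated with a minimal Python sketch. This is not the authors' SAS program. The `Snp` record, its field names, the function `select_aims`, and the rs identifiers in the example are hypothetical, and the sketch assumes that a genetic map position in cM is already available for each SNP. A missing frequency (`None`) stands in for a SNP that is not shared between the two population samples.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Snp(NamedTuple):
    """Hypothetical record for one SNP; field names are illustrative only."""
    rsid: str
    chrom: str
    pos_cm: float               # genetic map position in cM (assumed known)
    freq_afr: Optional[float]   # reference-allele frequency, African sample
    freq_eur: Optional[float]   # reference-allele frequency, European sample


def select_aims(snps: Iterable[Snp],
                min_delta: float = 0.3,
                bin_cm: float = 0.3) -> List[Snp]:
    """Keep, per 0.3 cM bin, the shared SNP with the largest delta >= min_delta."""
    best: Dict[Tuple[str, int], Tuple[float, Snp]] = {}
    for snp in snps:
        # Criterion 1: the SNP must have a frequency in both populations.
        if snp.freq_afr is None or snp.freq_eur is None:
            continue
        # Criterion 2: absolute allele frequency difference of at least min_delta.
        d = abs(snp.freq_afr - snp.freq_eur)
        if d < min_delta:
            continue
        # Criterion 3 (approximated by binning): one SNP per bin on each chromosome.
        key = (snp.chrom, int(snp.pos_cm // bin_cm))
        if key not in best or d > best[key][0]:
            best[key] = (d, snp)
    chosen = [snp for _, snp in best.values()]
    return sorted(chosen, key=lambda s: (s.chrom, s.pos_cm))


# Made-up example records; the rs identifiers are placeholders, not real SNPs.
example = [
    Snp("rsA", "1", 0.10, 0.80, 0.20),
    Snp("rsB", "1", 0.25, 0.90, 0.10),
    Snp("rsC", "1", 0.50, 0.50, 0.45),
    Snp("rsD", "1", 0.70, 0.10, 0.60),
    Snp("rsE", "2", 0.10, None, 0.60),
]
selected = [s.rsid for s in select_aims(example)]
```

Binning keeps at most one marker per 0.3 cM window, but two markers chosen from adjacent bins can still lie closer than 0.3 cM. A strict minimum-spacing rule would need an extra pass over the sorted output.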
Several methods for measuring marker informativeness for ancestry have been developed and discussed by Rosenberg et al. [12] and others [19,32]. However, the absolute allele frequency difference (delta) is the most commonly used measure of informativeness for ancestry between 2 parental populations [12]. Marker informativeness for ancestry can be ascertained through the absolute value of the difference in the frequency of a particular allele observed for 2 ancestral populations. If we let $p_{11}$ represent the frequency of a reference allele in the first parental population and $p_{21}$ the frequency of the same allele in the second parental population, then the delta value is given by $\delta = |p_{11} - p_{21}|$. A marker with a delta value of 1 provides perfect information regarding its ancestry, whereas a marker with a delta value of 0 carries no information for ancestry.
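As a small worked illustration of the delta measure, the following Python sketch computes the absolute allele frequency difference for one marker. The function name `delta` and the example frequencies are illustrative assumptions, not values from the databases.

```python
def delta(p_first: float, p_second: float) -> float:
    """Absolute difference in reference-allele frequency between two populations."""
    for p in (p_first, p_second):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"allele frequency out of range: {p}")
    return abs(p_first - p_second)


# An allele fixed in one population and absent from the other gives delta = 1.
fully_informative = delta(1.0, 0.0)
# Identical frequencies in both populations give delta = 0 (no ancestry information).
uninformative = delta(0.45, 0.45)
```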

SNP allele frequency characterization, racial variation, and databases
Of the total HapMap SNPs for which both Yoruban and CEU allele frequencies were available, we extracted all the monomorphic SNPs and SNPs with various levels of polymorphism, including 100% informative SNPs between the ancestral populations. Table 2 compares the allele frequency distributions under each scenario of the different databases and shows that there is a slight increase in the proportion of rare variation in the Affymetrix and Illumina groups. From the characterized HapMap, Affymetrix, and Illumina SNP databases, 17.3%, 2.6%, and 1.3%, respectively, were 100% noninformative for ancestry.
A summary of the interpopulation differences using the HapMap databases shows that only 30 of the interpopulation marker comparisons had very large frequency differences, that is, were 100% informative for ancestry (delta = 1) between the 2 ancestral groups (Table 2). The small number of SNPs that are 100% informative for ancestry is consistent with prior studies [33,34] showing that most DNA variation is shared among human populations.
Using a prespecified recommended allele frequency difference (delta) value of ≥ 0.3, on average across the databases and genome, 15% of HapMap, 19% of Affymetrix, and 15% of Illumina SNP sets were AIMs (Table 3). However, only 15507 (0.42%) HapMap SNPs had an allele frequency difference of 0.7 or above. As was the case for CEU, there were large discrepancies in allele frequencies between SNP data for Yoruban populations from the different databases. For example, the reported allele frequencies of the rs55543 SNP from the HapMap, Affymetrix, and Illumina databases were 0.34, 0.31, and 0.42, estimated from sample sizes of 120, 48, and 60, respectively. We suspect that the differences in SNP allele frequency data among the databases were likely due to small sample sizes and the correspondingly large sampling errors of the estimates, as suggested by Dvornyk et al. [23]. The SNP AIM characteristics with assigned rs numbers, allele frequencies in each ethnic group, delta value, and map positions are all posted on our website http://www.ssg.uab.edu/downloads/admixture_mapping/SNPAIMs.txt. All marker information in this data set is freely and publicly available without restriction. (Table 4). This is not surprising because the SNP selection criteria for each platform differed. For example, Affymetrix SNPs are based on proximity to a restriction site and even distribution across the genome, whereas the Illumina platform SNPs are selected in gene-rich regions and thus are not evenly distributed across the genome [28]. Combining nonoverlapping SNPs from different platforms seems a viable approach to increase power and detect signals across the genome.
However, most SNPs are not fixed among ancestral populations, and so we cannot rule out the chance that the delta measure of informativeness picks different markers in the different platforms. Moreover, the average sample size (number of individuals or DNA samples) in each of the 2 populations used to estimate allele frequencies and the laboratory procedures used vary between platforms. For instance, HapMap data were based on 120 samples, Affymetrix on 48 samples, and Illumina on 60 samples. Hence, we believe that the selected SNPs that are present in at least 2 platforms could be considered the best candidates for admixture mapping.
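The idea of retaining AIMs that were selected in at least 2 platforms can be sketched in Python as follows. The function name, the dictionary layout, and the rs identifiers are hypothetical placeholders. The sketch only shows one way to count how many platform-specific panels contain each marker.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def aims_in_multiple_platforms(panels: Dict[str, Iterable[str]],
                               min_platforms: int = 2) -> Set[str]:
    """Return rs identifiers that appear in at least `min_platforms` AIM panels."""
    counts: Counter = Counter()
    for rsids in panels.values():
        # set() guards against an identifier listed twice within one panel.
        counts.update(set(rsids))
    return {rsid for rsid, n in counts.items() if n >= min_platforms}


# Placeholder panels; these identifiers are made up for illustration.
example_panels = {
    "HapMap": ["rs1", "rs2", "rs3"],
    "Affymetrix": ["rs2", "rs4"],
    "Illumina": ["rs3", "rs2", "rs5"],
}
shared_aims = aims_in_multiple_platforms(example_panels)
```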

Private SNP data set
We observed significant differences in the allele frequencies of a few SNPs in the present study. These SNPs with significant variation in allele frequencies in populations of different ethnicity may be appropriate for studying the genetic basis of between-ethnic differences in the rates of complex diseases. Although the small sample sizes in this study preclude any definite conclusion regarding the complete absence of a particular allele in any given population, we observed 30 HapMap SNPs (0.001%) that were segregating in only one population sample ("private SNPs"). Most of these private SNPs (77%) were segregating in the African sample, although private SNPs were also observed for European populations. This may be because African populations harbor more unique polymorphic alleles than non-African populations [35]. Follow-up studies of the highly differentiated regions might provide significant insight into phenotypic diversity, selection, and local adaptation between populations. No private SNPs were observed in the Affymetrix and Illumina data sets.

Discussion
The SNP databases are important resources for performing genetic linkage, association, and admixture studies. Both academic and commercial groups are developing large numbers of genome-wide SNP datasets. These databases now contain over 12.6 million SNPs. However, only a small fraction of these SNPs are well characterized and validated [21]. Users of these data sets have several common questions regarding the existing databases, including the following: What is the frequency spectrum of the SNPs in these databases? What is the distribution picture of these SNPs across different ethnic and geographic populations? What fraction of the total number of SNPs is already captured by these databases?
We mined and compared the HapMap SNP database against the Affymetrix 500 K and the gene-centric Illumina 100 K SNP chips. This comparison suggests that a relatively large fraction (> 80%) of SNPs in these databases do not meet the cutoff for acceptable markers as AIMs [10], which means that they are either of very low frequency or not ancestry informative between the 2 ancestral populations. As a result, we developed and present the AIM panels for each database individually. Our analyses showed that the SNP databases in their current status might have some limitations for studies of complex disorders, especially in different ethnic groups, as a result of incomplete or uneven representation of SNPs along the genome [23]. As indicated above, the different databases have different sets of SNPs. Because the SNP allele frequencies were determined by different genotyping labs that used different sample sizes and genotyping methods (see Methods), it would be difficult to perform several tests to assess data quality and identify sources of experimental variation. In critically evaluating our results, it is important to note that our analyses, and hence interpretations, are subject to several limitations. First, many of our analyses relied on data derived from available databases with contents that are, and will continue to be for some time, in a state of change. Moreover, the allele frequencies across the platforms were based on different sets of DNA samples. Therefore, our results represent a snapshot based on currently available data, and ultimately, when the human genome annotation becomes more stable, it will be important to verify these results. Second, the SNP allele frequencies were determined by using relatively small sample sizes (see Methods), and stochastic variation could affect the robustness of our conclusions.
Several studies have discussed the similarities between human populations in terms of genetic constituents, and hence a large sample size may be needed to detect small differences in rare outcomes. Although we observed a strong correlation in allele frequencies between SNPs from different platforms (data not shown), confirming these allele frequency estimates in a larger sample will be important. The analytical caveats associated with each database are also debatable, such as how well Yorubans or CEU serve as surrogates for each ancestral population and how much of the data (for example, in HapMap) is transferable to the diverse populations in Africa, where there is extreme adaptive variation among countries.
Most studies consider Europe as a relatively homogeneous population. Consequently, it has been argued that European population stratification does not represent a substantial source of bias in epidemiologic studies [36]. However, recent autosomal SNP studies have highlighted significant patterns of structure within Europe along a north-south axis [37] and also the presence of several significant axes of stratification within Europe, most prominently in a northern-southeastern trend, but also along an east-west axis. The latter study emphasized the importance of considering population stratification in studies using European and European-American individuals, and the need to develop EuroAIMs (European ancestry informative markers) for ancestry estimation and correction [38]. Moreover, the fundamental premise underpinning HapMap is the common disease/common variant (CD/CV) hypothesis [39]. How much information we can capture from rare variants is not clear [40].
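To give a rough sense of how much stochastic variation such sample sizes allow, the sketch below computes the binomial standard error of an allele frequency estimate in Python. It is an illustration under stated assumptions, not an analysis performed in this study. The function name `allele_freq_se` is hypothetical, and `n_alleles` is assumed to be the number of independently sampled allele copies. If a reported sample size counts diploid individuals, twice that number should be passed instead.

```python
import math


def allele_freq_se(p: float, n_alleles: int) -> float:
    """Binomial standard error of an allele frequency estimated from n_alleles copies."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"allele frequency out of range: {p}")
    if n_alleles <= 0:
        raise ValueError("n_alleles must be positive")
    return math.sqrt(p * (1.0 - p) / n_alleles)


# Frequencies and sample sizes quoted in the Results for a single SNP across
# the three databases; treating each size as a count of allele copies is an
# assumption made only for this illustration.
reported = {"HapMap": (0.34, 120), "Affymetrix": (0.31, 48), "Illumina": (0.42, 60)}
standard_errors = {name: allele_freq_se(p, n) for name, (p, n) in reported.items()}
```

Under this assumption, the smaller samples give standard errors of several percentage points, so part of the between-database difference in a single SNP's frequency could plausibly arise from sampling alone.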

Conclusion
We presented AIM databases for all SNPs that show promise in distinguishing ancestral populations and thus that will be useful in admixture mapping for finding loci influencing complex phenotypes. These databases will also be useful for controlling stratification (or confounding factors) when the variation in admixture levels among individuals causes false-positive associations in genetic association studies. This investment will result in a unique genetic resource of high quality and global importance for genetic studies in admixed populations. Its size and complexity will allow systematic research into the genetics of many complex disorders in admixed populations and thus, by serving a wide variety of disciplines, will feed research in this promising area for many years to come.