Multifactor dimensionality reduction analysis identifies specific nucleotide patterns promoting genetic polymorphisms
© Arehart et al; licensee BioMed Central Ltd. 2009
Received: 21 May 2008
Accepted: 30 March 2009
Published: 30 March 2009
The fidelity of DNA replication serves as the nidus for both genetic evolution and genomic instability fostering disease. Single nucleotide polymorphisms (SNPs) constitute greater than 80% of the genetic variation between individuals. A new theory regarding DNA replication fidelity has emerged in which selectivity is governed by base-pair geometry through interactions between the selected nucleotide, the complementary strand, and the polymerase active site. We hypothesize that specific nucleotide combinations in the flanking regions of SNP fragments are associated with mutation.
We modeled the relationship between DNA sequence and observed polymorphisms using the novel multifactor dimensionality reduction (MDR) approach. MDR was originally developed to detect synergistic interactions between multiple SNPs that are predictive of disease susceptibility. We initially assembled data from the Broad Institute as a pilot test for the hypothesis that flanking region patterns associate with mutagenesis (n = 2194). We then confirmed and expanded our inquiry with human SNPs within coding regions and their flanking sequences collected from the National Center for Biotechnology Information (NCBI) database (n = 29967) and a control set of sequences (coding region) not associated with SNP sites randomly selected from the NCBI database (n = 29967). We discovered seven flanking region pattern associations in the Broad dataset which reached a minimum significance level of p ≤ 0.05. Significant models (p << 0.001) were detected for each SNP type examined in the larger NCBI dataset. Importantly, the flanking region models were elongated or truncated depending on the nucleotide change. Additionally, nucleotide distributions differed significantly at motif sites relative to the type of variation observed. The MDR approach effectively discerned specific sites within the flanking regions of observed SNPs and their respective identities, supporting the collective contribution of these sites to SNP genesis.
The present study represents the first use of this computational methodology for modeling nonlinear patterns in molecular genetics. MDR was able to identify distinct nucleotide patterning around sites of mutations dependent upon the observed nucleotide change. We discovered one flanking region set that included five nucleotides clustered around a specific type of SNP site. Based on the strongly associated patterns identified in this study, it may become possible to scan genomic databases for such clustering of nucleotides in order to predict likely sites of future SNPs, and even the type of polymorphism most likely to occur.
The fidelity of DNA replication serves as the nidus for both genetic evolution and genomic instability fostering disease. Our knowledge of these processes requires an understanding of polymerase fidelity and the means by which genes are faithfully copied, proofread, and maintained in the face of environmental factors. Single nucleotide polymorphisms (SNPs) constitute greater than 80% of the genetic variation between individuals, with such alterations observed every 1000 to 2000 nucleotides when comparing two human gene sequences within the genome. Initially, Watson and Crick proposed that hydrogen bonding between complementary bases secured accurate DNA replication. However, abundant evidence indicates that the free energy differences between correct and incorrect base pairing are not enough to account for the observed selectivity of most DNA polymerases. A new theory has emerged in which selectivity is governed by base-pair geometry through interactions between the selected nucleotide, the complementary strand, and the polymerase active site.
Accurate DNA replication is therefore governed by correct nucleotide insertion, and in the case of polymerases with proofreading ability, by the favored extension of correctly paired complementary strands. Kunkel and others proposed an "induced-fit" model for nucleotide selection where the incoming nucleotide moves the polymerase from an open to a closed configuration [3, 6–8]. In this reaction mechanism, DNA polymerase binds DNA forming an "open-state" complex enabling 2'-deoxyribonucleoside 5'-triphosphate (dNTP) binding in an open ternary complex. A conformational change to the "closed-state" follows dNTP incorporation to the primer 3'-terminus. The polymerase subsequently returns to an open-state, releasing pyrophosphate (PPi) [8–10].
The overall structure of DNA polymerases is comparable to a right hand with palm, finger, and thumb domains. Structural studies have shown that the templating strand bends upon exiting the polymerase catalytic site. This allows the finger domain to interact with the minor groove of the elongating strands, thereby reading the conformation of downstream base-pairing. Also, the templating strand diverts the next template base from the active site, fostering correct template reading on the part of the polymerase. Not surprisingly, polymerase amino acid side-chain interactions play a critical role in efficiency and fidelity. In the case of polymerase β (pol β), Wilson and colleagues found that Asp-276 and Lys-280 form stacking interactions with the incoming nucleotide and the template, with their deletion reducing catalytic efficiency and accuracy. Earlier work had shown that loss of Arg-283 hydrogen bonding and van der Waals interactions with the minor groove of the templating nucleotide of the nascent base pair decreases catalysis and reduces polymerase fidelity [14–16].
Other investigators have studied the effect of flanking regions on polymerase fidelity. Zhao and Boerwinkle examined neighboring-nucleotide effects on SNP genesis in the human genome and found a bias in regard to nucleotide identity in the flanking regions. Their work examined the proportion of each nucleotide neighboring the polymorphic site and found a large bias relative to the averages found in the human genome. This was particularly the case for positions immediately bordering the polymorphic site. Importantly, their work identified distinct bias patterns for differing transition and transversion types, as well as a bias relative to chromosome number. Subsequent work by Zhang and Zhao found neighboring-nucleotide bias in the mouse genome when compared with human SNPs. Our work presented here offers a more thorough picture of nucleotide bias in the flanking region of polymorphic sequences. These findings are novel in that they address synergistic interactions between nucleotide positions and the polymorphic site, adding considerable detail regarding the flanking region nucleotide patterns associated with specific transitions and transversions.
To account for the possibility of nonadditive interactions among sequences, we utilize MDR methodology developed specifically for detecting nonlinear patterns of discrete attributes predictive of a discrete endpoint. The MDR method, and associated software, was originally developed to detect interactions among genetic variations in population-based studies of disease susceptibility [19–23]. The goal of MDR is to change the representation space of the data to make nonlinear interactions easier to detect and characterize. Thus, MDR can be seen in a broader sense as a data processing step preceding classification. At the heart of MDR is an attribute construction algorithm, pooling levels from multiple discrete factors to create a new discrete attribute.
We hypothesize that certain sequence combinations in the flanking regions of SNPs predispose toward mutation due to effects on primer strand geometry within the polymerase active site and interactions with side-chains essential for proper catalytic function, possibly altering solvation dynamics within the active site. The goal of the present study is to identify nucleotide patterns in SNP flanking regions that predispose to mutation. To accomplish this goal, we employed a novel machine learning method, multifactor dimensionality reduction (MDR), capable of identifying nonlinear patterns among discrete attributes (nucleotides) and discrete endpoints (mutation type). We found both common and unique nucleotide patterns in the flanking regions of various polymorphism types and delineated detailed associations indicative of neighboring-nucleotide effects.
The goal of this approach is to identify combinations of nucleotides predictive of mutation type. Defining a new attribute as a function of two or more other attributes is referred to as constructive induction, or attribute construction, and was first described by Michalski et al. Constructive induction using MDR is accomplished in the following way: given a threshold T, a combination of levels from two or more attributes is considered 'associated' with the class of interest if the ratio of class A to class B exceeds T; otherwise it is considered 'not associated'. Once multifactor level combinations are labeled 'associated' and 'not associated', a new binary attribute is created with those two levels. Here, the classes are SNP(+) and SNP(-), with each attribute representing the nucleotide at a specific position in the flanking sequences.
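The attribute construction step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MDR software itself: the function names, the toy sequences, and the default threshold of 1.0 are hypothetical, and a combination seen only among SNP(+) sequences is treated as having an infinite ratio.

```python
from collections import Counter


def mdr_construct(samples, positions, threshold=1.0):
    """Pool nucleotide combinations at the given positions into a binary attribute.

    samples: list of (sequence, label) pairs; label 1 = SNP(+), 0 = SNP(-).
    positions: sequence indices forming the candidate model (hypothetical input).
    Returns a dict mapping each observed combination to True ('associated')
    or False ('not associated').
    """
    cases, controls = Counter(), Counter()
    for seq, label in samples:
        combo = tuple(seq[i] for i in positions)
        (cases if label == 1 else controls)[combo] += 1

    labels = {}
    for combo in set(cases) | set(controls):
        n_case, n_ctrl = cases[combo], controls[combo]
        # Ratio of SNP(+) to SNP(-); assumed infinite when no controls carry the combo.
        ratio = n_case / n_ctrl if n_ctrl else float("inf")
        labels[combo] = ratio > threshold
    return labels


def classification_error(samples, positions, labels):
    """Fraction of samples whose class disagrees with the constructed attribute."""
    wrong = 0
    for seq, label in samples:
        combo = tuple(seq[i] for i in positions)
        predicted = 1 if labels.get(combo, False) else 0
        wrong += predicted != label
    return wrong / len(samples)


# Toy data, invented purely for illustration.
toy = [("ACGT", 1), ("ACGA", 1), ("TCGT", 0), ("ACGT", 0), ("ACGC", 1), ("TCGA", 0)]
model = mdr_construct(toy, positions=(0, 3))
error = classification_error(toy, (0, 3), model)
```

Each key of `model` is one multifactor cell; the True/False value is the level of the new binary attribute, and `error` is the kind of classification error used to compare candidate models.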
To further test our findings, we applied chi-square (χ²) analysis to each of the datasets. In general, chi-square testing demonstrated a robust level of significance (often with p values below the 0.0001 level) far greater than that found using the 1000-fold permutation testing approach. This is not surprising when one considers that 1000-fold permutation testing examines the predictive power of the complete flanking region model rather than each nucleotide position within that model separately.
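To make the per-position test concrete, the following sketch computes a Pearson chi-square statistic for the nucleotide counts at a single flanking position in SNP(+) versus SNP(-) sequences. The function name and the example counts are assumptions for illustration, the layout of the contingency table is one plausible reading of the analysis, and the code assumes every column has at least one observation.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a count table.

    table: list of rows, e.g. [case_counts, control_counts], each row holding
    the counts of A, C, G, T observed at one flanking position.
    Assumes no row or column total is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


# Hypothetical counts of A, C, G, T at one position (cases, then controls).
stat, dof = chi_square_statistic([[40, 20, 25, 15], [25, 25, 25, 25]])
```

Because this test is applied to one position at a time, it does not assess the joint model, which is the distinction the paragraph above draws with permutation testing.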
Broad Institute Dataset
Broad Dataset with Flanking Positions Identified by A, G, C, and T Character
Broad Dataset with Flanking Positions Identified as either purine or pyrimidine
Among all combinations of the two classes, a single model for the high risk group is constructed with the best SNP+/SNP- ratio. Single best multifactorial models are selected for each of the two- to n-factor combinations. The model with the best predictive power, that is, the lowest prediction error, is then selected. The final multifactorial model is thus selected from the classification errors and prediction errors. Statistical significance was ascertained by comparing the average cross-validation consistency of the SNP(+) sets to the consistencies of the SNP(-) sets (the null groups) derived from 1,000 permutations. The null hypothesis was rejected when the upper value of the Monte Carlo P value derived from the permutation test was ≤ 0.05. MDR computation methods have been used previously with good success in analyzing epistatic models of disease where multiple genes interact with one another in the disease model.
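The logic of the permutation test can be illustrated with a short Python sketch. It is a simplified stand-in rather than the procedure used in the study: the score here is the training accuracy of a single-position majority-vote classifier, whereas the text above compares cross-validation consistency, and all function names and the add-one correction are assumptions.

```python
import random
from collections import Counter, defaultdict


def majority_accuracy(position):
    """Build a score function: accuracy of majority-class voting at one position."""
    def score(sequences, labels):
        counts = defaultdict(Counter)
        for seq, label in zip(sequences, labels):
            counts[seq[position]][label] += 1
        correct = sum(max(c.values()) for c in counts.values())
        return correct / len(labels)
    return score


def permutation_p_value(score_fn, sequences, labels, n_permutations=1000, seed=0):
    """Monte Carlo P value: how often shuffled labels score at least as well."""
    rng = random.Random(seed)
    observed = score_fn(sequences, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between sequence and class
        if score_fn(sequences, shuffled) >= observed:
            at_least_as_good += 1
    # Add-one correction (an assumption here) keeps the estimate above zero.
    return (at_least_as_good + 1) / (n_permutations + 1)
```

A model would be called significant under this sketch when the returned value is at or below 0.05, mirroring the rejection rule stated above.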
NCBI Dataset Distributions (table). Columns: SNP type, % of Cases. Rows: A or C; A or G; A or T; C or G; C or T; G or T.
Our NCBI data set was initially composed of 920,181 human sequences with varying SNP character. Sequences were downloaded as follows: The query, (((("homosapiens" [Organism] AND "true" [Genotype]) AND (("coding nonsynonymous" [Function class] OR "intron" [Function class]) OR "coding synonymous" [Function class])) AND "sequence" [METHODCLASS]) AND "snp" [SnpClass]), was performed on November 2, 2006 using dbSNP build 126. The resulting 920,181 records were collected in FASTA format for post-query parsing using a series of in-house developed Perl scripts. The initial records were later pruned to 29,967 due to inconsistencies in the original dataset. The first 20 nucleotides of each sequence became an unmatched control sequence, with the requirement that control strands contain no characters other than A/a, C/c, T/t, or G/g. The 10 nucleotides immediately flanking the identified SNP site on each side were extracted as a case sequence. Additionally, flanking regions including 20 nucleotides in each direction were extracted, but demonstrated no pattern association differences from the 10-nucleotide strands. Case and control sequences were collated into tab-delimited MDR input files, with sequences labeled as case (1) or control (0), according to MDR system specifications.
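The extraction and collation steps above were carried out with in-house Perl scripts that are not shown. As a rough illustration only, the Python sketch below builds case and control rows and writes them as tab-delimited text. The function names, the column headers, and the assumption that the SNP position is already known as a zero-based index are all hypothetical.

```python
import csv
import io

VALID_BASES = set("ACGT")


def case_row(seq, snp_index, flank=10):
    """Return the flank bases on each side of the SNP site, or None if unusable."""
    seq = seq.upper()
    left = seq[max(0, snp_index - flank):snp_index]
    right = seq[snp_index + 1:snp_index + 1 + flank]
    window = left + right
    if len(left) != flank or len(right) != flank or not set(window) <= VALID_BASES:
        return None
    return list(window)


def control_row(seq, length=20):
    """Return the first `length` bases as a control, or None if any is not A/C/G/T."""
    window = seq.upper()[:length]
    if len(window) != length or not set(window) <= VALID_BASES:
        return None
    return list(window)


def write_mdr(rows, handle, flank=10):
    """Write (bases, class_label) rows as tab-delimited text; 1 = case, 0 = control."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumed header: positions -flank..-1, +1..+flank, then the class column.
    header = [str(i) for i in range(-flank, 0)] + [f"+{i}" for i in range(1, flank + 1)]
    writer.writerow(header + ["Class"])
    for bases, label in rows:
        writer.writerow(bases + [label])


# Invented example: 'R' marks the polymorphic site at index 10.
example = "A" * 10 + "R" + "C" * 10
buffer = io.StringIO()
write_mdr([(case_row(example, 10), 1), (control_row("acgt" * 5), 0)], buffer)
```

Rows for which either helper returns None would be dropped before writing, which corresponds to the requirement that sequences contain only A, C, G, or T.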
Results and discussion
It has been previously determined that certain replication errors are influenced by flanking regions adjacent to the mutation site. Small frameshifts of one-base deletions are made on undamaged DNA by DNA pol μ, pol λ, pol β, and Escherichia coli pol IV. One such example is Streisinger slippage, resulting in simple deletions by a process of looping out of one or more bases as the primer moves along a strand of reiterated template bases. This mechanism plays a role in the trinucleotide expansion seen in Huntington disease, fragile X syndrome, and myotonic dystrophy, among others. Other work regarding HIV type 1 reverse transcriptase (RT) found that RT side chain interactions affected polymerase fidelity and specifically that correct T-dAMP insertion was affected by the 5'-CT GG primer sequence in the binding pocket. These studies were performed to evaluate potential independent effects of sites within the flanking regions as well as synergistic interactions between sites. Therefore, knowledge of the potential for nucleotide clusters to predispose some genomic sites to spontaneous mutation offers enormous benefit in the study of viral and bacterial mutation leading to drug resistance as well as the identification of potential pre-cancerous genetic lesions and genomic instability leading to human developmental diseases. To our knowledge, this is the first instance of MDR methods employed to evaluate the potential role of flanking regions on mutagenesis.
We tested the hypothesis that specific types of replication errors (changes to and from each combination of adenine, cytosine, guanine, or thymine) would be associated with distinct flanking region patterns. Sequences were refined first from the Broad Institute database and then from the NCBI dbSNP, as described in methods. The Broad dataset, although relatively small, provides directionality of nucleotide change and would serve as an ideal pilot set to test the power of MDR. The larger NCBI set could then be used to confirm, expand, and refine the identified models.
Broad Institute Dataset
The Broad Institute dataset represented a small collection of sequences (n = 2194) compared to the larger NCBI dataset (n = 29,967) and was chosen as a pilot study to evaluate the application of MDR methodology to flanking region pattern associations with single nucleotide polymorphisms. Each position in the flanking region was identified by its specific nucleotide, generating four distinct models with positions identified as A, G, C, or T. When analyzing data sets with the nucleotide identity at each position, we discovered four SNP models that reached significance (Table 1). As we will shortly describe, four models in the Broad dataset (Table 1) were confirmed in the larger NCBI dataset. Analyzing each position as purine or pyrimidine, rather than by specific nucleotide identity, three models reached significance (Table 2). Where either G or A is observed at the SNP site, position +1 was again significantly associated with SNP occurrence (p = 0.02). This was consistent with results for the same dataset when positions were identified as A, G, C, or T. The Y model (C or T) showed position +2 to be significantly associated with SNP genesis of this type at the p < 0.01 level. And the K model (G or T) demonstrated positions -2 and +2 to be significantly associated with occurrence of the T/G polymorphism, p = 0.02.
To further investigate the possible contribution of flanking regions to SNP mutation sites, we classified each of the sequence positions with regard to their purine or pyrimidine identity (Table 2). This was done to explore the role of pyrimidine/purine template strand content, previously found to play a role in the catalytic efficiency and fidelity of pol β, which may also play a role in the fidelity of other polymerases. The same data were employed as before, with the distinction of labeling each of the ten flanking positions in the 3' and 5' direction as either purine or pyrimidine and then performing the same MDR methods as stated above.
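The recoding step amounts to collapsing the four-letter alphabet into two classes before running the same analysis. A minimal Python sketch follows; the function name is hypothetical, and the use of the IUPAC letters R (purine) and Y (pyrimidine) as output labels is an assumption made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def to_purine_pyrimidine(seq):
    """Recode a flanking sequence as R (purine, A/G) or Y (pyrimidine, C/T)."""
    recoded = []
    for base in seq.upper():
        if base in PURINES:
            recoded.append("R")
        elif base in PYRIMIDINES:
            recoded.append("Y")
        else:
            # Sequences with other characters are rejected rather than guessed at.
            raise ValueError(f"unexpected base: {base!r}")
    return "".join(recoded)


recoded_example = to_purine_pyrimidine("ACGTacgt")
```

Each recoded position then has two possible levels instead of four, so a model over the same positions pools the data into fewer multifactor cells.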
When we examined the Broad dataset identifying the positions in the flanking regions only by their purine or pyrimidine identity we also discovered an overlapping of flanking region sets seen previously in the NCBI dataset. In this instance, the R (A or G) model again indicated significance at position +1 (p < 0.02). Nucleotide position +1 was also found to have significant association with the S (C or G) and K (G or T) polymorphism-type models in the Broad Institute dataset. The Y (C or T) model favors nucleotide position +2 in its motif. In the larger NCBI dataset, position +2 is included in the motif only for the Y (C or T) model, but not for the W (A or T) and K (G or T) models. Also in the Broad dataset, the K (G or T) model includes nucleotide positions -2 and +2 (p < 0.02), whereas in the NCBI dataset, only the Y (C or T) and R (A or G) flanking region sets include positions -2 and +2.
NCBI Transition and Transversion Distributions in Exonic and Intronic Sequences (table). Columns: % of Total, Total # Records.
NCBI Dataset Motifs (table). Columns: SNP Model *, flanking positions in motif, significance. Motif position sets listed: (-1, +1, +2); (-1, +1, +2); (-1, +1, +2); (-1, +1, +2, +3); (-2, -1, +1, +2, +3). All models: P << 0.001.
Flanking nucleotide distribution for each SNP-type
In addition to permutation testing for these models, we also performed χ² analysis as an alternative method. All models demonstrated significance at or below the p = 0.001 level. This is not surprising given the tendency of χ² to overpredict synergistic models. Ultimately, MDR takes a more conservative approach to significance testing due to the requirement that the models are tested as a unit rather than as individual contributions to the model.
Our analysis of two datasets has shown the existence of neighboring nucleotide patterns that persist across identified single-nucleotide polymorphism (SNP) models. Importantly, this pattern grows to include additional nucleotide positions depending on the type of polymorphism observed. We have also found that for a given SNP type, the distribution of nucleotides within the flanking region shows specificity in association with certain SNP types. Comparison of the Broad Institute dataset with the NCBI dataset demonstrated an overlap in flanking region sets and provided some directionality with regard to SNP genesis. Such studies will allow for the development of more powerful and predictive algorithms offering the possibility of predicting both occurrence and direction of SNP genesis in vivo.
1. Stoneking M: Single nucleotide polymorphisms. From the evolutionary past. Nature. 2001, 409: 821-2. 10.1038/35057279.
2. Watson JD, Crick FH: Molecular structure of nucleic acids; a structure for deoxyribose nucleic acid. Nature. 1953, 171: 737-8. 10.1038/171737a0.
3. Kunkel TA, Bebenek K: DNA replication fidelity. Annu Rev Biochem. 2000, 69: 497-529. 10.1146/annurev.biochem.69.1.497.
4. Engel JD, von Hippel PH: Effects of methylation on the stability of nucleic acid conformations. Studies at the polymer level. J Biol Chem. 1978, 253: 927-34.
5. Lewis DA, Bebenek K, Beard WA, Wilson SH, Kunkel TA: Uniquely altered DNA replication fidelity conferred by an amino acid change in the nucleotide binding pocket of human immunodeficiency virus type 1 reverse transcriptase. J Biol Chem. 1999, 274: 32924-30. 10.1074/jbc.274.46.32924.
6. Arndt JW, Gong W, Zhong X, Showalter AK, Liu J, Dunlap CA, Lin Z, Paxson C, Tsai MD, Chan MK: Insight into the catalytic mechanism of DNA polymerase beta: structures of intermediate complexes. Biochemistry. 2001, 40: 5368-75. 10.1021/bi002176j.
7. Beard WA, Shock DD, Vande Berg BJ, Wilson SH: Efficiency of correct nucleotide insertion governs DNA polymerase fidelity. J Biol Chem. 2002, 277: 47393-8. 10.1074/jbc.M210036200.
8. Yang L, Arora K, Beard WA, Wilson SH, Schlick T: Critical role of magnesium ions in DNA polymerase beta's closing and active site assembly. J Am Chem Soc. 2004, 126: 8441-53. 10.1021/ja049412o.
9. Beard WA, Wilson SH: Structural insights into DNA polymerase beta fidelity: hold tight if you want it right. Chem Biol. 1998, 5: R7-13. 10.1016/S1074-5521(98)90081-3.
10. Sawaya MR, Prasad R, Wilson SH, Kraut J, Pelletier H: Crystal structures of human DNA polymerase beta complexed with gapped and nicked DNA: evidence for an induced fit mechanism. Biochemistry. 1997, 36: 11205-15. 10.1021/bi9703812.
11. Ollis DL, Brick P, Hamlin R, Xuong NG, Steitz TA: Structure of large fragment of Escherichia coli DNA polymerase I complexed with dTMP. Nature. 1985, 313: 762-6. 10.1038/313762a0.
12. Osheroff WP, Beard WA, Yin S, Wilson SH, Kunkel TA: Minor groove interactions at the DNA polymerase beta active site modulate single-base deletion error rates. J Biol Chem. 2000, 275: 28033-8.
13. Beard WA, Shock DD, Yang XP, DeLauder SF, Wilson SH: Loss of DNA polymerase beta stacking interactions with templating purines, but not pyrimidines, alters catalytic efficiency and fidelity. J Biol Chem. 2002, 277: 8235-42. 10.1074/jbc.M107286200.
14. Osheroff WP, Beard WA, Wilson SH, Kunkel TA: Base substitution specificity of DNA polymerase beta depends on interactions in the DNA minor groove. J Biol Chem. 1999, 274: 20749-52. 10.1074/jbc.274.30.20749.
15. Ahn J, Werneburg BG, Tsai MD: DNA polymerase beta: structure-fidelity relationship from Pre-steady-state kinetic analyses of all possible correct and incorrect base pairs for wild type and R283A mutant. Biochemistry. 1997, 36: 1100-7. 10.1021/bi961653o.
16. Beard WA, Osheroff WP, Prasad R, Sawaya MR, Jaju M, Wood TG, Kraut J, Kunkel TA, Wilson SH: Enzyme-DNA interactions required for efficient nucleotide incorporation and discrimination in human DNA polymerase beta. J Biol Chem. 1996, 271: 12141-4. 10.1074/jbc.271.21.12141.
17. Zhao Z, Boerwinkle E: Neighboring-nucleotide effects on single nucleotide polymorphisms: a study of 2.6 million polymorphisms across the human genome. Genome Res. 2002, 12: 1679-86. 10.1101/gr.287302.
18. Zhao Z, Zhang F: Sequence context analysis in the mouse genome: single nucleotide polymorphisms and CpG island sequences. Genomics. 2006, 87: 68-74. 10.1016/j.ygeno.2005.09.012.
19. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69: 138-47. 10.1086/321276.
20. Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003, 24: 150-7. 10.1002/gepi.10218.
21. Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003, 19: 376-82. 10.1093/bioinformatics/btf869.
22. Hahn LW, Moore JH: Ideal discrimination of discrete clinical endpoints using multilocus genotypes. In Silico Biol. 2004, 4: 183-94.
23. Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006, 241: 252-61. 10.1016/j.jtbi.2005.11.036.
24. Michalski RS, Baskin AB, Spackman KA: A logic-based approach to conceptual data base analysis. Med Inform (Lond). 1983, 8: 187-95.
25. Martin ER, Ritchie MD, Hahn L, Kang S, Moore JH: A novel method to identify gene-gene effects in nuclear families: the MDR-PDT. Genet Epidemiol. 2006, 30: 111-23. 10.1002/gepi.20128.
26. Velez S, Feder JL: Integrating biogeographic and genetic approaches to investigate the history of bioluminescent colour alleles in the Jamaican click beetle, Pyrophorus plagiophthalamus. Mol Ecol. 2006, 15: 1393-404. 10.1111/j.1365-294X.2005.02793.x.
27. Tippin B, Kobayashi S, Bertram JG, Goodman MF: To slip or skip, visualizing frameshift mutation dynamics for error-prone DNA polymerases. J Biol Chem. 2004, 279: 45360-8. 10.1074/jbc.M408600200.
28. Efrati E, Tocco G, Eritja R, Wilson SH, Goodman MF: Abasic translesion synthesis by DNA polymerase beta violates the "A-rule". Novel types of nucleotide incorporation by human DNA polymerase beta at an abasic lesion in different sequence contexts. J Biol Chem. 1997, 272: 2559-69. 10.1074/jbc.272.4.2559.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.