Spatially Uniform ReliefF (SURF) for computationally-efficient filtering of gene-gene interactions
© Greene et al. 2009
Received: 9 April 2009
Accepted: 22 September 2009
Published: 22 September 2009
Skip to main content
© Greene et al. 2009
Received: 9 April 2009
Accepted: 22 September 2009
Published: 22 September 2009
Genome-wide association studies are becoming the de facto standard in the genetic analysis of common human diseases. Given the complexity and robustness of biological networks such diseases are unlikely to be the result of single points of failure but instead likely arise from the joint failure of two or more interacting components. The hope in genome-wide screens is that these points of failure can be linked to single nucleotide polymorphisms (SNPs) which confer disease susceptibility. Detecting interacting variants that lead to disease in the absence of single-gene effects is difficult however, and methods to exhaustively analyze sets of these variants for interactions are combinatorial in nature thus making them computationally infeasible. Efficient algorithms which can detect interacting SNPs are needed. ReliefF is one such promising algorithm, although it has low success rate for noisy datasets when the interaction effect is small. ReliefF has been paired with an iterative approach, Tuned ReliefF (TuRF), which improves the estimation of weights in noisy data but does not fundamentally change the underlying ReliefF algorithm. To improve the sensitivity of studies using these methods to detect small effects we introduce Spatially Uniform ReliefF (SURF).
SURF's ability to detect interactions in this domain is significantly greater than that of ReliefF. Similarly SURF, in combination with the TuRF strategy significantly outperforms TuRF alone for SNP selection under an epistasis model. It is important to note that this success rate increase does not require an increase in algorithmic complexity and allows for increased success rate, even with the removal of a nuisance parameter from the algorithm.
Researchers performing genetic association studies and aiming to discover gene-gene interactions associated with increased disease susceptibility should use SURF in place of ReliefF. For instance, SURF should be used instead of ReliefF to filter a dataset before an exhaustive MDR analysis. This change increases the ability of a study to detect gene-gene interactions. The SURF algorithm is implemented in the open source Multifactor Dimensionality Reduction (MDR) software package available from http://www.epistasis.org.
Technological advances are rapidly improving geneticists ability to measure variation between individuals. Because of these advances, the genome-wide association study is now a common approach to detecting genetic factors which influence individual susceptibility to common human diseases. Genome-wide association studies targeting common variants which, alone, influence susceptibility have produced mixed results [1–5]. As currently performed, these studies ignore complex interactions between variants that may lead to disease susceptibility. These are often ignored because methods to detect these interactions are computationally infeasible or provide insufficient sensitivity.
Epistasis is a term literally meaning "resting upon" which refers to the situation where interacting genes, as opposed to a single gene, influence a trait. Because of the complex architecture of biological networks, epistasis is likely to be fundamental to an individual's disease risk for common human diseases . This, combined with the knowledge that single-locus results have not frequently replicated for common human diseases [7, 8], indicates that methods to detect and characterize epistasis are likely to be critical to understanding the genetic basis of common human disease.
Detecting and characterizing epistatic interactions in datasets containing large numbers of SNPs is challenging. It requires examining the effect of SNPs not just in isolation, but also in concert with other SNPs. In a dataset with one million SNPs, a number typically provided by high throughput technologies, there are about 5 × 1011 pairwise combinations of SNPs. For three-way combinations, the number is 1.7 × 1017. For higher order interactions the number of combinations is astronomical. Combinatorial methods which evaluate each such combination are not feasible .
Efficient algorithms for identifying sets of SNPs likely to contain predictive models for disease susceptibility are therefore needed. Methods of filtering SNPs are one possibility. These first rank the attributes by some criterion. Then either the top K SNPs or all SNPs above some threshold T are selected. The SNPs within this set can then be analyzed for interactions using combinatorial methods. Stochastic search wrappers are another possibility. These wrappers are probabilistic methods which retain the ability to consider all attributes and have the potential to use information learned early in the search to direct future exploration. Relief algorithms are nearest neighbor based approaches to detecting attributes relevant for some outcome. Relief algorithms are attractive for use in genetic association studies using either filters or wrappers because the computation time required increases linearly with the number of SNPs and quadratically with the number of individuals. Importantly, these algorithms are able to detect attributes associated with disease through interactions or independent main effects, although they do not provide a model for the effect . Instead, information gleaned from these methods can be used as input into other approaches. Stochastic search approaches such as genetic programming [11–13] and ant colony optimization  can successfully develop models in this domain when information from the Relief family of algorithms is used to assist the search, although they fail to detect purely epistatic associations without this additional information . Motsinger et al.  have shown that patterns of correlation between SNPs can make the problem easier to solve in the absence of expert knowledge, although here we specifically examine uncorrelated SNPs. Moore et al. briefly discuss both filter and wrapper options as part of an overall epistasis analysis strategy for human disease susceptibility  and Greene et al.  provide a theoretical analysis of both approaches. For the situation where there is a single source of expert knowledge, the filter approach is most appropriate . In this situation we are considering the success rate of individual Relief methods, each of which is a single source which meets these assumptions up to a good approximation according to the appendix (Additional file 1). For this reason we test the ability of these methods to successfully filter a dataset retaining SNPs with an epistatic interaction associated with disease susceptibility.
Numerous variants of Relief have been developed. When applied to genetic association study data these methods use genetically similar individuals or, equivalently, nearest neighbors to adjust weights which are assigned to each SNP. The nearest neighbor is the nearest individual in the dataset to the current individual calculated across all SNPs. While Relief uses, for each individual, a single nearest neighbor in each class, ReliefF, a variant of Relief, uses multiple nearest neighbors, and thus is more robust when the dataset contains noise . Moore and White developed a Tuned ReliefF (TuRF) approach for human genetics . This approach, though requiring more computer time, further improves the performance when the data contain a large number of non-relevant SNPs in addition to a small number of relevant SNPs. TuRF achieves this by iterating a ReliefF algorithm and, with each iteration, deleting SNPs with the lowest ReliefF weights, i.e. those thought to be least predictive . SNPs are assigned a weight based on their normalized weights when removed. This iterative approach improves the overall ranking of disease associated SNPs because noisy SNPs are most often removed. This means that the re-estimation can more accurately evaluate the relevance of the remaining SNPs.
Here we present a new version of Relief, called Spatially Uniform ReliefF or, briefly, SURF. It detects epistatic interactions with a significantly higher success rate than the Relief variant widely used for machine learning, ReliefF. Iterated SURF, called SURF & TuRF, has a significantly higher success rate than TuRF. For each individual SURF, like ReliefF, adjusts weights of all the SNPs by using certain neighbors of the individual. While ReliefF uses a fixed number of nearest neighbors, SURF uses all neighbors within a fixed distance of the individual. This distance may be thought of as a similarity threshold. Thus SURF uses precisely those neighbors more similar than this threshold. ReliefF, on the other hand, may use either fewer or more neighbors, thereby possibly neglecting informative individuals or including uninformative ones. Furthermore, similarity thresholds which give greater success rate than ReliefF can be estimated from the data while distances are pre-computed, thus removing a nuisance parameter from the algorithm (see §2 in the appendix). SURF also does not increase the complexity of the algorithm, so the scaling is still linear with respect to the number of SNPs and quadratic with respect to the number of individuals.
All Relief algorithms attach a weight to each SNP. The higher the weight of a SNP, the more likely it is predictive of disease status. Genetically similar individuals are used to adjust these SNP weights. We define the distance between two individuals as the number of their SNPs with differing genotypes. With this distance metric, nearest neighbors share genotypes at the greatest number of SNPs, and so are genetically most similar.
Penetrance values for an example epistasis model with a heritability of 0.1.
The individual has the same probability of being sick if he has genotype either Aa or aa, provided again that the genotype of SNP2 is unknown. Similarly, if his genotype is either BB, bB or bb with SNP1's genotypes unknown, the probability he is sick is again .3849. The point is that no single SNP has an effect on disease susceptibility. Only the relevant pair does.
The value of this for the example we have been considering is .005. The expected contribution of individual I i to the score of an irrelevant SNP is 0. This indicates why Relief tends to assign higher scores to relevant SNPs than to irrelevant ones.
The is here since each individual in TM 1 and TH 1 changes the score of a relevant SNP by , on the average.
Returning now to the example model, specifically expressions (1) and (2), we see that |M 2Δ| - |H 2Δ| < 0.
and, consequently, > 0, on average. For the example penetrance function, equation (3) of the appendix gives = .519; however, this SURF score cannot be reasonably compared to the analogous Relief score of .005 without a discussion of the variances of these scores. We do this in the appendix, and also indicate in §5 why SURF outperforms ReliefF using 10 nearest neighbors.
The scores depend on the value of the distance threshold T. In our simulations, we have chosen T to maximize the . The final score of a relevant SNP is the sum of the values for each individual.
Because of the way the variance of this sum varies with T, slightly smaller values of T are probably optimal. This is discussed at the end of §2 of the appendix.
Methods which increase success rate without an increase in computational complexity or sample size are extremely desirable for genome-wide association studies. By developing improved methods for detecting epistasis we greatly expand our ability to characterize interactions in large datasets. Moore argues that when people use sensitive methods to detect epistasis, they are frequently able to find examples of it . Algorithms which both detect and characterize epistasis in the absence of main effects are of combinatorial complexity for the number of SNPs. The SURF algorithm we introduce to detect disease associated interacting SNPs is, like ReliefF, of linear complexity for the number of SNPs. Moreover, its success rate for epistasis analysis is higher than ReliefF's. One caveat is that Relief methods such as SURF, though useful for detecting interacting SNPs, neither identify specific interacting pairs nor develop a model. Because SURF & TuRF is able to detect interacting genetic variants which are predictive of human health, weights from this algorithm can be used to filter a dataset before traditional combinatorial approaches are used to characterize the interaction. McKinney et al. have previously integrated ReliefF  and TuRF  with other information sources using an evaporative cooling technique to direct genetic association analyses. Direct replacement of ReliefF by SURF & TuRF may improve the sensitivity of these frameworks to detect and characterize interactions.
SURF & TuRF's greatly increased success rate to detect epistasis improves our ability to detect variants leading to disease risk in the absence of main effects. This new distance based approach may also be extensible to biological and biomedical data beyond case-control genetic association studies. While ReliefF, which SURF & TuRF builds on, is usable for these discrete endpoints and attribute values, other modifications to ReliefF have extended this machine learning method to other data types. With Regression ReliefF (RReliefF), ReliefF is broadened to handle continuous attributes and endpoints [23, 24]. Future work should examine whether the new distance based approach used for SURF & TuRF also improves these methods. If using a distance threshold also improves RReliefF methods, the sensitive SURF approach can be applied to continuous gene expression data or to detecting variants predictive of continuous endpoints. With future work it may also be possible to combine continuous and discrete attributes, to provide a method capable of examining gene-gene, gene-environment, and environment-environment interactions in a common framework.
Now that it is technically and economically feasible to measure large numbers of DNA sequence variations in human genetics, the bioinformatics challenge is to identify and improve methods for detecting variants which are predictive of disease risk. This is particularly challenging when the task is to identify polymorphisms which have little or no independent effect. The Relief family of algorithms provides one potential solution for SNP selection, and SURF & TuRF is a novel within this family which effectively detects epistasis. By developing sensitive and computationally efficient methods capable of detecting epistasis, it becomes more practical to probe datasets for these interactions. Highly sensitive methods will allow researchers to better understand the impact of epistasis on human health. Both SURF and SURF & TuRF have been included as filtering methods in the user friendly open source Multifactor Dimensionality Reduction (MDR) software package.
As discussed SURF weights can be used for genetic analysis in either filters or probabilistic wrappers. Here we consider the simpler filter approach. Specifically we analyze SURF's ability to filter a dataset to the 99 th , 95 th and 75 th percentiles of SNPs without removing those SNPs with an interaction effect predictive of disease susceptibility. ReliefF has previously been used in the genetic analysis of complex diseases in this fashion .
The goal of our simulation study is to generate artificial datasets with high concept difficulty to evaluate SURF in the domain of human genetics. We first develop 30 different penetrance functions (i.e. genetic models) which determine the relationship between genotype and phenotype in our simulated data. These functions determine the probability that an individual has disease given his or her genotype. This probability depends only on the genotypes of the two interacting SNPs, not on the genotype of any one SNP. The 30 penetrance functions include groups of five with heritabilities of 0.025, 0.05, 0.1, 0.2, 0.3, or 0.4. These heritabilities range from very small to large genetic effect sizes. Each functional SNP has two alleles with frequencies of 0.4 and 0.6. These models are included in Additional file 2. Each of the models is used to generate 100 replicate datasets with sample sizes of 800, 1600, and 3200. Each dataset consists of an equal number of case (diseased) and control (disease free) subjects. Each pair of functional SNPs is added to a set of 998 irrelevant SNPs for a total of 1000 attributes. A total of 9,000 datasets are generated and analyzed.
We test each method with the following parameters. All four methods can use some or all of the dataset when performing weight estimations. Here we use the entire dataset, as this is similar to what is performed in practice where the number of individuals is often more limiting than the computational costs. ReliefF and TuRF require a number of neighbors. Here we use 10, as suggested by Robnik-Sikonja and Kononenko  in a comprehensive analysis. SURF requires a distance threshold. Our theoretical analysis in §2 of the appendix (Additional file 1) suggests that the mean distance between all pairs of individuals in the dataset and across all attributes can be used and thus we use this distance in this situation. By using the mean distance as calculated from the data, we remove this nuisance parameter from the algorithm. Both SURF & TuRF and TURF remove a number of SNPs at each iteration before re-estimating the weights of the remaining SNPs. Here we remove 25 SNPs at each iteration (2.5% of the dataset).
Here, because we are interested in interactions, we consider the success rate to be the number of times that both relevant SNPs are scored above a given threshold. We set this stricter standard here because further analysis steps can not succeed of both relevant parts of the interaction are not discovered. To estimate the success rate, we use 100 datasets for each of the 30 models. Specifically, the percentage of datasets for which a method ranks the two relevant SNPs above the N th percentile of all SNPs is the estimate of the method's success rate. We apply Fisher's exact test to assess the significance of differences between the success rates of the tested methods at these thresholds. These percentiles represent the situation where each method is used to filter a large dataset with 1000 SNPs to a smaller dataset of 10, 50, and 250 SNPs respectively. Fisher's exact test is a significance test appropriate for categorical count data . The resulting p-value for this test can be interpreted as the likelihood of seeing a difference of the size observed among success rates when the methods do not differ. We consider results statistically significant when p ≤ 0.05. Additionally, we graphically show results for filtering to each percentile from the 99 th to the 50 th .
This work is funded by NIH grants LM009012, AI59694, HD047447, and ES007373. The authors would like to thank Mr. Jason Gilmore for his technical assistance.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.