Correlating genetic variants with disease is the primary method used for elucidating the genetic causes of common diseases like Alzheimer’s disease and type 2 diabetes mellitus. Compared to common diseases, the genetic underpinnings of Mendelian diseases such as cystic fibrosis and thalassemia are generally simpler, as they are caused by variation at a single genetic locus. In contrast, common diseases are caused by genetic variants at multiple loci, with each locus conferring modest risk of developing disease. The most common genetic variation is the single nucleotide polymorphism (SNP), which is a DNA sequence variation occurring when a single base pair at a specific location in the genome differs among individuals in a population. Due to the rapidly falling cost of genotyping, the volume of genomic data available to researchers is growing very quickly. A key challenge in the analyses of such data is to develop computational methods that efficiently identify complex associations between genetic variants (such as SNPs) and disease
One example of a complex interaction among genetic variants is epistasis, wherein the genetic variants interact in a nonlinear fashion in their association with disease. Epistatic effects are particularly difficult for many algorithms to detect, because they may require examination of interacting genetic variants taken together, rather than one at a time. Because the number of potential interactions grows exponentially with the number of variants being examined, it is not computationally feasible to evaluate every potential higher order interaction in high-dimensional data. Many current methods address this problem by using a two stage procedure where they identify a subset of variants associated with disease in a univariate fashion in the first stage, and then evaluate interactions only within this smaller set of variants in the second stage
. However, if epistatic variants have small or no main effects, the univariate selection in the first stage will exclude the epistatic variants so that they are not even considered in the second stage. One approach to overcoming this weakness is to use methods in the first stage that will identify variants with main effects, interaction effects, and both main and interaction effects. The Relief algorithms are an excellent candidate for application in the first stage since it is capable of identifying variants with both main and interaction effects.
The Relief algorithms are a family of attribute weighting algorithms that can efficiently identify associations between attributes (e.g., SNPs) and the class (e.g., disease status) even if the attributes have nonlinear interactions (e.g., epistatic) without significant main effects
[3, 4]. These algorithms run in time O(n
a), where n is the number of samples and a is the number of attributes. For SNP data, typically a >> n, making Relief algorithms particularly attractive for this domain. In the next section, we briefly describe the original Relief algorithm, and in the section after that we describe extensions of Relief that have been applied to SNP data.
The Relief algorithm was first described by Kira and Rendell
 as a simple, fast, and effective approach to attribute weighting. The output of the Relief algorithm is a weight between −1 and 1 for each attribute, with more positive weights indicating more predictive attributes. The pseudocode for Relief is shown below. The weight of an attribute is updated iteratively as follows. A sample is selected from the data, and the nearest neighboring sample that belongs to the same class (nearest hit) and the nearest neighboring sample that belongs to the opposite class (nearest miss) are identified. A change in attribute value accompanied by a change in class leads to upweighting of the attribute based on the intuition that the attribute change could be responsible for the class change. On the other hand, a change in attribute value accompanied by no change in class leads to downweighting of the attribute based on the observation that the attribute change had no effect on the class. This procedure of updating the weight of the attribute is performed for a random set of samples in the data or for every sample in the data. The weight updates are then averaged so that the final weight is in the range [−1, 1]. The attribute weight estimated by Relief has a probabilistic interpretation. It is proportional to the difference between two conditional probabilities, namely, the probability of the attribute’s value being different conditioned on the given nearest miss and nearest hit respectively
set W[a] = 0 for each attribute a
for i = 1 to n do
select sample s
from data at random
find nearest hit s
and nearest miss s
for each attribute a do
[a] = diff(a, s
) - diff(a, s
W[a] = W[a] + ΔW
for each attribute a do
W[a] = W[a] / n
where diff(a, s
) = 0, if s
[a] = s
= 1, if s
[a] ≠ s
ReliefF, SURF, and SURF*
The ReliefF algorithm extends Relief to deal with data that is noisy, incomplete, and has more than two classes. The original Relief algorithm uses 2 neighboring samples (1 nearest hit and 1 nearest miss) during each iteration of the outer for loop in the pseudocode above. ReliefF uses 2k neighboring samples (k nearest hits and k nearest misses), and averages their contributions to update the attribute weights
. In contrast to most other feature ranking or feature selection methods that consider attributes univariately, Relief algorithms are able to capture attribute interactions because the global distance measure which defines sample proximity is a multivariate function. However, because the nearest neighbors are identified by a distance measure that incorporates all attributes, the presence of many irrelevant or noisy attributes (as in SNP data) can lead to suboptimal identification of nearest neighbors.
In ReliefF, as the number of nearest neighbors of each class (parameter k) is varied, a trade-off occurs. The addition of more neighbors decreases the algorithm’s susceptibility to noise. However, the inclusion of more neighbors leads to decreased ability in identifying locally interacting attributes. In ReliefF, the number of nearest neighbors used is a user-specified parameter that is constant from iteration to iteration. Spatially sensitive variations of ReliefF, on the other hand, use a neighborhood of constant spatial dimensions, rather than a constant number of neighbors. In densely populated regions of the sample space, more nearby neighbors are used, improving the discriminative power of the algorithm. Conversely, in sparsely populated regions of the sample space, fewer neighbors are used to update the attribute weights. This improves performance because distant uninformative neighbors are not used in order to meet a k neighbor quota, as in ReliefF
The most recently described spatially sensitive methods are Spatially Uniform ReliefF (SURF) and SURF* (called “SURF star”). SURF uses all neighbors found within a certain distance threshold to update the attribute weights. SURF* uses both nearby and distant neighbors to update the attribute weights. Nearby neighbors are used to update the attributes weights in the same way as SURF. In addition, SURF* uses the distant neighbors to update the attribute weights in a manner opposite to the weight update rule used for the nearby neighbors. SURF* has been applied to synthetic genomic data and has been shown to be more accurate than ReliefF and SURF in identifying interacting genetic variants
Iterative variations of ReliefF exist, and some have greater accuracy than the corresponding non-iterative variations
. Tuned ReliefF (TuRF), for example, reduces the effect of noisy attributes by dropping the least predictive attributes from the dataset and running ReliefF iteratively
. Because TuRF essentially runs ReliefF multiple times, improvement of the underlying algorithm will improve the iterative results. As such, we focus our attention on non-iterative variations of Relief.
In this paper, we formulate a new variation of Relief called Sigmoid Weighted ReliefF Star (SWRF*). We evaluate SWRF* on synthetic data and compare its performance to that of ReliefF and SURF*. We also present a framework called the Modular Relief Framework (MoRF) that can be used to develop novel variations of Relief. MoRF facilitates the formulation of new variations of Relief by allowing the specification of new functions for component terms. The SWRF* algorithm was developed using the MoRF framework. In contrast to previous variations of Relief, SWRF* utilizes a soft neighbor weighting kernel.