Spatially Uniform ReliefF (SURF) for computationally-efficient filtering of gene-gene interactions

Background
Genome-wide association studies are becoming the de facto standard in the genetic analysis of common human diseases. Given the complexity and robustness of biological networks, such diseases are unlikely to be the result of single points of failure, but instead likely arise from the joint failure of two or more interacting components. The hope in genome-wide screens is that these points of failure can be linked to single nucleotide polymorphisms (SNPs) which confer disease susceptibility. Detecting interacting variants that lead to disease in the absence of single-gene effects is difficult, however, and methods that exhaustively analyze sets of these variants for interactions are combinatorial in nature and thus computationally infeasible. Efficient algorithms which can detect interacting SNPs are needed. ReliefF is one such promising algorithm, although it has a low success rate on noisy datasets when the interaction effect is small. ReliefF has been paired with an iterative approach, Tuned ReliefF (TuRF), which improves the estimation of weights in noisy data but does not fundamentally change the underlying ReliefF algorithm. To improve the sensitivity of studies using these methods to detect small effects, we introduce Spatially Uniform ReliefF (SURF).

Results
SURF's ability to detect interactions in this domain is significantly greater than that of ReliefF. Similarly, SURF in combination with the TuRF strategy significantly outperforms TuRF alone for SNP selection under an epistasis model. Importantly, this increase in success rate requires no increase in algorithmic complexity, and is achieved even while removing a nuisance parameter from the algorithm.

Conclusion
Researchers performing genetic association studies and aiming to discover gene-gene interactions associated with increased disease susceptibility should use SURF in place of ReliefF.
For instance, SURF should be used instead of ReliefF to filter a dataset before an exhaustive MDR analysis. This change increases the ability of a study to detect gene-gene interactions. The SURF algorithm is implemented in the open source Multifactor Dimensionality Reduction (MDR) software package available from .


Penetrance functions
We assume that we are given a sample of N individuals, each of class either s, for sick, or w, for well. Two relevant, epistatic SNPs, say SNP 1 and SNP 2, govern the distribution of these classes according to a given penetrance function. All other SNPs, numbered 3 through N_I + 2, have no effect.
We also assume that each SNP of each individual is in one of three states: 1, 2, or 3, corresponding to the genotypes AA, Aa, and aa, respectively. Let p_ij be the probability that SNP i of a random individual is in state j, and put P_i = (p_i1, p_i2, p_i3), a vector associated with the ith SNP. We assume Hardy-Weinberg equilibrium holds. So if the frequency of the major allele of SNP i is p, then p_i1 = p², p_i2 = 2p(1 − p) and p_i3 = (1 − p)². The states of different SNPs are independent. Thus P(ij) = p_1i p_2j, where P(ij) is the probability an individual has genotype ij, meaning SNP 1 has state i and SNP 2 state j.
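As a concrete illustration, the Hardy-Weinberg genotype probabilities above are easy to compute; the sketch below is a minimal example in which the allele frequency 0.4 is an assumed value, not one taken from the paper's models.

```python
def genotype_probs(p):
    """Hardy-Weinberg genotype probabilities (p_i1, p_i2, p_i3), i.e.
    P(AA), P(Aa), P(aa), for a SNP whose major-allele frequency is p."""
    return (p * p, 2 * p * (1 - p), (1 - p) ** 2)

# Example with an assumed major-allele frequency of 0.4.
P = genotype_probs(0.4)
print(P)                            # (0.16..., 0.48, 0.36)
print(abs(sum(P) - 1.0) < 1e-12)    # the three probabilities sum to 1
```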
Set f_ij = P(s|ij), the probability that an individual is in class s if his genotype is ij. These probabilities are given by the penetrance function. Let C_i = (f_1i, f_2i, f_3i) and R_i = (f_i1, f_i2, f_i3) for i = 1, 2 and 3. (The six vectors R_i and C_i are just the rows and columns of the penetrance table.) To say that SNP 1 and SNP 2 are an epistatic pair means that all six marginal penetrances P_1 · C_i and P_2 · R_i, i = 1, 2, 3, are equal. Their common value is often denoted by k, and is the probability an individual is sick regardless of her genotype.
We will often use P(ij|s), the probability a sick individual has genotype ij. By Bayes' formula,

P(ij|s) = P(ij)P(s|ij)/k = p_1i p_2j f_ij / k.
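These definitions can be checked numerically. The sketch below uses a hypothetical XOR-style penetrance table with both major-allele frequencies equal to 0.5 (an illustrative purely epistatic model, not one of the paper's penetrance functions): all six marginal penetrances come out equal, and P(ij|s) computed by Bayes' formula sums to 1.

```python
p = 0.5
P1 = P2 = (p * p, 2 * p * (1 - p), (1 - p) ** 2)   # Hardy-Weinberg: (0.25, 0.5, 0.25)

# Hypothetical XOR penetrance table (states indexed 0-2 here):
# f[i][j] = P(s | SNP 1 in state i, SNP 2 in state j),
# with sick iff exactly one of the two SNPs is heterozygous.
f = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]

# Marginal penetrances P_2 . R_i (rows) and P_1 . C_j (columns).
row_marg = [sum(P2[j] * f[i][j] for j in range(3)) for i in range(3)]
col_marg = [sum(P1[i] * f[i][j] for i in range(3)) for j in range(3)]
print(row_marg + col_marg)          # all six equal 0.5, so the pair is epistatic

k = row_marg[0]                     # common value k = P(s)

# Bayes' formula: P(ij|s) = p_1i * p_2j * f_ij / k.
P_ij_given_s = [[P1[i] * P2[j] * f[i][j] / k for j in range(3)] for i in range(3)]
print(sum(sum(r) for r in P_ij_given_s))   # 1.0, as a probability distribution must
```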
Similarly, P(ij|w) = P(ij)P(w|ij)/(1 − k) = p_1i p_2j (1 − f_ij)/(1 − k). These, together with the assumption that SNPs 1 and 2 are epistatic, give, for fixed i,

Σ_j P(ij|s) = Σ_j P(ij|w) = p_1i.

Similarly, for fixed j, Σ_i P(ij|s) = Σ_i P(ij|w) = p_2j.

For n = 0, 1 and 2, let P_H(n∆) be the probability that exactly n of the two relevant SNPs change state in going from an arbitrary individual to another of the same class. Similarly, P_M(n∆) are, by definition, probabilities of numbers of state changes for arbitrary misses, rather than hits. There are two types of hits: sick-to-sick and well-to-well. Probabilities involving these analogous to the P_H(n∆) are denoted by P_ss(n∆) and P_ww(n∆). We have

P_ss(0∆) = Σ_{i,j} P(ij|s)².

Also,

P_ss(1∆) = ||P_1||² + ||P_2||² − 2P_ss(0∆),

where ||P_i|| is the norm of the vector P_i. For the last equality we have used that Σ_{i,j} P(ij|s) = 1 and that the probability SNP 1 is in the same state in two random sick individuals is Σ_i (Σ_j P(ij|s))² = Σ_i p_1i² = ||P_1||². A similar computation shows that

P_ww(1∆) = ||P_1||² + ||P_2||² − 2P_ww(0∆).

Assuming that the number of sick individuals in the sample equals the number of well ones, we have P_H(0∆) = (1/2)(P_ss(0∆) + P_ww(0∆)) and

P_H(1∆) = (1/2)(P_ss(1∆) + P_ww(1∆)) = ||P_1||² + ||P_2||² − 2P_H(0∆).
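The identity P_H(1∆) = ||P_1||² + ||P_2||² − 2P_H(0∆) can be verified numerically. The sketch below does so for a hypothetical purely epistatic XOR penetrance table with both major-allele frequencies 0.5 (an illustrative model, not one of the paper's):

```python
from itertools import product

p = 0.5
P1 = P2 = (p * p, 2 * p * (1 - p), (1 - p) ** 2)
# Hypothetical XOR model: sick iff exactly one of the two relevant SNPs is
# heterozygous (states indexed 0, 1, 2 here).
f = {(i, j): 1.0 if (i == 1) != (j == 1) else 0.0
     for i, j in product(range(3), repeat=2)}
k = sum(P1[i] * P2[j] * f[i, j] for i, j in f)           # = 0.5 for this table

Ps = {(i, j): P1[i] * P2[j] * f[i, j] / k for i, j in f}             # P(ij|s)
Pw = {(i, j): P1[i] * P2[j] * (1 - f[i, j]) / (1 - k) for i, j in f} # P(ij|w)

def P_ndelta(dist, n):
    """P(exactly n of the two relevant SNPs change state) between two
    independent individuals whose genotypes both follow `dist`."""
    return sum(dist[a] * dist[b] for a in dist for b in dist
               if (a[0] != b[0]) + (a[1] != b[1]) == n)

P_H0 = 0.5 * (P_ndelta(Ps, 0) + P_ndelta(Pw, 0))   # hits: sick-sick, well-well
P_H1 = 0.5 * (P_ndelta(Ps, 1) + P_ndelta(Pw, 1))
norms = sum(x * x for x in P1) + sum(x * x for x in P2)   # ||P_1||^2 + ||P_2||^2
print(abs(P_H1 - (norms - 2 * P_H0)) < 1e-12)             # the identity holds
```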

Similar computations for misses give

P_M(0∆) = Σ_{i,j} P(ij|s) P(ij|w) and P_M(1∆) = ||P_1||² + ||P_2||² − 2P_M(0∆).
Computing P_M(2∆) directly, or using the fact that Σ_{n=0}^{2} P_M(n∆) = 1, gives

P_M(2∆) = 1 − ||P_1||² − ||P_2||² + P_M(0∆).

It now follows that

P_H(2∆) − P_M(2∆) = P_H(0∆) − P_M(0∆).     (1)

We note that the derivation of this requires that SNPs 1 and 2 be epistatic. We will see that the quantity P_H(0∆) − P_M(0∆) in equation (1) is a measure of how well SURF detects the relevant SNPs. It is a considerably more accurate measure of this than heritability.
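The measure P_H(0∆) − P_M(0∆) is straightforward to compute from any penetrance table. The sketch below evaluates it for a hypothetical XOR table with both major-allele frequencies 0.5 (an illustrative model, not one of the paper's); because the sick and well genotype sets are disjoint under this table, P_M(0∆) = 0 and the measure is large.

```python
from itertools import product

p = 0.5
P1 = P2 = (p * p, 2 * p * (1 - p), (1 - p) ** 2)
# Hypothetical XOR table: sick iff exactly one relevant SNP is heterozygous.
f = {(i, j): 1.0 if (i == 1) != (j == 1) else 0.0
     for i, j in product(range(3), repeat=2)}
k = sum(P1[i] * P2[j] * f[i, j] for i, j in f)

Ps = {g: P1[g[0]] * P2[g[1]] * f[g] / k for g in f}              # P(ij|s)
Pw = {g: P1[g[0]] * P2[g[1]] * (1 - f[g]) / (1 - k) for g in f}  # P(ij|w)

P_H0 = 0.5 * (sum(v * v for v in Ps.values()) +
              sum(v * v for v in Pw.values()))   # hit: same class, same genotype
P_M0 = sum(Ps[g] * Pw[g] for g in f)             # miss: one sick, one well
print(P_H0 - P_M0)                               # 0.28125 for this table
```

Larger values of this quantity correspond to penetrance functions whose relevant pair is easier to detect.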

SURF
Let SNP i be any SNP, relevant or not. The probability that it is in the same state, or matches, in two randomly chosen individuals is ||P_i||². We now assume that all N_I irrelevant SNPs have the same major allele frequency. Then the number of matching irrelevant SNPs in a random pair of individuals satisfies the binomial distribution corresponding to the Bernoulli trials process with N_I trials and probability of success ||P||², where P = P_3. By the Central Limit theorem we have, to very good approximation,

p(d) = P(d or more irrelevant SNPs match) ≈ ∫_d^∞ (1/(σ√(2π))) exp(−(t − µ)²/(2σ²)) dt,     (2)

where µ = N_I ||P||² and σ² = N_I ||P||²(1 − ||P||²). Now fix a random individual I_i. This partitions the sample set of individuals (with I_i removed) into six sets H_j∆ and M_j∆, where j = 0, 1 and 2. The set H_j∆ consists of all individuals in the same class as I_i with exactly j relevant SNPs in a different state from those of I_i. The M_j∆ are defined similarly, with j the number of state changes between misses. The cardinalities of these sets are

|H_j∆| = (N/2) P_H(j∆) and |M_j∆| = (N/2) P_M(j∆),

where N is the sample size. (These are really expected cardinalities, but we use them as approximations. We also neglect, for now, the fact that there is one fewer hit than miss.) Let b(n, p) be the random variable giving the number of successes in a Bernoulli trials process consisting of n trials with p the probability of success of each. The random variable b(|M_1∆|, p(d − 1)) then counts the number of individuals in the set M_1∆ with d − 1 or more irrelevant SNPs in the same state as those of the fixed individual I_i. This is the same as the number of individuals in this set having distance ≤ 1000 − d from I_i, where 1000 = N_I + 2 is the total number of SNPs. Similarly, b(|M_2∆|, p(d)) is the number in the set M_2∆ with d or more matching SNPs, and b(|M_0∆|, p(d − 2)) the number in the set M_0∆ with d − 2 or more such SNPs. The random variable

(1/2) b(|M_1∆|, p(d − 1)) + b(|M_2∆|, p(d))

then approximates part of individual I_i's contribution to the relief score of a relevant SNP; the factor 1/2 arises because a miss in M_1∆ differs from I_i in the given relevant SNP with probability about one half. This part comes from all misses within distance 1000 − d of I_i. The random variable

(1/2) b(|H_1∆|, p(d − 1)) + b(|H_2∆|, p(d))

gives the corresponding contribution from hits.
The total contribution of individual I_i to the (unnormalized) relief score of a relevant SNP, using those hits and misses with distance ≤ 1000 − d, is then

S^R_i = (1/2) b(|M_1∆|, p(d − 1)) + b(|M_2∆|, p(d)) − (1/2) b(|H_1∆|, p(d − 1)) − b(|H_2∆|, p(d)).

Up to the approximation given by the Central Limit Theorem, the probability density functions (or, briefly, the PDFs) of the random variables b(n, p) are Gaussians. Since convolution preserves Gaussians, the probability density function of S^R_i is also a Gaussian, and so is determined by its mean and variance.
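The Gaussian tail approximation behind p(d) in equation (2) is easy to check against the exact binomial tail. In the sketch below, N_I = 998 and a major-allele frequency of 0.4 for the irrelevant SNPs are assumed example values, not figures taken from the paper:

```python
import math

N_I = 998
p = 0.4
P = (p * p, 2 * p * (1 - p), (1 - p) ** 2)
q = sum(x * x for x in P)                  # ||P||^2: per-SNP match probability

def p_exact(d):
    """Exact binomial tail: P(at least d of N_I irrelevant SNPs match)."""
    return sum(math.comb(N_I, m) * q**m * (1 - q)**(N_I - m)
               for m in range(d, N_I + 1))

def p_normal(d):
    """Gaussian approximation to the same tail (with a continuity
    correction, an assumed refinement for this sketch)."""
    mu = N_I * q
    sigma = math.sqrt(N_I * q * (1 - q))
    return 0.5 * math.erfc((d - 0.5 - mu) / (sigma * math.sqrt(2)))

print(p_exact(384), p_normal(384))         # the two agree to within about 0.01
```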

The mean and variance of S^R_i

The mean of S^R_i is

M(S^R_i) = (N/2)[(1/2)(P_M(1∆) − P_H(1∆)) p(d − 1) + (P_M(2∆) − P_H(2∆)) p(d)] = (N/2)(P_H(0∆) − P_M(0∆))(p(d − 1) − p(d)),     (3)

where the second equality uses equation (1) together with P_M(1∆) − P_H(1∆) = 2(P_H(0∆) − P_M(0∆)). The variance is

V(S^R_i) = (1/4)(|M_1∆| + |H_1∆|) p(d − 1)(1 − p(d − 1)) + (|M_2∆| + |H_2∆|) p(d)(1 − p(d)),     (4)

since the variance of b(n, p) is np(1 − p) and the counts for hits and misses are independent. Regardless of the penetrance function, M(S^R_i) has maximum value when d is chosen so that p(d − 1) − p(d), the probability of exactly d − 1 matching irrelevant SNPs, is as large as possible, that is, when d − 1 is near N_I ||P||² = N_I (1 − P(∆)), where P(∆) = 1 − ||P||² is the probability that an irrelevant SNP has different states in two arbitrary individuals.

Next we consider the score of an arbitrary, irrelevant SNP, say SNP k. We continue to assume that I_i is a random, fixed individual, and M_j∆ and H_j∆ the associated partitioning sets. Let H_∆ be the subset of those individuals in the same class as I_i with the state of SNP k differing from that of I_i. Define a subset M_∆ of misses similarly. Individual I_i's contribution to the score of SNP k is given by the random variable

S^I_i = Σ_{j=0}^{2} [b(|M_∆ ∩ M_j∆|, p_1(d + j − 2)) − b(|H_∆ ∩ H_j∆|, p_1(d + j − 2))].

Here p_1(d) is defined just as p(d) was in equation (2), except with N_I − 1 in place of N_I. Since the states of an irrelevant SNP are independent of those of all other SNPs, we have |M_∆ ∩ M_j∆| = P(∆)|M_j∆| and |H_∆ ∩ H_j∆| = P(∆)|H_j∆|, so that

M(S^I_i) = P(∆)(N/2) Σ_{j=0}^{2} (P_M(j∆) − P_H(j∆)) p_1(d + j − 2).

The variance is

V(S^I_i) = P(∆)(N/2) Σ_{j=0}^{2} (P_M(j∆) + P_H(j∆)) p_1(d + j − 2)(1 − p_1(d + j − 2)).

Next we work towards finding the values of d which make SURF most effective, that is, most likely to assign higher scores to relevant SNPs than to irrelevant ones. Let S^R_{i,1} and S^R_{i,2} be individual I_i's contributions to the scores of the two relevant SNPs. A computable, but crude, measure of the effectiveness of SURF is the probability that both S^R_{i,1} and S^R_{i,2} are greater than individual I_i's contribution to the score of a random irrelevant SNP. This probability is

P(min{S^R_{i,1}, S^R_{i,2}} > S^I_i) = ∫ (1 − Φ_{R*}(x))² φ_{I*}(x) dx,

where φ_{I*} is the PDF of S^I_i, so a Gaussian with mean and variance as above, and Φ_{R*} is the CDF of S^R_i. Machine computation shows that P(min{S^R_{i,1}, S^R_{i,2}} > S^I_i) is largest at d = b and decreases on [b, N_I + 2], very slowly near b.
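A small Monte Carlo sketch can make the preceding analysis concrete: score SNPs SURF-style, counting every pair of individuals closer than a fixed threshold, and compare the scores of the relevant pair with those of irrelevant SNPs. Everything here (the XOR penetrance table, allele frequency 0.5, sample size, SNP count, threshold, and seed) is an illustrative assumption, not the paper's simulation design.

```python
import random

random.seed(1)
N, N_I = 400, 50               # individuals and irrelevant SNPs (assumed values)
NS = N_I + 2                   # SNPs 0 and 1 are the relevant epistatic pair

def genotype():
    # Hardy-Weinberg, allele frequency 0.5: states 0, 1, 2 w.p. 1/4, 1/2, 1/4.
    return random.choices((0, 1, 2), weights=(1, 2, 1))[0]

# Balanced sample under a pure XOR model: sick iff exactly one of the two
# relevant SNPs is heterozygous (so neither has a marginal single-SNP effect).
sick, well = [], []
while len(sick) < N // 2 or len(well) < N // 2:
    g = [genotype() for _ in range(NS)]
    bucket = sick if (g[0] == 1) != (g[1] == 1) else well
    if len(bucket) < N // 2:
        bucket.append(g)
people = [(g, 1) for g in sick] + [(g, 0) for g in well]

def surf_scores(people, threshold):
    """SURF-style scoring: for every pair of individuals whose genotype
    distance is below `threshold`, each differing SNP gains a point if the
    pair is a miss (different classes) and loses one if it is a hit."""
    w = [0] * NS
    for a in range(len(people)):
        ga, ca = people[a]
        for b in range(a + 1, len(people)):
            gb, cb = people[b]
            diff = [i for i in range(NS) if ga[i] != gb[i]]
            if len(diff) < threshold:
                sign = 1 if ca != cb else -1
                for i in diff:
                    w[i] += sign
    return w

# SURF uses the mean pairwise distance as its threshold; with per-SNP match
# probability 0.375 at allele frequency 0.5, that mean is NS * 0.625.
scores = surf_scores(people, NS * 0.625)
irrelevant = sorted(scores[2:])
median_irr = irrelevant[len(irrelevant) // 2]
print(scores[0] > median_irr, scores[1] > median_irr)
```

With this strong, noise-free epistatic signal the two relevant SNPs score well above the typical irrelevant SNP, even though neither has a marginal effect.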
An accurate and standard measure of the effectiveness of SURF, the success rate, is the probability that both relevant SNPs receive higher scores than every irrelevant SNP. This is difficult to compute directly since the scores of the N_I irrelevant SNPs are not independent. If φ_R and φ_I, the PDFs of Σ_i S^R_i and Σ_i S^I_i, respectively, are known, then the success rate can be computed via

∫ (1 − Φ_R(x))² d[(Φ_I(x))^{N_I}],     (7)

where Φ_R and Φ_I are the CDFs corresponding to φ_R and φ_I.

Relief using a single nearest neighbor

Next we analyze Relief. Let S be any of the sets H_j∆ or M_j∆ (or any subset of the set of individuals, provided the states of the irrelevant SNPs of members of S are randomly assigned according to the given allele frequencies). Let M(I_j, I_i), or just M_ij, denote the number of irrelevant SNPs of individuals I_i and I_j which match. We will find the probability density function of the random variable

Max_{|S|} = max_{I_j ∈ S} M(I_j, I_i),

where I_j ranges over S and I_i is fixed. Note that if we disregard relevant SNPs, then N_I − Max_{|S|} is the distance from I_i to its closest neighbor in S.
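Since the irrelevant-SNP match counts of distinct individuals are independent, the CDF of Max_{|S|} is the |S|-th power of the single-pair binomial CDF, and its probability mass function follows by differencing. A sketch with assumed example values (N_I = 100, |S| = 10, per-SNP match probability 0.375):

```python
import math

N_I, q, size = 100, 0.375, 10      # assumed example values

def binom_cdf(x, n, p):
    """P(Binomial(n, p) <= x)."""
    return sum(math.comb(n, m) * p**m * (1 - p)**(n - m) for m in range(x + 1))

def max_pmf(x):
    """P(Max_|S| = x): difference of the |S|-th power of the pair CDF."""
    return binom_cdf(x, N_I, q)**size - binom_cdf(x - 1, N_I, q)**size

total = sum(max_pmf(x) for x in range(N_I + 1))
mean_max = sum(x * max_pmf(x) for x in range(N_I + 1))
print(abs(total - 1.0) < 1e-9)     # the pmf sums to 1
print(mean_max > N_I * q)          # closest neighbor matches more than average
```

The expected maximum exceeds the single-pair mean N_I ||P||²; equivalently, the closest neighbor is closer than a typical one.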

Set

Φ(x) = P(M_ij ≤ x),     (8)

the probability that two arbitrary individuals I_i and I_j match in at most x irrelevant SNPs. Then, as with the probability p(d) in equation (2) above, we have the good approximation of Φ(x) by the CDF of a Gaussian with mean N_I ||P||² and variance N_I ||P||²(1 − ||P||²). Since the states of irrelevant SNPs among different individuals are independent,

P(Max_{|S|} ≤ x) = Φ(x)^{|S|}.     (9)

Next we define probabilities P_CH(n∆) and P_CM(n∆) for n = 0, 1 and 2. These are analogous to the P_H(n∆) and P_M(n∆) above, but use closest hits and misses rather than arbitrary ones. Specifically, P_CH(n∆) is the probability that exactly n relevant SNPs change state in going from an arbitrary individual to a (random) closest neighbor in the same class. Closest misses are used for P_CM(n∆). We express the P_CH(n∆) and P_CM(n∆) in a number of different forms. All rely on the fact that if I_j ∈ M_k∆, then the number of SNPs of I_i and I_j which match, both relevant and irrelevant, is M_ij + 2 − k. In general, P_CM(k∆), for k = 0, 1 or 2, is obtained by summing, over the possible match counts, the probability that the best total match among the misses is achieved by a member of M_k∆ rather than by a member of the other two sets (equation (10)). Changing each M to H gives analogous formulas for the P_CH(k∆). As a check, we note that Σ_k P_CM(k∆) = 1.

For the version of Relief in which only a single nearest neighbor is used, individual I_i's contribution to the score of a relevant SNP is given by the random variable

U^R_i = X_M − X_H,

where X_M is 1 if SNP 1 changes state in going from I_i to a (random) closest miss, and 0 otherwise, and X_H is defined analogously using a closest hit. Here P(X_M = 1) = P^R_CM(∆) = (1/2)P_CM(1∆) + P_CM(2∆), the probability that SNP 1 changes state in going from I_i to a (random) closest miss, and P(X_H = 1) = P^R_CH(∆) = (1/2)P_CH(1∆) + P_CH(2∆). The mean of U^R_i is

M(U^R_i) = P^R_CM(∆) − P^R_CH(∆),

and, treating the closest hit and closest miss as independent, the variance is

V(U^R_i) = P^R_CM(∆)(1 − P^R_CM(∆)) + P^R_CH(∆)(1 − P^R_CH(∆)).

For the score of an irrelevant SNP, say SNP k, we first define Φ_1 just as Φ was, but with N_I − 1 in place of N_I in equation (8). Let

F_M(x) = Σ_{n=0}^{2} P_M(n∆) Φ_1(x − 2 + n).

This is the probability that, given a random individual I_j ∈ M_∆, the number of SNPs, both relevant and irrelevant, of I_i and I_j which agree is ≤ x. Note that F_M(x − 1) is the same probability, but for I_j ∈ M_Σ, the set consisting of all misses with SNP k in the same state as that of individual I_i. The probability that SNP k of a random closest miss does not match SNP k of individual I_i is then obtained by comparing, via F_M and F_M(· − 1), the best total match over M_∆ with that over M_Σ. From now on we omit those expressions involving hits, which can be obtained from the analogous displayed ones by changing each M to H.

ReliefF with n nearest neighbors
Relief algorithms typically use 10 nearest neighbors, rather than just one. To analyze this, let P_CM(c_1∆, . . . , c_n∆) be the probability that a (random) miss closest to individual I_i is in the set M_{c_1}∆, a (random) second closest miss is in M_{c_2}∆, and so on, up to a random nth closest miss in M_{c_n}∆. For k = 0, 1 and 2, let n_k = Σ_{i=1}^{n−1} δ(k, c_i), the number of the c_1, . . . , c_{n−1} equal to k. Then, for each (c_1, . . . , c_n) ∈ {0, 1, 2}^n, P_CM(c_1∆, . . . , c_n∆) can be written in terms of the function R_{c_1 . . . c_n}(x) (equation (11)). Here R_{c_1 . . . c_n}(x) is the probability that, for a random n-tuple (I_1, . . . , I_n) of misses with I_j ∈ M_{c_j}∆ for each j, every I_j matches I_i in at most x SNPs; it can be given inductively. Again, as a check, we note that the probabilities P_CM(c_1∆, . . . , c_n∆) sum to 1. From these, the mean and variance of individual I_i's contribution T^R_i to the score of a relevant SNP follow as before. We remark that, as with SURF, the random variables T^R_i are not independent since, for instance, the relation "is the nearest neighbor of" among individuals tends to be symmetric. The discussion of irrelevant SNP scores parallels that of the relevant ones: using analogous notation with an I, for irrelevant, appended, we have a corresponding expression for each (c_1, . . . , c_n) ∈ {0, 1}^n.

Examples
We conclude by using the ideas developed in this appendix to compare SURF with ReliefF using 10 nearest neighbors. We assume a sample size of 1600 and use the .1 heritability penetrance functions of models 19 and 17. For the relevant SNPs of ReliefF we use equation (11); for SURF we use equations (3) and (4), together with the approximations of setting M(S^I_i) = 0 and V(S^R_i) = V(S^I_i). We also set d = 384, which maximizes M(S^R_i), as discussed just after equation (3).
We would now like to use equation (7), and its analog for ReliefF, to compute the success rates of the two methods, but we do not know the PDFs φ_I and φ_R implicit in this equation. So we approximate these by assuming that the random variables S^I_i, as well as the S^R_i, are independent and using the Central Limit theorem. The PDFs of Σ_i T^I_i and Σ_i T^R_i are approximated similarly. Equation (7) then slightly overestimates the success rates since, as mentioned, the S^I_i and the S^R_i, as well as the T^I_i and T^R_i, can be somewhat correlated. Machine computation using equation (7) and the approximate PDFs gives the success rates shown in Figure 1.
The quantity given by equation (1) for model 17 is .02502, which is about average for the five penetrance functions with heritability .1 used in our simulations. So the figure involving model 17 is representative of the success rates of the two methods. For model 19, equation (1) gives .05446, the highest value of the five. Figure 2 for model 19 is included to show how much success rates depend on penetrance functions, even those having the same heritability.
The number of nearest neighbors used by SURF in our simulations is, on average, one fourth the sample size. Since SURF outperforms Relief, one wonders whether Relief algorithms could be improved by using many more than the usual 10 nearest neighbors for detecting epistatic pairs. Our preliminary work here looks very promising.