DA DA: DegreeAware Algorithms for NetworkBased Disease Gene Prioritization
 Sinan Erten^{1}Email author,
 Gurkan Bebek^{2, 3, 4},
 Rob M Ewing^{2, 3, 5} and
 Mehmet Koyutürk^{1, 2, 3}
DOI: 10.1186/17560381419
© Erten et al; licensee BioMed Central Ltd. 2011
Received: 2 September 2010
Accepted: 24 June 2011
Published: 24 June 2011
Abstract
Background
Highthroughput molecular interaction data have been used effectively to prioritize candidate genes that are linked to a disease, based on the observation that the products of genes associated with similar diseases are likely to interact with each other heavily in a network of proteinprotein interactions (PPIs). An important challenge for these applications, however, is the incomplete and noisy nature of PPI data. Information flow based methods alleviate these problems to a certain extent, by considering indirect interactions and multiplicity of paths.
Results
We demonstrate that existing methods are likely to favor highly connected genes, making prioritization sensitive to the skewed degree distribution of PPI networks, as well as ascertainment bias in available interaction and disease association data. Motivated by this observation, we propose several statistical adjustment methods to account for the degree distribution of known disease and candidate genes, using a PPI network with associated confidence scores for interactions. We show that the proposed methods can detect loosely connected disease genes that are missed by existing approaches, however, this improvement might come at the price of more false negatives for highly connected genes. Consequently, we develop a suite called DA DA, which includes different uniform prioritization methods that effectively integrate existing approaches with the proposed statistical adjustment strategies. Comprehensive experimental results on the Online Mendelian Inheritance in Man (OMIM) database show that DA DA outperforms existing methods in prioritizing candidate disease genes.
Conclusions
These results demonstrate the importance of employing accurate statistical models and associated adjustment methods in networkbased disease gene prioritization, as well as other networkbased functional inference applications. DA DA is implemented in Matlab and is freely available at http://compbio.case.edu/dada/.
Introduction
Identification of diseaseassociated genes is an important step toward enhancing our understanding of the cellular mechanisms that drive human diseases, with profound applications in modeling, diagnosis, prognosis, and therapeutic intervention [1]. Genomewide linkage and association studies (GWAS) in healthy and affected populations identify chromosomal regions potentially containing hundreds of candidate genes possibly associated with genetic diseases [2]. Investigation of these candidates using experimental methods is an expensive task, thus not always a feasible option. Consequently, computational methods play an important role in prioritization and identification of the most likely diseaseassociated genes by utilizing a variety of data sources such as gene expression [3, 4], functional annotations [4–7], and proteinprotein interactions (PPIs) [3, 8–14]. The scope of methods that rely on functional annotations is limited because only a small fraction of genes in the genome are currently annotated.
In recent years, several algorithms have been proposed to incorporate topological properties of PPI networks in understanding genetic diseases [3, 8, 13]. These algorithms mostly focus on prioritization of candidate genes and mainly exploit the notion that the products of genes associated with similar diseases have a higher chance of being connected in the network of PPIs. However, an important challenge for these applications is the incomplete and noisy nature of the PPI data [15]. Missing interactions and false positives affect the accuracy of methods based on local network information such as direct interactions and shortest distances. Few global methods based on simulation of information flow in the network (e.g., random walks [8, 13] or network propagation [14]) get around this problem to a certain extent by considering multiple alternate paths and whole topology of PPI networks. Nevertheless, as we demonstrate in this paper, these methods favor genes whose products are highly connected in the network and perform poorly in identifying loosely connected disease genes.
In this study, we propose novel statistical adjustment methods to correct for degree bias in information flow based disease gene prioritization. These methods aim to assess the statistical significance of the network connectivity of a candidate gene to known disease genes. For this purpose, we use three reference models that take into account the degree distribution of the PPI network: (i) reference model based on degree distribution of known disease gene products, (ii) reference model based on the degree of candidate gene products, and (iii) likelihood ratio test using eigenvector centrality as the reference model.
We present comprehensive experimental results demonstrating that the proposed statistical adjustment methods are very effective in detecting loosely connected disease genes which are generally less studied, thus potentially more interesting in terms of generating novel biological knowledge. However, we observe that these methods might perform less favorably in identifying highly connected disease genes. Consequently, we develop three uniform prioritization methods that effectively integrate existing algorithms with the proposed statistical adjustment methods, with a view of delivering high accuracy irrespective of the network connectivity of target disease genes. These methods choose the measure to rank candidate genes (raw scores vs. statistically adjusted scores), based on several criteria that take into account the network degree of candidates. Finally, we present comprehensive experimental results in the Results section. These results show that the resulting prioritization methods, implemented in Matlab as a suite called DA DA, outperform existing approaches in identifying diseaseassociated genes.
Background
In this section, we introduce the networkbased disease gene prioritization problem in a formal framework. We then discuss two information flow based disease gene prioritization algorithms that represent the state of the art in networkbased disease gene prioritization. These algorithms are random walk with restarts and network propagation. Next, we present detailed experimental results to demonstrate the limitations of these algorithms in the context of the degree distribution of PPI networks and ascertainment bias in interaction data. Based on these observations, we motivate our approach of applying statistical adjustment to the scores computed by these algorithms.
Networkbased candidate disease gene prioritization
There exists a wide range of disease gene prioritization methods that are based on the analysis of the topological properties of PPI networks. These methods commonly rely on the observation that the products of genes that are associated with similar diseases have a higher likelihood of physically interacting [11]. It is important to note here the distinction between genes and their products. Genomewide association studies focus on identifying genes that are associated with a disease of interest. Networkbased prioritization aims to aid this effort by inferring functional associations between genes based on the interactions among their products, i.e., proteins. For this reason, any reference to interactions between genes in this paper refers to the interactions between their products.
Existing methods for networkbased disease gene prioritization can be classified into two main categories; (i) localized methods, i.e., methods based on direct interactions and shortest paths between known disease genes and candidate genes [3, 9, 16], (ii) global methods, i.e., methods that model the information flow in the cell to assess the proximity and connectivity between known disease genes and candidate genes. Several studies show that global approaches, such as random walk and network propagation, clearly outperform local approaches [13, 14, 17]. For this reason, we focus on global methods in this paper.
For a given disease of interest D, the input to the disease gene prioritization problem consists of two sets of genes, the seed set and the candidate set . The seed set specifies prior knowledge on the disease, i.e., it is the set of genes known to be associated with D and diseases similar to D. Each gene is also associated with a similarity score σ(v, D), indicating the known degree of association between v and D. The similarity score for gene v is computed as the maximum phenotypic similarity between D and any other disease associated with v, based on clinical description of diseases (a detailed discussion on computation of phenotypic similarity scores can be found in the Methods section below). The candidate set specifies the genes, one or more of which are potentially associated with disease D (e.g., these genes might lie within a linkage interval that is identified by association studies). The overall objective of network based disease prioritization is to use a human PPI network , to compute a score α(v, D) for each gene , such that α(v, D) represents the likelihood of v to be associated with D.
The PPI network consists of a set of gene products and a set of undirected interactions between these gene products, where represents an interaction between and . Since PPI data might be obtained from various resources, interactions are often assigned confidence scores indicating their reliability. In other words, the network is also associated with a function , such that w(uv) indicates the reliability of interaction . Finally, the set of interacting partners of a gene product is defined as and the total reliability of known interactions of v is defined as which we refer as weighted degree throughout this paper. Global prioritization methods use this network information to compute α by propagating σ over . Candidate proteins are then ranked according to α and novel genes that are potentially associated with the disease of interest are identified based on this ranking.
Random walk with restarts
This method simulates a random walk on the network to compute the proximity between two nodes by exploiting the global structure of the network [18, 19]. It is used in a wide range of applications, including identification of functional modules [20] and modeling the evolution of social networks [21]. Recently, random walk with restarts has also been applied to candidate disease gene prioritization [8, 13].
Here, ρ denotes the restart vector with for and 0 otherwise. P _{RW} denotes the stochastic matrix derived from , i.e., P _{RW}(u, v) = w(uv)/W(v) for and 0 otherwise. For each , x _{ t } (v) denotes the probability that the random walk will be at v at time t, where x _{0} = ρ. For each gene v, the resulting randomwalk based association score is defined as α _{RW}(v, D) = lim _{ t→∞ } x _{ t } (v). The elements in the resulting vector α represent the proximity of each protein to the proteins in the seed set.
Network propagation
Propagation based models have been previously shown to be effective in network based functional annotation of proteins [22]. In recent work, Vanunu et al. [14] propose a network propagation algorithm to compute the association between candidate proteins and known disease genes. They define a prioritization function which models an information pump that originates at the seed sets. This idea is very similar to that of random walk with restarts, with one key difference. Namely, in network propagation, the flow of information is normalized by not only the total outgoing flow from each node, but also the total incoming flow into each node.
Here, the propagation matrix P _{NP} is computed as for , 0 otherwise. For each , y _{ t } (v) denotes the amount of disease association information at node v at step t, where y _{0} = ρ. For each gene v, the resulting network propagation based association score is defined as α _{NP}(v, D) = lim _{ t→∞ } y _{ t } (v). In this model, 0 ≤ r ≤ 1 is also a userdefined parameter that is used to adjust the relative importance of prior knowledge and network topology.
Role of network degree
The dependency of performance on network degree can be understood by carefully inspecting the formulation of random walk and network propagation models. Random walk with restarts is actually a generalization of Google's wellknown pagerank algorithm [23]. Indeed, for r = 0, α is solely a measure of network centrality. Therefore, for any r > 0, α(v, D) contains a component that represents the network centrality of v, in addition to its association with D. Network propagation alleviates this problem by normalizing the incoming flow into a gene, therefore provides a slightly more balanced performance compared to random walk with restarts. However, as evident in the figure, its performance is still influenced heavily by node degrees. Motivated by these insights, we argue that the association scores computed by these algorithms have to be statistically adjusted with respect to reference models that take into account the degree distribution of the network.
Methods
In this section, we propose several reference models for assessing the significance of networkbased disease association scores. Subsequently, we discuss how these models can be used in conjunction with existing methods to obtain uniform prioritization methods that can deliver high accuracy regardless of the centrality of candidate genes. Next, we introduce the disease and PPI datasets and the details of the experimental settings used.
Reference Models for Statistical Adjustment
Here, we consider three different reference models for assessing the significance of disease association scores obtained by an information flow based prioritization algorithm: (i) a model that compares the association scores of candidate genes computed using the original seed set with an empirical distribution that is obtained by using randomly generated seed sets that match the degree distribution of the original seed set, (ii) a model that compares the association scores of candidate genes to an empirical distribution that is derived from other genes with similar degree in the network, using a fixed seed set, (iii) a model that assesses the likelihoodratio of the association of a gene with the seed set with respect to its network centrality. Here, for the sake of clarity, we describe each model assuming that random walk based restarts is used to compute raw association scores, however, the methods are directly applicable to network propagation as well (we also drop the subscript RW from our notation for simplicity).
Reference model based on seed degrees
The objective here is to generate a reference model that captures the degree distribution of seed proteins accurately. To this end, we compare the association score α(v, D) for each protein with scores computed using random seed sets (by preserving the degree distribution of the seed genes). The expectation here is that false positives that correspond to centralized and highly connected proteins will have high association scores even with respect to these randomly generated seed sets. Such an observation implies that their observed association with the actual seed set is most likely artificial, thus the statistical significance of their association score is low. On the other hand, loosely connected proteins that lie close to the seed proteins will be assigned high significance although their association scores are generally lower than that of highly connected proteins.
Given a disease D, seed set , and candidate set , this reference model is implemented as follows:

We first compute networkbased association scores α(v, D) for the original seed set , using the procedure described by Equation 1.

Then, based on the original seed set , we generate a random instance that represents in terms of weighted degree distribution. is generated as follows:

Then, each protein is assigned to bucket if W (v) − W (u) < W (v) − W (u') for all , where ties are broken randomly.

Subsequently, is generated by choosing a protein from each bucket uniformly at random. It can be observed that each protein in is represented by exactly one protein in , thus the total weighted degree of proteins in is expected to be very close to that of .

For 1 ≤ i ≤ n, the association scores α ^{(i)}for seed set are computed using Equation 1. Here, n is a sufficiently large number that is used to obtain a representative sampling {α ^{(1)}, α ^{(2)}, α ^{(3)}, ..., α ^{(n)}} of the population of association scores for seed sets that match the size and degree distribution of (we use = 1000 in our experiments).

We then estimate the mean of this distribution as and the standard deviation as .
Reference model based on candidate degree
This reference model aims to assess the statistical significance of the association score of a protein v with respect to a population of association scores that belong to proteins with weighted degree similar to that of v. For protein v, if other proteins with similar weighted degrees usually have high association scores for a given seed set, the statistical significance of the association score of v can be considered low. On the other hand, loosely connected proteins are compared to other loosely connected proteins, which potentially have lower association scores, thus the artificial advantage of the highly connected proteins is removed.
This reference model is generated as follows:

First, we compute the networkbased association vector α with respect to the given seed set , again using Equation 1.

Then, for each candidate gene , we select the n genes in the network with smallest W (v) − W (u) to create a representative set that contains the n genes most similar to v in terms of their weighted degree (n = 1000 in our experiments).

Subsequently, for each gene , we estimate the mean association score of its representative population as and the standard deviation of association scores as .

Finally, we compute the adjusted score of each candidate gene v as .
Likelihoodratio test using eigenvector centrality
Here, we assess the association of a gene with the seed set using a likelihoodratio test. More precisely, considering α _{RW}(v, D) as the likelihood of v being associated with the seed set for disease D, we compare this likelihood with the likelihood of v being associated with any other gene product in the network. To compute the likelihood of v's association with any other gene in the network, we use eigenvector centrality [23], which is precisely equal to the random walk based association score of v for zero restart probability (r = 0).
Uniform Prioritization
As we demonstrate in the next section, the statistical adjustment strategies presented above improve the performance of global prioritization algorithms in identifying loosely connected disease genes. However, this comes at the price of increased number of false negatives for highly connected disease genes. Motivated by this observation, we propose several uniform scoring strategies that aim to take advantage of both raw and statistically adjusted scores. The idea here is to derive a uniform prioritization method that uses the adjusted scores for loosely connected candidate genes, while using the raw scores for highly connected candidate genes.
For this purpose, we first sort the raw crosstalk scores (α _{RW} or α _{NP} ) of candidate genes in descending order. Let R _{RAW}(v) denote the rank of gene in this ordering. Clearly, for , R _{RAW}(v) < R _{RAW}(u) indicates v is more likely to be associated with the disease than u is. Similarly, we sort the statistically adjusted scores (α _{SD}, α _{CD}, or α _{EC}) in descending order, to obtain a rank R _{ADJ}(v) for each gene .
We propose three alternate strategies for merging these two rankings to obtain a uniform ranking R _{UNI}, where the objective is to have R _{UNI}(v) < R _{UNI}(u) if gene v is associated with the disease, while gene u is not. Once R _{UNI}(v) is obtained using one of the following methods, we map it into the interval in the obvious way.
Uniform prioritization based on the degree of candidate gene
for each . Thus the ranking of a highly connected gene is based on its raw association score, while that of a loosely connected gene is based on the statistical significance of its association score. Note that, with respect to this definition, the ranking of two genes can be identical, but there cannot be more than two genes with identical ranking. In the case of a tie, the tie is broken based on the unused ranking of each gene.
Optimistic uniform prioritization
for each . Again, ties are broken based on the unused rankings.
Uniform prioritization based on degree of known disease genes
Based on the notion that some diseases are studied more in detail compared to other diseases, we expect the degrees of genes associated with similar diseases to be somewhat close to each other. Statistical tests on disease associations currently available in the OMIM (Online Mendelian Inheritance in Man) database confirms this expectation (data not shown). We take advantage of this observation to approximate the network degree of the unknown disease gene in terms of the weighted degrees of the known disease genes. This enables having a global criterion for choosing the preferred ranking for all genes, as opposed to the genespecific (or "local") criteria described above.
For a given seed set , we first compute . Subsequently, if (where λ is defined as above), we set for all , otherwise, we set . Observe that, this approach is global, i.e. rankings used do not depend on the weighted degree of each candidate gene, but the average weighted degree of the seed set. On the other hand, the other two uniform prioritization methods presented earlier are local, i.e. ranking used for each candidate gene depends on the gene being considered.
Datasets
We test and compare the proposed methods on a comprehensive set of disease association data, using an integrated human PPI network in which interactions are associated with reliability scores. We describe these datasets in detail below.
Disease Association and Phenotypic Similarity Data
We obtain disease information from the Online Mendelian Inheritance in Man (OMIM) database. OMIM provides a publicly accessible and comprehensive database of genotypephenotype relationship in humans. We map genes associated with diseases to our PPI network and remove those diseases for which we are unable to map more than two associated genes. After this step, we have a total of 206 disease families with at least 3 associated genes. Number of genes associated with these diseases ranges from 3 to 36, with the average number of associations for each disease being approximately 6.
As mentioned previously, proteins associated with similar diseases or phenotypes lie in close proximity in the PPI network. This brings the idea of utilizing disease similarity information in to identification of disease genes [3, 14]. Driel et al. [24] incorporate disease similarity information using a text mining algorithm that can be summarized as follows:

First, each OMIM record is parsed and the keywords are searched for existence in the anatomy (A) and the disease (C) sections of the Medical Subject Headings Vocabulary (MeSH), which is a controlled vocabulary of U.S. National Library of Medicine. MeSH is especially useful for applications that use information that contains different terminology for identical concepts.

Each OMIM record is then represented by binary vectors where each entry of the vector corresponds to the existence of a term in that record.

Similarity of two diseases is then computed by calculating the cosine of the angle between their representative vectors.
After these calculations, a similarity score between 0 and 1 for every pair of OMIM diseases is available. This disease similarity information is used to compute the prior association between each gene and the disease of interest as follows: Let d _{ c } denote the disease of interest, θ = {d _{1}, d _{2}, d _{3}, ..., d _{ t } } represent the set of other diseases for which information is available and ϕ(d _{ i } , d _{ j } ) denote the similarity between diseases d _{ i } and d _{ j } . Also, let S _{ i } denote the set of genes associated with disease d _{ i } . Note that, these sets are not always disjoint, i.e. there are some genes that are associated with multiple diseases.
For the disease of interest d _{ c } , these methods consider other diseases d _{ i } in θ such that ϕ(d _{ c } , d _{ i } ) > γ where γ is a userdefined threshold that is used to decide which diseases are considered similar. We follow [14] and utilize a logistic function to represent these similarity scores based on the empirical findings related to setting of γ in [24]. Now since vector σ(v, D) in Equations 1 and 2 represent the prior information on the association between each gene and the disease d _{ c } , σ(v, D) can be set to the similarity of the disease that involves gene v and the disease of interest d _{ c } .
Note that, if a gene is associated with more than one disease, its prior association score with d _{ c } is set to the maximum of the similarity scores of d _{ c } and the diseases it is associated with.
ProteinProtein Interaction (PPI) Data
In our experiments, we use the human PPI data obtained from NCBI Entrez Gene Database [25]. This database integrates interaction data from several other databases available, such as HPRD, BioGrid, and BIND. After the removal of nodes with no interactions, the final PPI network contains 8959 proteins and 33528 distinct interactions among these proteins. Using a logistic regression model, we also assign reliability scores to each of these PPIs [26, 27]. In this logistic regression model, we incorporate (i) the Pearson correlation of expression measurements across a range of different tissues and conditions for the corresponding coding genes (denoted X _{1}(uv) for proteins u and v), (ii) the proteins' small world clustering coefficient (denoted X _{2}(uv)) [28], and (iii) the proteins' subcellular localization data (denoted X _{3}(uv)) [29, 30]. For correlation of gene expression, the expression profiles of normal human tissues measured in the Human Body Index Transcriptional Profiling are used (GSE7307) [31]. In total, 213 normal samples are processed, representing over 90 distinct tissue types and a global expression correlation is acquired for each pair of interacting partners. The protein subcellular localization data is used to eliminate interactions that are not biologically relevant, based on the expectation that proteins that are not colocalized are not likely to interact with each other. Although proteins travel in the cell and can coexist in multiple compartments, incorporation of subcellular localization data helps eliminate many false negative interactions.
Once these statistics are obtained for each pair of proteins, we compute the reliability of interaction as the probability of a true interaction between u and v given X (uv) = (X _{1}(uv), X _{2}(uv), X _{3}(uv)), under the logistic distribution. Namely, we define , where I(uv) is the indicator random variable representing the existence of a true interaction between u and v. The parameters β _{0}, β _{1}, β _{2}, and β _{3} are optimized to maximize the likelihood of a true interaction using training data that includes true positive, as well as true negative interactions. For this purose, we randomly select 1000 PPIs from the MIPS [32] database of interactions, an accepted gold standard for true positive interactions. The negative training set is acquired from randomly selected PPIs that are reported in Negatome, a database of proteins that are known not to interact with each other [33]. These experiments that involve random true positive and true negative interactions are repeated 1000 times and optimal values for all parameters are determined. Finally, the probability of each interaction, i.e., its reliability score, is calculated using these parameters.
Experimental Setting
In order to evaluate the performance of different methods in terms of accurately prioritizing diseaseassociated genes, we apply leaveoneout crossvalidation. For each gene that is associated with a disease, we conduct the following experiment:

We remove that gene from the set of genes associated with the disease. We call the gene that is removed the target gene for that experiment. The remaining genes associated with the disease compose the seed set .

We generate an artificial linkage interval, containing the target gene with other 99 genes located nearest in terms of genomic distance. The genes in this artificial linkage interval (including the target gene) compose the candidate set . Note that, according to our experiments, the size of candidate set does not have a significant effect on the performance difference between different methods as long as it is greater than 20 (data not shown).

Using each of the methods described in the previous section, we obtain a ranking of candidate genes and use this ranking to predict disease genes.
In order to systematically compare the performance of different methods, we use the following evaluation criteria:
ROC curves
We plot ROC curves, i.e., sensitivity vs. 1specificity, by thresholding the rank to be considered a "predicted disease gene" from 1 to 100. Sensitivity (recall) is defined as the percentage of true disease genes that are ranked above the particular threshold, whereas specificity is defined as the percentage of all genes that are ranked below the threshold. The area under ROC curve (AUC) is used as a measure to assess the performance of different methods. Note that, AUC is a conservative measure for this experimental setup since there exists only one true positive (the target gene) for each experiment. For this reason, we also use other performance criteria that take into account the rank of the target gene.
Average rank
This is the average rank of the target gene among all candidate genes, computed across all diseasegene pairs in a total of ten executions of the experiments. Clearly, a lower number for average rank indicates better performance.
Percentage of the disease genes ranked in top 1% and 5%
Percentages of true disease genes that are ranked as one of the genes in the top 1% (practically, the top gene) and also in the top 5% among all candidates are also reported.
Results and Discussion
In this section, we comprehensively evaluate the performance of the methods presented in the previous section. We start our discussion by investigating the effect of selection of the restart probability in random walk with restarts and network propagation algorithms. We then discuss in detail the effect of using the statistical adjustment methods. Next, we evaluate the performance of proposed uniform prioritization strategies and show that our final uniform method outperforms existing methods for networkbased disease gene prioritization.
Selection of Restart Probability
The restart probability represented by r in Equations 1 and 2 is a parameter that is used to adjust the preference between the importance of a protein with respect to the seed set and network topology. The effect of the selection of the restart probability is minor unless a very small value is used [34]. The performance degrades significantly for the values r ≤ 0.01, which is expected because the effect of the seed genes is minimized in that case, thus the proximity of a protein is calculated based primarily on the centrality of that protein in the network. We observe that r = 0.3 is optimal for the performance of both random walk with restarts and network propagation algorithms, after running the algorithms with small increments of r[34]. Thus, r is set to 0.3 in all experiments presented in this paper, for all algorithms.
Performance of Statistical Adjustment Methods
The effect of statistical adjustment on performance.
All Genes  Degrees ≤ m  Degrees >m  

Method  Avg. Rank  AUROC  Avg. Rank  AUROC  Avg. Rank  AUROC 
Network Propagation  22.75  0.77  30.00  0.70  15.57  0.84 
Random walk w/restarts  23.75  0.76  34.76  0.65  12.82  0.88 
Based on seed degree  21.42  0.78  21.51  0.79  21.35  0.79 
Based on candidate degree  19.50  0.81  21.20  0.79  17.82  0.82 
Based on centrality  20.40  0.80  24.35  0.75  16.47  0.84 
Performance of Uniform Prioritization
Here, we systematically investigate the performance of the proposed uniform prioritization methods, by considering the combination of each of these methods with each of the three statistical adjustment methods (a total of nine combinations) applied in conjunction with random walk with restarts. In these experiments, the degree threshold λ is set to the average weighted degree of all genes. For convenience, we refer to each uniform prioritization method using the corresponding ranking symbol introduced in the previous section .
Performance of all combinations of uniform prioritization methods.
Comparison of the proposed method with existing approaches.
METHOD  Avg. Rank  AUROC  Perc. Ranked in top 1%  Perc. Ranked in top 5% 

DA DA  21.52  0.79  29.41  53.57 
Network propagation  22.75  0.77  28.18  51.52 
Random walk w/restarts  23.75  0.76  29.25  51.76 
Case Example
Conclusions
In this paper, we have shown that approaches based on global network properties in prioritizing diseaseassociated genes are highly biased by the degree of the candidate gene, thus perform poorly in detecting loosely connected disease genes. We proposed several statistical adjustment strategies that improve the performance, particularly in identifying loosely connected disease genes. We have shown that, when these adjustment methods are used together with existing global methods, the resulting method outperforms existing approaches significantly. These results clearly demonstrate that, in order to avoid exacerbation of ascertainment bias and propagation of noise, networkbased biological inference methods have to be supported by statistical models that take into account the degree distribution. DA DA is freely available for download at http://compbio.case.edu/dada/.
Declarations
Acknowledgements
We would like to thank the reviewers for many useful suggestions. This work is supported, in part, by the National Science Foundation (NSF) Award CCF0953195. This work is also supported, in part, by a Choose Ohio First Scholarship to SE by the State of Ohio.
Authors’ Affiliations
References
 Brunner HG, van Driel MA: From syndrome families to functional genomics. Nat Rev Genet. 2004, 5 (7): 545551.View ArticlePubMedGoogle Scholar
 Glazier AM, Nadeau JH, Aitman TJ: Finding Genes That Underlie Complex Traits. Science. 2002, 298 (5602): 23452349. 10.1126/science.1076641. [http://www.sciencemag.org/cgi/content/abstract/298/5602/2345]View ArticlePubMedGoogle Scholar
 Lage K, Karlberg E, Storling Z, Olason P, Pedersen A, Rigina O, Hinsby A, Tumer Z, Pociot F, Tommerup N, Moreau Y, Brunak S: A human phenomeinteractome network of protein complexes implicated in genetic disorders. Nat Bio. 2007, 25 (3): 309316. 10.1038/nbt1295.View ArticleGoogle Scholar
 Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, Hassan B, Carmeliet P, Moreau Y: Gene prioritization through genomic data fusion. Nat Biotech. 2006, 24 (5): 537544. 10.1038/nbt1203.View ArticleGoogle Scholar
 Adie E, Adams R, Evans K, Porteous D, Pickard B: SUSPECTS: enabling fast and effective prioritization of positional candidates. Bioinformatics. 2006, 22 (6): 773774. 10.1093/bioinformatics/btk031. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/22/6/773]View ArticlePubMedGoogle Scholar
 Turner F, Clutterbuck D, Semple C: POCUS: mining genomic sequence annotation to predict disease genes. Genome Biology. 2003, 4 (11): R7510.1186/gb2003411r75. [http://genomebiology.com/2003/4/11/R75]View ArticlePubMedPubMed CentralGoogle Scholar
 Masotti D, Nardini C, Rossi S, Bonora E, Romeo G, Volinia S, Benini L: TOM: enhancement and extension of a tool suite for in silico approaches to multigenic hereditary disorders. Bioinformatics. 2008, 24 (3): 428429. 10.1093/bioinformatics/btm588. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/3/428]View ArticlePubMedGoogle Scholar
 Chen J, Aronow B, Jegga A: Disease candidate gene identification and prioritization using protein interaction networks. BMC Bioinformatics. 2009, 10: 7310.1186/147121051073. [http://www.biomedcentral.com/14712105/10/73]View ArticlePubMedPubMed CentralGoogle Scholar
 Oti M, Snel B, Huynen MA, Brunner HG: Predicting disease genes using proteinprotein interactions. J Med Genet. 2006, jmg.2006.041376, [http://jmg.bmj.com/cgi/content/abstract/jmg.2006.041376v1]Google Scholar
 Oti M, Brunner HG: The modular nature of genetic diseases. Clinical Genetics. 2007, 71: 111.View ArticlePubMedGoogle Scholar
 Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási ALAA: The human disease network. PNAS. 2007, 104 (21): 86858690. 10.1073/pnas.0701361104.View ArticlePubMedPubMed CentralGoogle Scholar
 Ideker T, Sharan R: Protein networks in disease. Genome research. 2008, 18 (4): 644652. 10.1101/gr.071852.107.View ArticlePubMedPubMed CentralGoogle Scholar
 Köhler S, Bauer S, Horn D, Robinson PN: Walking the Interactome for Prioritization of Candidate Disease Genes. Am J Hum Genet. 2008, 82 (4): 949958. 10.1016/j.ajhg.2008.02.013. [http://linkinghub.elsevier.com/retrieve/pii/S0002929708001729]View ArticlePubMedPubMed CentralGoogle Scholar
 Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R: Associating genes and protein complexes with disease via network propagation. PLoS computational biology. 2010, 6: e1000641+View ArticlePubMedPubMed CentralGoogle Scholar
 Edwards AM, Kus B, Jansen R, Greenbaum D, Greenblatt J, Gerstein M: Bridging structural biology and genomics: assessing protein interaction data with known complexes. Trends in Genetics. 2002, 18 (10): 529536. 10.1016/S01689525(02)027634.View ArticlePubMedGoogle Scholar
 George RA, Liu JY, Feng LL, BrysonRichardson RJ, Fatkin D, Wouters MA: Analysis of protein sequence and interaction data for candidate disease gene prediction. Nucl Acids Res. 2006, 34 (19): e13010.1093/nar/gkl707. [http://nar.oxfordjournals.org/cgi/content/abstract/34/19/e130]View ArticlePubMedPubMed CentralGoogle Scholar
 Navlakha S, Kingsford C: The power of protein interaction networks for associating genes with diseases. Bioinformatics. 2010, 26 (8): 10571063. 10.1093/bioinformatics/btq076. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/26/8/1057]View ArticlePubMedPubMed CentralGoogle Scholar
 Lovász L: Random Walks on Graphs: A Survey. Combinatorics, Paul Erdos is Eighty. 1996, 2: 353398. doi: 10.1.1.39.2847.Google Scholar
 Tong H, Faloutsos C, Pan JY: Random walk with restart: fast solutions and applications. Knowledge and Information Systems. 2008, 14 (3): 327346. 10.1007/s1011500700942.View ArticleGoogle Scholar
 Macropol K, Can T, Singh A: RRW: repeated random walks on genomescale protein networks for local cluster discovery. BMC Bioinformatics. 2009, 10: 28310.1186/1471210510283. [http://www.biomedcentral.com/14712105/10/283]View ArticlePubMedPubMed CentralGoogle Scholar
 Tong H, Faloutsos C: Centerpiece subgraphs: problem definition and fast solutions. KDD '06: Proceedings of the 12th ACM SIGKDD. 2006, NY, USA: ACM, 404413.View ArticleGoogle Scholar
 Nabieva E, Jim K, Agarwal A, Chazelle B, Singh M: Wholeproteome prediction of protein function via graphtheoretic analysis of interaction maps. Bioinf. 2005, 21: i302310. 10.1093/bioinformatics/bti1054. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/21/suppl_1/i302]View ArticleGoogle Scholar
 Brin S, Page L: The Anatomy of a LargeScale Hypertextual Web Search Engine. Computer Networks and ISDN Systems. 1998, 30: 107117. 10.1016/S01697552(98)00110X. [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.109.4049]View ArticleGoogle Scholar
 van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A textmining analysis of the human phenome. EJHG. 2006, 14 (5): 535542. 10.1038/sj.ejhg.5201585.View ArticlePubMedGoogle Scholar
 Maglott D, Ostell J, Pruitt KD, Tatusova T: Entrez Gene: genecentered information at NCBI. Nucl Acids Res. 2007, 35 (suppl1): D2631. [http://nar.oxfordjournals.org/cgi/content/abstract/35/suppl_1/D26]View ArticlePubMedGoogle Scholar
 Bebek G, Yang J: PathFinder: mining signal transduction pathway segments from proteinprotein interaction networks. BMC Bioinformatics. 2007, 8: 33510.1186/147121058335.View ArticlePubMedPubMed CentralGoogle Scholar
 Patel VN, Bebek G, Mariadason JM, Wang D, Augenlicht LH, Chance MR: Prediction and Testing of Biological Networks Underlying Intestinal Cancer. PLoS ONE. 2010, 5 (9): e1249710.1371/journal.pone.0012497.View ArticlePubMedPubMed CentralGoogle Scholar
 Goldberg DS, Roth FP: Assessing experimentally derived interactions in a small world. Proc Natl Acad Sci USA. 2003, 100 (8): 43726. 10.1073/pnas.0735871100.View ArticlePubMedPubMed CentralGoogle Scholar
 Sprenger J, Lynn Fink J, Karunaratne S, Hanson K, Hamilton NA, Teasdale RD: LOCATE: a mammalian protein subcellular localization database. Nucleic Acids Res. 2008, D2303. 36 DatabaseGoogle Scholar
 Huh WK, Falvo JV, Gerke LC, Carroll AS, Howson RW, Weissman JS, O'Shea EK: Global analysis of protein localization in budding yeast. Nature. 2003, 425 (6959): 68691. 10.1038/nature02026.View ArticlePubMedGoogle Scholar
 Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Edgar R: NCBI GEO: archive for highthroughput functional genomic data. Nucleic Acids Res. 2009, D88590. 37 DatabaseGoogle Scholar
 Mewes HW, Amid C, Arnold R, Frishman D, Güldener U, Mannhaupt G, Münsterkötter M, Pagel P, Strack N, Stümpflen V, Warfsmann J, Ruepp A: MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004, D414. 32 DatabaseGoogle Scholar
 Smialowski P, Pagel P, Wong P, Brauner B, Dunger I, Fobo G, Frishman G, Montrone C, Rattei T, Frishman D, Ruepp A: The Negatome database: a reference set of noninteracting protein pairs. Nucleic Acids Res. 2010, D5404. 38 DatabaseGoogle Scholar
 Erten MS: Network Based Prioritization of Disease Genes. Master's thesis. 2009, Case Western Reserve UniversityGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.