This article has Open Peer Review reports available.
Arete – candidate gene prioritization using biological network topology with additional evidence types
© The Author(s). 2017
Received: 24 August 2016
Accepted: 12 June 2017
Published: 6 July 2017
Refinement of candidate gene lists to select the most promising candidates for further experimental verification remains an essential step between high-throughput exploratory analysis and the discovery of specific causal genes. Given the qualitative and semantic complexity of biological data, successfully addressing this challenge requires development of flexible and interoperable solutions for making the best possible use of the largest possible fraction of all available data.
We have developed an easily accessible framework that links two established network-based gene prioritization approaches with a supporting isolation forest-based integrative ranking method. The defining feature of the method is that both topological information of the biological networks and additional sources of evidence can be considered at the same time. The implementation was realized as an app extension for the Cytoscape graph analysis suite, and therefore can further benefit from the synergy with other analysis methods available as part of this system.
We provide efficient reference implementations of two popular gene prioritization algorithms – DIAMOnD and random walk with restart for the Cytoscape system. An extension of those methods was also developed that allows outputs of these algorithms to be combined with additional data. To demonstrate the utility of our software, we present two example disease gene prioritization application cases and show how our tool can be used to evaluate these different approaches.
Identification of genes associated with a disease is an essential first step in developing novel treatments and gaining better insight into the underlying mechanisms of disease. Many widely employed contemporary experimental approaches, like genome-wide association studies (GWAS) or differential gene expression analysis, yield lists of genes potentially enriched for promising candidates, which then need to be further refined and verified experimentally. Network-based prioritization approaches are one of the promising strategies that can effectively combine and interpret large volumes of prior knowledge about different types of interactions between biological entities. In particular, two broad strategies of network-based prioritization have emerged: those that consider global network topology by employing some type of diffusion or Markov process formalism [2, 3], and those that focus on local topology in specific network neighborhoods. Biological networks can also be enriched with additional types of data that can potentially be used to further increase the performance of network topology-based methods.
Although a wide variety of disease gene prioritization solutions are now available, the efforts so far have chiefly focused on leveraging specific, pre-defined types of data. In this respect it is possible to identify several typical approaches. The first approach is to develop a specialized integrated knowledgebase resource to support gene prioritization analysis. Some prominent examples in this category include PrixFixe, ENDEAVOR, GeneMANIA, Gene Prospector and DAPPLE. The advantages of such a setup are the ability to closely tailor the analysis method to make best use of these data and the ability to pre-compute some of the more time-consuming analysis steps. However, this comes at the cost of restricting the user’s choices and necessitates continued maintenance of the underlying datasets to ensure they remain relevant. Given the logistic constraints, access to such methods is usually delivered via web page-based interfaces [6, 8, 9] or web services [5, 7], and therefore may not be suitable for cases where confidentiality and data security are important. Approaches of the second type offer some data acquisition functionality, such as calling external web services to further enrich the input provided by the user. Tools following this strategy include Genotator, which performs real-time integration of eleven clinical genetics resources; GPEC, which can query different annotation databases to build up the seed set of genes; and JEPETTO, which can dynamically retrieve additional information from pathway databases. In this case, while the data can be easily kept up-to-date, the analysis approach is usually built around those specific types of data. Lastly, although some tools can work on user-provided datasets, they are only capable of using some pre-defined types of information [13, 14]. Some notable examples of such tools include NetworkPrioritizer and iCTNet.
NetworkPrioritizer supports computation of multiple centrality measures and allows them to be combined using several rank aggregation algorithms. iCTNet is an example of a database-based approach, where a prioritization algorithm relies on a pre-integrated and developer-maintained database. In contrast to these two methods, our approach is based on similarity to a set of representative seeds and allows incorporation of both network-based and other data in the form of node annotations. Given the extent of previous work on disease gene prioritization tools, only a very brief summary could be provided here; for a more comprehensive discussion of the subject we recommend the following reviews [1, 17–19].
Given the potentially complex etiology of diseases and the diverse types of data collected in biomedical research, we believe in the potential benefits of a more flexible approach that is more agnostic with respect to types of data. Such an approach would give greater control to users by allowing them to make the best possible use of their own project-specific datasets as well as any publicly available information. Our tool, Arete, combines network analysis capabilities with integrative analysis and in doing so allows users to further enrich these results with their own information of different types. The network-based analysis component offers two modern prioritization algorithms: random walk with restart and DIAMOnD. Our tool is implemented as an app plug-in for the popular Cytoscape graph analysis suite in order to make the best possible use of the synergies with the data acquisition and analysis capabilities of this system and its rich ecosystem of plug-ins. Our primary goal for this tool is to facilitate interactive, visual exploration of the network by means of filtering and graph annotation, directing users to sets of genes enriched for promising candidates.
Implementation of graph topology analysis methods
Arete offers two reference implementations of network-based gene prioritization approaches: random walk with restart (RWR) and DIAMOnD, which is an iterative, local neighborhood-based method. It has been reported that diffusion-based approaches, like RWR, appear to perform better when candidate genes are somewhat dispersed throughout the network, whereas neighborhood-based approaches perform better when genes are concentrated in tightly linked cliques. Therefore, by offering a robust algorithm in each of these two categories, Arete aims to accommodate both of these cases. In the current version only unweighted and undirected versions of both algorithms are available.
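To make the diffusion-based option more concrete, the following is a minimal sketch of random walk with restart on an unweighted, undirected network. It is not Arete's implementation: the function names, the toy network and the restart probability of 0.5 are illustrative assumptions. The sketch computes the steady-state visiting probabilities in two ways, by solving the linear system (I - (1 - r) W) p = r p0 directly and by repeated propagation, and the two should agree up to numerical tolerance.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def rwr_exact(adj, seeds, restart=0.5):
    """Steady state from the linear system (I - (1 - r) W) p = r * p0."""
    nodes = sorted(adj)
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for v in nodes:
        for u in adj[v]:
            # W is column-normalised: a walker at v moves to each neighbour
            # with probability 1 / degree(v).
            A[index[u]][index[v]] -= (1.0 - restart) / len(adj[v])
    b = [restart / len(seeds) if v in seeds else 0.0 for v in nodes]
    return dict(zip(nodes, solve_linear(A, b)))


def rwr_iterative(adj, seeds, restart=0.5, tol=1e-12, max_iter=1000):
    """Steady state approximated by iterating p <- (1 - r) W p + r * p0."""
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in adj}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in adj}
        for v in adj:
            for u in adj[v]:
                nxt[u] += (1.0 - restart) * p[v] / len(adj[v])
        if sum(abs(nxt[v] - p[v]) for v in adj) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy network (adjacency sets) with a single seed node.
network = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
exact = rwr_exact(network, {"A"})
approx = rwr_iterative(network, {"A"})
ranking = sorted(network, key=exact.get, reverse=True)
```

Candidates would then be ranked by descending steady-state probability, so nodes close to the seeds in the network receive higher ranks.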
The DIAMOnD algorithm starts by considering the immediate neighbors of the seed nodes. For each neighbor it computes, according to the hypergeometric distribution, the probability of having at least as many connections to seed nodes as observed by chance, and selects the node with the smallest such probability. Once a node is picked, it is added to the seed set and the process is repeated until the desired number of candidates is picked. Candidate genes are ranked in ascending order based on the iteration step at which they were picked. Both DIAMOnD and RWR network topology-based methods can be run on their own or combined with other types of data, as explained in detail in the following section. The parameter values for both of these algorithms were set according to the recommendations of their respective original authors. From the analysis reported in both cases, we expect that these values are likely to be near-optimal in the vast majority of cases and will rarely, if ever, need to be adjusted by users.
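The iterative procedure described above can be sketched as follows. This is a simplified illustration rather than the reference implementation: the function names and the toy network are hypothetical, ties are broken alphabetically as an arbitrary assumption, and the connectivity significance is the upper tail of a hypergeometric distribution with N network nodes, s current seed nodes and k links of the candidate.

```python
from math import comb


def hypergeom_tail(N, s, k, ks):
    """P(X >= ks) when k links are drawn from N nodes of which s are seeds."""
    upper = min(k, s)
    favourable = sum(comb(s, i) * comb(N - s, k - i) for i in range(ks, upper + 1))
    return favourable / comb(N, k)


def diamond(adj, seeds, n_candidates):
    """Return candidates in the order in which they are added to the module."""
    N = len(adj)
    module = set(seeds)
    ranked = []
    for _ in range(n_candidates):
        # Candidates are all non-module nodes adjacent to the current module.
        frontier = {v for u in module for v in adj[u]} - module
        if not frontier:
            break
        best = min(
            sorted(frontier),  # alphabetical tie-breaking (an assumption)
            key=lambda v: hypergeom_tail(
                N, len(module), len(adj[v]), len(adj[v] & module)
            ),
        )
        ranked.append(best)
        module.add(best)  # the picked node acts as a seed from now on
    return ranked


# Hypothetical toy network (adjacency sets) with seed nodes A and B.
toy_network = {
    "A": {"B", "C", "X"},
    "B": {"A", "C", "X"},
    "C": {"A", "B", "Y"},
    "X": {"A", "B"},
    "Y": {"C", "Z"},
    "Z": {"Y", "W"},
    "W": {"Z"},
}
order = diamond(toy_network, {"A", "B"}, 3)  # ['X', 'C', 'Y']
```

In this toy case X is picked before C: both have two links to the seeds, but X has only two links in total, so seeing both of them land on seeds is less likely by chance.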
Integrative prioritization approach
Here k is an instance to be scored, c is a set of seed instances of the target class, T is a set of all isolation trees (t) in the ensemble, and n is a set of instances selected at a node of tree t. The scoring metric quantifies the co-occurrence of a particular instance with instances of the target class at different nodes. The balancing parameter α can be adjusted to emphasize either highly specific similarity to a (potentially smaller) number of seed instances (values between 0.0 and 1.0) or overall similarity to multiple seed instances (values above 1.0). The underlying rationale behind the scoring approach is that similar instances are less likely to be randomly separated and therefore will tend to co-occur at the nodes of the tree more frequently compared to unrelated ones. The advantages of the proposed method include capturing the effects of interacting attributes (which will generate purer groups with higher scores), the non-parametric nature of the algorithm and relatively few critical options requiring input from the user.
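The co-occurrence idea can be illustrated with a small sketch. The scoring expression used here is an assumption made purely for illustration and is not the authors' formula: for every tree node containing instance k, it adds (number of other seed instances in the node) raised to the power α, divided by the node size, and averages over trees. The function names, the toy data and the absence of subsampling are likewise hypothetical simplifications.

```python
import random


def tree_nodes(data, idx, rng, depth=0, max_depth=10):
    """Grow one random isolation tree; return the instance set of every node."""
    nodes = [set(idx)]
    if len(idx) <= 1 or depth >= max_depth:
        return nodes
    attr = rng.randrange(len(data[0]))  # random attribute
    values = [data[i][attr] for i in idx]
    lo, hi = min(values), max(values)
    if lo == hi:
        return nodes  # this attribute cannot separate the instances
    split = rng.uniform(lo, hi)  # random split point between min and max
    left = [i for i in idx if data[i][attr] < split]
    right = [i for i in idx if data[i][attr] >= split]
    if not left or not right:
        return nodes
    return (
        nodes
        + tree_nodes(data, left, rng, depth + 1, max_depth)
        + tree_nodes(data, right, rng, depth + 1, max_depth)
    )


def cooccurrence_scores(data, seeds, n_trees=100, alpha=1.0, seed=0):
    """Assumed score: mean over trees of sum over nodes of |other seeds|^alpha / |node|."""
    rng = random.Random(seed)
    seeds = set(seeds)
    scores = [0.0] * len(data)
    for _ in range(n_trees):
        for node in tree_nodes(data, list(range(len(data))), rng):
            n_seeds = len(seeds & node)
            for k in node:
                others = n_seeds - (k in seeds)  # do not count k itself
                scores[k] += others ** alpha / len(node)
    return [s / n_trees for s in scores]


# Hypothetical data: instances 0-2 are seeds, 3 resembles them, 9 does not.
data = [
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.2], [0.15, 0.1], [5.0, 4.0],
    [6.0, 7.0], [7.0, 5.0], [8.0, 9.0], [9.0, 8.0], [10.0, 10.0],
]
scores = cooccurrence_scores(data, seeds=[0, 1, 2])
```

Under this assumed form, instance 3 stays grouped with the seeds through many more random splits than instance 9 and therefore accumulates a higher score, which mirrors the rationale given above.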
The graphical user interface offers three customization options: the number of trees to generate, the balancing parameter and a switch controlling how the split value is selected at tree nodes. As the algorithm is stochastic, selecting a larger number of trees will tend to lead to more consistent results between different runs at the cost of lower speed, though the underlying level of performance will only be adversely affected if this option is set very low. By default, the split point can be any number between the minimum and maximum values of an attribute in the set of instances selected for a particular node during the construction stage. This default behavior can be changed to make all splits equally likely, which is equivalent to rank-transforming all data. The default values for all these parameters were chosen by performing tests on gene sets for particular diseases and Gene Ontology biological process categories between 10 and 100 genes in size. The sets of reference disease genes for this task were taken from the DisGeNET database and were chosen to be distinct from the ones used in the evaluation example described below. The aim was to set all parameter values at levels where an adequate result will be generated in most cases, in order to create a reasonable starting point from which a user can experiment further.
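The difference between the two split selection modes can be seen in a short sketch. The function names are hypothetical and the second function is only one possible way of making all splits equally likely: it picks one of the gaps between consecutive distinct values uniformly at random, which partitions the instances exactly as a uniform split on rank-transformed values would.

```python
import random
from collections import Counter


def split_by_range(values, rng):
    """Default-style split: uniform between the minimum and maximum value."""
    return rng.uniform(min(values), max(values))


def split_by_rank(values, rng):
    """Every partition equally likely: choose a gap between distinct values."""
    distinct = sorted(set(values))  # needs at least two distinct values
    i = rng.randrange(len(distinct) - 1)
    return (distinct[i] + distinct[i + 1]) / 2.0


def left_size(values, split):
    """Number of instances that fall on the lower side of the split."""
    return sum(v < split for v in values)


# One extreme value dominates the range of this hypothetical attribute.
values = [0.0, 1.0, 2.0, 100.0]
rng = random.Random(0)
range_counts = Counter(left_size(values, split_by_range(values, rng)) for _ in range(3000))
rank_counts = Counter(left_size(values, split_by_rank(values, rng)) for _ in range(3000))
```

With range-based splits almost every draw merely isolates the extreme value, whereas rank-based splits produce the three possible partitions about equally often. The rank-based mode may therefore be preferable for attributes with heavy-tailed distributions.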
Cytoscape App user interface
During evaluation, each known “true positive” gene that is withheld from the seed set for a given evaluation run is ranked in its own list of reference unlabeled genes. The tool offers two options for providing such reference lists. The first (default) option is to automatically construct these lists by drawing random non-seed genes from the network. This is likely the most common application scenario, where a user does not have a pre-defined reference set of interest. The second option is for the user to provide their own reference lists. This is done by providing a separate tab-delimited file where the first column is a relevant gene and the rest of the line is its reference list. This option has been used in the first example use-case, where a reference list was constructed using neighboring genes in the genome.
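As an illustration of the file layout described above, the following sketch reads such a tab-delimited file into a dictionary. The function name and the gene identifiers are placeholders, and details such as skipping blank lines are assumptions rather than documented behavior of the tool.

```python
import csv
import io


def read_reference_lists(handle):
    """Map each gene in the first column to the reference genes on its line."""
    lists = {}
    for row in csv.reader(handle, delimiter="\t"):
        fields = [f.strip() for f in row if f.strip()]
        if not fields:
            continue  # ignore blank lines
        gene, *references = fields
        lists[gene] = references
    return lists


# Placeholder content standing in for a user-provided file.
example_file = io.StringIO(
    "GENE1\tNEIGHBOR_A\tNEIGHBOR_B\tNEIGHBOR_C\n"
    "GENE2\tNEIGHBOR_D\tNEIGHBOR_E\n"
)
reference_lists = read_reference_lists(example_file)
```

Each withheld gene would then be ranked only against the genes listed on its own line.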
Lastly, the prioritized genes can be highlighted in the Cytoscape network view by changing the color of respective nodes according to their ranks (Fig. 2 – right panel). A filter can be applied to select highly ranked nodes and, optionally, their neighbors at a particular level.
Example use-case 1: ranking candidate genes in a genomic region
Additional metrics used for the integrative analysis example, with informal descriptions of the properties they capture:
- Overall remoteness from all other nodes
- Density of interlinks among immediate neighbors
- Network “choke points” with a high proportion of shortest paths going through them
- Ubiquitous versus tissue-specific expression
- Location in a dense core versus network periphery
The comparative evaluation considered five different setups. First, the prioritization was done separately using RWR, DIAMOnD and iRF based on the five metrics only. Then, two more runs were performed where DIAMOnD or RWR scores were also included as features in the iRF set. To explore the results, we computed the ROC-AUC statistic according to a previously described method, using leave-one-out, 3-fold and 5-fold cross-validation schemes. Additionally, we looked at the fold-enrichment for known disease genes in different quartiles of the resulting ranked lists. To provide a representative sample of the likely performance of all methods in a “worst case”/baseline scenario, all reported analyses were done with the chosen default options of our application. Therefore, no attempt was made to specifically optimize the settings for each of the diseases in the example dataset. As the associated gene sets are likely to be quite distinctive, we expect that different parameters may be optimal in each individual case. In practice, an expected use-case will usually involve only a single disease or set of genes, and a user may choose to interactively optimize the settings to further improve results.
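The rank-based evaluation can be sketched as follows. The exact procedure of the cited method is not detailed here, so this block shows one common formulation under stated assumptions: each withheld gene is compared with the genes in its own reference list, higher scores indicate stronger candidates, ties count as half, and the per-gene values are averaged. Function names and scores are hypothetical.

```python
def rank_auc(held_out_score, reference_scores):
    """Fraction of reference genes scored below the withheld gene (ties = 0.5)."""
    below = sum(s < held_out_score for s in reference_scores)
    ties = sum(s == held_out_score for s in reference_scores)
    return (below + 0.5 * ties) / len(reference_scores)


def mean_auc(trials):
    """Average the per-gene values over all withheld genes."""
    return sum(rank_auc(score, refs) for score, refs in trials) / len(trials)


# Hypothetical scores: (withheld gene score, scores of its reference genes).
trials = [
    (0.9, [0.1, 0.5, 0.95, 0.2]),
    (0.4, [0.4, 0.1, 0.8, 0.9]),
]
auc = mean_auc(trials)  # 0.5625
```

A value of 0.5 corresponds to a withheld gene placed at random within its reference list, and 1.0 to a withheld gene always ranked first.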
Example use-case 2: ranking candidate genes in a transcriptomic study
For the second example we illustrate how our software can be used in combination with an example transcriptomic study. Here we have used data from a microarray profiling experiment E-GEOD-15245, which investigated how gene transcription in the blood changes in the period preceding multiple sclerosis (MS) relapse. A complete, processed dataset from this study was downloaded from the EBI’s ArrayExpress database. We have chosen samples taken less than a year prior to an observed MS relapse and where “definite MS” was confirmed. The reasoning behind this was that these samples are most likely to capture disease-relevant responses and therefore will be most useful for identification of disease-driving genes. The 24 selected expression profiles were scaled and integrated with the network and a known set of MS genes. Both the network and the MS gene set were taken from the dataset used in the first use-case. The combined dataset was again analyzed using all of the methods available in the Arete tool with all relevant parameters left at the recommended default values. In this case we have chosen to evaluate the performance by drawing 100 random reference genes (for each known MS gene) from all unlabeled genes in the network, as not having a pre-defined reference gene list is more consistent with the expected scenario for transcriptomics-based application cases.
Results and discussion
At the time of writing, we were aware of three tools that offer different variants of the random walk algorithm for the Cytoscape suite; however, all of these offered an approximate, iterative solution rather than an exact one. One of the advantages of the exact solution is that it has been shown to be robust to the restart probability parameter and therefore will produce a near-optimal result without the need for time-consuming optimisation. At present, Arete is also the first tool to provide an implementation of the DIAMOnD algorithm in Cytoscape. In terms of evaluation functionality, the only other tool offering it is GPEC, but GPEC has somewhat limited dataset customization functionality and is no longer available for Cytoscape 3.0 or later. As we outlined in the introduction, with respect to integrative analysis, the diversity of data and integration methods being used is quite extensive. However, the main focus of most efforts has so far been to optimally exploit particular public datasets, or to closely couple the analysis method with specific, pre-generated datasets. In contrast, our intention has been to develop an approach that is flexible and generic. In combination with the easy-to-use data import and acquisition methods of the Cytoscape system, our approach allows users to build and leverage their own resources. Additional flexibility is achieved by: (1) offering performance evaluation capabilities that can be used to explore and understand the impact of particular features and (2) interactive, user-driven exploration of results in the graph interface.
In addition to ROC-AUC analysis, we have also looked at the fold-enrichment, which, again, was explored using 3-fold, 5-fold and leave-one-out cross validation schemes (Fig. 3c-d). For this analysis we have split the ranked lists into four quartiles and compared the actual distribution of known disease genes with the one expected by chance. For all of the methods, a substantially higher enrichment was predominantly achieved in the first quartile, where between 1.6 and 2.4-fold more relevant genes were recovered.
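The quartile-based fold-enrichment can be sketched as below. This is an illustrative reading of the description above, with a hypothetical function name and made-up gene labels: the ranked list is cut into four consecutive parts, and the number of known genes observed in each part is divided by the number expected if known genes were spread evenly through the list.

```python
def quartile_fold_enrichment(ranked, known):
    """Observed / expected counts of known genes in each quartile of a ranking."""
    n = len(ranked)
    total_known = sum(g in known for g in ranked)
    folds = []
    for q in range(4):
        chunk = ranked[q * n // 4:(q + 1) * n // 4]
        observed = sum(g in known for g in chunk)
        expected = total_known * len(chunk) / n  # expectation under chance
        folds.append(observed / expected if expected else float("nan"))
    return folds


# Hypothetical ranking of eight genes, four of which are known disease genes.
ranked_genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
known_genes = {"g1", "g2", "g5", "g8"}
folds = quartile_fold_enrichment(ranked_genes, known_genes)  # [2.0, 0.0, 1.0, 1.0]
```

A value above 1.0 in the first quartile indicates that known disease genes are concentrated near the top of the ranked list.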
Disease-related genes can play a role in more than one disease and are often associated with high network centrality, which is emphasized both by the incorporation of network-specific properties via iRF and by the RWR algorithm. Potentially, this can cause a positive bias with respect to those genes, as inevitably there will be some overlap between sets of genes for different diseases and high centrality genes are more likely to be in this overlap. To explore this possibility, we have looked at the distribution of multi-disease genes in our dataset (Fig. 3b) and investigated whether such effects had a substantial influence on performance (Additional file 1: Figure S1). As about 67% of all genes in our dataset were only involved in one disease, we have split our data into single-disease and multi-disease subsets (2 or more associated diseases per gene) and re-calculated all of the performance statistics for these subsets. Although the performance was slightly higher for multi-disease genes according to both ROC-AUC and fold-enrichment metrics, this difference was too small to indicate a definite and substantial bias in this case (Additional file 1: Figure S1).
To conclude, as previously noted, our results from use-case one also hint at the possibility that at least some of the predictive network-based properties may be particularly effective only in specific cases, and consequently that heterogeneity likely exists between such properties of genes associated with different diseases. The second use-case illustrated how our approach can be used to identify the most relevant disease-causing genes from transcriptomics data. These results indicate that even without further optimization, all of the methods provided in Arete can be suitable for identifying approximately relevant gene sets from experimental data. Therefore, in combination with the interactive visualization capabilities of the Cytoscape system itself, Arete can effectively support analysis of complex biological networks by facilitating identification of smaller, meaningful gene sets for further manual exploration by the user.
Although a large and diverse range of disease gene prioritization software is now available, the emphasis has been primarily on approaches that either work on specific pre-integrated knowledgebases or public web resources, or are only able to consider particular types of biomedical data by design. At the same time, biomedical application cases often rely on their own ‘omics datasets, data from different studies and experiments, and highly specialized expert knowledge. This creates a niche for a more generalized tool that allows non-technical users to exploit project-specific integrated datasets, identify promising combinations of predictive features and find likely candidate genes that are more directly supported by context-specific evidence. Our proposed solution fills this niche by striking a balance between flexibility and ease-of-use, while at the same time delivering adequate levels of performance and evaluation capabilities for comparing different setups. Using the example analysis presented in this paper, we also demonstrated that our proposed multiple evidence integration method can further enhance the performance achievable by network topology-based methods alone.
We would like to acknowledge the assistance offered by Takashi Morizono with deploying and providing continued maintenance and support for the Arete website.
This work was supported by the Grant-in-Aid for RIKEN IMS and CREST from the Japan Science and Technology Agency.
Availability of data and materials
Source code, pre-compiled app, documentation and example files used in this paper are available at the Arete homepage (http://emu.yokohama.riken.jp/arete/arete.html) under the conditions of the GNU GPL v3 licence. The website also includes a guide in the form of a step-by-step tutorial, which can be used to reproduce the analysis results.
AL developed the software, designed the website and prepared the example application datasets. KAB assisted with data analysis and preparation of the manuscript. TT was involved in the development of core ideas behind this work, provided project supervision and contributed to the writing of the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Lan W, Wang J, Li M, Peng W, Wu F. Computational approaches for prioritizing candidate disease genes based on PPI networks. Tsinghua Sci Technol. 2015;20(5):500–12.
- Kohler S, Bauer S, Horn D, Robinson PN. Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet. 2008;82(4):949–58.
- Smedley D, Kohler S, Czeschik JC, Amberger J, Bocchini C, Hamosh A, Veldboer J, Zemojtel T, Robinson PN. Walking the interactome for candidate prioritization in exome sequencing studies of Mendelian diseases. Bioinformatics. 2014;30(22):3215–22.
- Ghiassian SD, Menche J, Barabasi AL. A DIseAse MOdule detection (DIAMOnD) algorithm derived from a systematic analysis of connectivity patterns of disease proteins in the human interactome. PLoS Comput Biol. 2015;11(4):e1004120.
- Tasan M, Musso G, Hao T, Vidal M, MacRae CA, Roth FP. Selecting causal genes from genome-wide association studies via functionally coherent subnetworks. Nat Methods. 2015;12(2):154–9.
- Aerts S, Lambrechts D, Maity S, Van Loo P, Coessens B, De Smet F, Tranchevent LC, De Moor B, Marynen P, Hassan B, et al. Gene prioritization through genomic data fusion. Nat Biotechnol. 2006;24(5):537–44.
- Mostafavi S, Ray D, Warde-Farley D, Grouios C, Morris Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 2008;9 Suppl 1:S4.
- Yu W, Wulf A, Liu T, Khoury MJ, Gwinn M. Gene prospector: an evidence gateway for evaluating potential susceptibility genes and interacting risk factors for human diseases. BMC Bioinformatics. 2008;9:528.
- Rossin EJ, Lage K, Raychaudhuri S, Xavier RJ, Tatar D, Benita Y, Cotsapas C, Daly MJ. Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology. PLoS Genet. 2011;7(1):e1001273.
- Wall DP, Pivovarov R, Tong M, Jung JY, Fusaro VA, DeLuca TF, Tonellato PJ. Genotator: a disease-agnostic tool for genetic annotation of disease. BMC Med Genomics. 2010;3:50.
- Le DH, Kwon YK. GPEC: a cytoscape plug-in for random walk-based gene prioritization and biomedical evidence collection. Comput Biol Chem. 2012;37:17–23.
- Winterhalter C, Widera P, Krasnogor N. JEPETTO: a cytoscape plugin for gene set enrichment and topological analysis based on interaction networks. Bioinformatics. 2014;30(7):1029–30.
- Wang L, Matsushita T, Madireddy L, Mousavi P, Baranzini SE. PINBPA: cytoscape app for network analysis of GWAS data. Bioinformatics. 2015;31(2):262–4.
- Jadamba E, Cho SB, Shin M. NetRanker: a network-based gene ranking tool using protein-protein interaction and gene expression data. BioChip Journal. 2015;9(4):313–21.
- Kacprowski T, Doncheva NT, Albrecht M. NetworkPrioritizer: a versatile tool for network-based prioritization of candidate disease genes or other molecules. Bioinformatics. 2013;29(11):1471–3.
- Wang L, Khankhanian P, Baranzini SE, Mousavi P. ICTNet: a cytoscape plugin to produce and analyze integrative complex traits networks. BMC Bioinformatics. 2011;12(1):1.
- Tranchevent LC, Capdevila FB, Nitsch D, De Moor B, De Causmaecker P, Moreau Y. A guide to web tools to prioritize candidate genes. Brief Bioinform. 2011;12(1):22–32.
- Moreau Y, Tranchevent LC. Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat Rev Genet. 2012;13(8):523–36.
- Gill N, Singh S, Aseri TC. Computational disease gene prioritization: an appraisal. J Comput Biol. 2014;21(6):456–65.
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504.
- Shim JE, Hwang S, Lee I. Pathway-dependent effectiveness of network algorithms for gene prioritization. PLoS One. 2015;10(6):e0130589.
- Official oj! Algorithms [http://ojalgo.org].
- Okamura Y, Aoki Y, Obayashi T, Tadaka S, Ito S, Narise T, Kinoshita K. COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res. 2014;43(Database issue):D82–6.
- Liu FT, Ting KM, Zhou Z-H. Isolation-based anomaly detection. ACM Trans Knowledge Discov Data (TKDD). 2012;6(1):3.
- Pinero J, Queralt-Rosinach N, Bravo A, Deu-Pons J, Bauer-Mehren A, Baron M, Sanz F, Furlong LI. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database J biol databases and curation. 2015;2015:1–17.
- Barrenas F, Chavali S, Holme P, Mobini R, Benson M. Network properties of complex human disease genes identified through genome-wide association studies. PLoS One. 2009;4(11):e8090.
- Chavali S, Barrenas F, Kanduri K, Benson M. Network properties of human disease genes with pleiotropic effects. BMC Syst Biol. 2010;4(1):1–11.
- Ghersi D, Singh M. Disentangling function from topology to infer the network properties of disease genes. BMC Syst Biol. 2013;7(1):1–12.
- Feldman I, Rzhetsky A, Vitkup D. Network properties of genes harboring inherited disease mutations. Proc Natl Acad Sci U S A. 2008;105(11):4323–8.
- Lage K, Hansen NT, Karlberg EO, Eklund AC, Roque FS, Donahoe PK, Szallasi Z, Jensen TS, Brunak S. A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes. Proc Natl Acad Sci. 2008;105(52):20870–5.
- Uhlen M, Fagerberg L, Hallstrom BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson A, Kampf C, Sjostedt E, Asplund A, et al. Proteomics. Tissue-based map of the human proteome. Science. 2015;347(6220):1260419.
- Gurevich M, Tuller T, Rubinstein U, Or-Bach R, Achiron A. Prediction of acute multiple sclerosis relapses by transcription levels of peripheral blood cells. BMC Med Genomics. 2009;2:46.
- Parkinson H, Sarkans U, Shojatalab M, Abeygunawardena N, Contrino S, Coulson R, Farne A, Lara GG, Holloway E, Kapushesky M, et al. ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 2005;33(Database issue):D553–555.