The multiscale backbone of the human phenotype network based on biological pathways

Background Networks are commonly used to represent and analyze large and complex systems of interacting elements. In systems biology, human disease networks show interactions between disorders sharing a common genetic background. We built a pathway-based human phenotype network (PHPN) of over 800 physical attributes, diseases, and behavioral traits, based on about 2,300 genes and 1,200 biological pathways. Using GWAS phenotype-to-gene associations and pathway data from Reactome, we connect human traits based on the common patterns of human biological pathways, detecting more pleiotropic effects and expanding previous studies from a gene-centric approach to one of shared cell processes. Results The resulting network has a heavily right-skewed degree distribution, placing it in the scale-free region of the network topology spectrum. We extract the multi-scale information backbone of the PHPN based on the local densities of the network, discarding weak connections. Using a standard community detection algorithm, we construct phenotype modules of similar traits without applying expert biological knowledge. These modules can be likened to disease classes; however, we classify phenotypes according to shared biology rather than arbitrary disease classes. We present examples of expected clinical connections identified by PHPN as proof of principle. Conclusions We unveil a previously uncharacterized connection between phenotype modules and discuss potential mechanistic connections that are obvious only in retrospect. The PHPN shows tremendous potential to become a useful tool both in unveiling the common biology of diseases and in the elaboration of diagnoses and treatments.


Introduction
In this age of system-wide biology, in which organisms and their environment are considered as a whole, a new field has emerged: studying diseases in relationship to one another. Pioneering studies, such as that of Goh et al. [1], have resulted in the definition of the Human Disease Network (HDN). Elucidating relationships between human traits or diseases is becoming increasingly -genetic disorders. These traits may be related through http://www.biodatamining.org/content/7/1/1 biology. These groups are formed independently of the actual disease classes, based solely on intrinsic network properties.

Background
In this section, we define the fundamental concepts used in the Methods section below to build the PHPN.

Genetic data
The catalog of published GWAS maintained by the National Human Genome Research Institute (NHGRI) at the National Institutes of Health aggregates studies that report phenotype-to-SNP(s) and phenotype/SNP-to-gene associations (http://www.genome.gov/gwastudies/). The NHGRI catalog, downloaded in March 2013, was the primary source of phenotype (PT)-to-gene association data. It reports over 800 PTs associated with approximately 2,300 genes and 6,000 SNPs.
Biological pathways represent elaborate series of cascading biochemical reactions occurring within the cell, and possibly receiving external signals [7]. Pathways govern all major cellular functions, such as cell cycle, cell respiration, or apoptosis (programmed cell death). Biochemical compounds (e.g. nucleic acids, proteins, complexes and small molecules) participating in reactions form a network of biological processes and are grouped into pathways. Reactome is an open-source, open-access, manually curated and peer-reviewed pathway database (http://www.reactome.org). It visually displays structured information about the elements, enzymes, and genes (via their gene products) within many known pathways. The Reactome database was accessed in March 2013.

Networks
Networks (or graphs) provide a means of intuitively visualizing and characterizing complex systems, and have proven to be particularly valuable in modeling biological systems. The statistical analysis of the graph properties offers a quantitative and holistic means of revealing underlying connections among vertices, as well as the emergent global properties. Networks are being used with increasing frequency to analyze large-scale systems. A network, such as PHPN, can take an extraordinarily complex system and reduce it to a relatively simple form, revealing underlying connections and important clustering details that would not be evident from studying individual or non-complex relationships among traits [8].
Formally, a network is a collection of nodes and edges connecting them. The degree, k, of a node is the number of edges incident upon the node, and the degree distribution of the network, P(k), describes the fraction of nodes in the network with degree k. The degree distribution also characterizes global properties of the graph and how the nodes are connected to one another; for example, if they are connected at random, the nodes' degrees are expected to be homogeneous, and the degree distribution would be binomial. More often in biology, networks are highly heterogeneous, with a "heavy-tailed" degree distribution, placing them in the scale-free family. This means that the degree distribution follows a power law rather than decaying exponentially. Within the network, this translates into the presence of "hubs" (a minority of highly connected nodes). When the degree distribution of a "scale-free" network is plotted on a log-log scale, the resulting curve is approximately linear [8]. In the case of relatively small networks, it is impossible to demonstrate the presence of a scale-free network. We can, at best, show the existence of a power-law type degree distribution, and not dismiss the scale-free hypothesis. The clustering coefficient (CC) of a network measures the degree to which nodes tend to form closely knit communities with a higher than average connectivity [9]. Networks found in nature, in particular social and biological networks, show a higher degree of clustering than that observed in randomized networks of identical size. The average path length (APL) of a network represents the average of the minimum number of edges separating any two vertices.
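The three statistics defined above can be illustrated on a toy graph. The following is a minimal sketch in plain Python, not the code used in the study; the five-node graph and the function names are hypothetical, and the CC is computed here as the mean of the local clustering coefficients, which is one common convention and is assumed rather than taken from the text.

```python
from collections import Counter, deque
from itertools import combinations

# Toy undirected graph (hypothetical data, not the PHPN).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def degree_distribution(adj):
    """P(k): fraction of nodes with degree k."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / len(adj) for k, c in sorted(counts.items())}

def average_clustering(adj):
    """Mean local clustering: links among a node's neighbours / possible links."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # local coefficient taken as 0 when degree < 2
        links = sum(1 for x, y in combinations(nbrs, 2) if y in adj[x])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def average_path_length(adj):
    """Mean shortest-path length over reachable node pairs (BFS per node)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

pk = degree_distribution(adj)    # {1: 0.2, 2: 0.6, 3: 0.2}
cc = average_clustering(adj)     # about 0.467
apl = average_path_length(adj)   # 1.7
```

On this toy graph the triangle a-b-c drives the clustering up, while the chain c-d-e lengthens the average path.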
In our study, we build a bipartite network [10], consisting of two disjoint sets of nodes. Nodes within one set have no connections between them; each can only be connected to nodes of the other set. A bipartite network is a natural choice when dealing with two different types of data (Figure 1b), in our case phenotypes and pathways.
From the bipartite network, one can project the data onto either of the data spaces (Figure 1a,c). In either single dataset space, the nodes are connected to one another through a vertex of the other space. By ignoring the different types of data, all network properties described above remain valid on the bipartite network (as a single data set network) and on either projection. This type of network gives us three degree-distributions, one for each projection, and one for the bipartite network. Each degree distribution shows how many links each node has. Nodes in a projection of a bipartite network are connected if they share at least one node in the other group. This gives us the ability to visualize connections within a group.
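The projection rule described above (two nodes of one type are linked if they share at least one neighbour of the other type) can be sketched as follows. This is an illustrative example in plain Python; the membership table and the helper names `invert` and `project` are hypothetical.

```python
from itertools import combinations

# Hypothetical phenotype -> pathway memberships (toy data).
bipartite = {
    "PT1": {"PW1", "PW2"},
    "PT2": {"PW2", "PW3"},
    "PT3": {"PW3"},
    "PT4": {"PW4"},
}

def invert(memberships):
    """Swap the two node sets: pathway -> phenotypes."""
    inverted = {}
    for left, rights in memberships.items():
        for right in rights:
            inverted.setdefault(right, set()).add(left)
    return inverted

def project(memberships):
    """Link two keys whenever they share at least one node of the other type."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(memberships), 2)
        if memberships[a] & memberships[b]
    }

pt_edges = project(bipartite)          # projection onto the phenotype space
pw_edges = project(invert(bipartite))  # projection onto the pathway space
```

Here PT4 shares no pathway with any other phenotype, so it has no edge in the phenotype projection.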

Human disease networks
In recent years there has been a trend toward studying disease through network-based analysis of various systems of connections between diseases. The result was the Human Disease Network (HDN) [1]. The nodes in the HDN represent human genetic disorders and the edges represent various connections between disorders, such as gene-gene or protein-protein interactions, to name a few. The HDN is helpful in visualizing connections among human disorders on a large scale. The underlying connections of the HDN contribute to the understanding of the basis of disorders, which in turn leads to a better comprehension of human diseases.
Figure 1 Bipartite network schematic. A bipartite network (b) made of two data sets, the "circles" and the "rectangles". Projections in the "circle" space (a) and in the "rectangle" space (c).
One study, by Goh et al. [1], explored the HDN built on genes shared by different diseases. Another study, which is similar in some ways to ours, by Li et al. [11], traced the SNPs connecting disease traits. In 2009, Suthram et al. [12] found that when diseases were compared by an analysis of disease-related mRNA expression data and the human protein interaction network, there were significant similarities between some diseases and between some drug treatments. In 2009, Barrenas et al. [4] further studied the genetic architecture of complex diseases by doing a GWAS, and found that complex disease genes are less central than the essential and monogenic disease genes in the human interactome. In the present work, we expand our study to include not only disease traits, but also behaviors and normal variations in humans, such as hair color, and explore large portions of non-coding variants in the human genome. Links between PTs are based on overlapping biological pathways (Section "Pathway-based human phenotype network").

Pathway-based human phenotype network
In this paper, we chose to mesh the methods and results sections, as we present multiple different algorithms (i.e. to build, filter, and identify the modules in the PHPN). Each subsection presents and applies a new method, building on the resulting network of the previous one.

Building the PHPN
Here we describe our method to construct a network of human phenotypes (traits and diseases) based on shared biological pathways of the associated genes. This is accomplished by linking genes to phenotypes (PTs) using the hundreds of GWAS in the NHGRI catalog. Genes were further linked to pathways (PWs) using Reactome. By building these associations, we were able to link phenotypes with genes involved in the same pathways. The steps used to build the network are illustrated in Figure 2 and described as follows:
1. From the NHGRI catalog, extract all PTs and link them to their mapped genes. PTs with no mapped genes are omitted;
2. From Reactome, extract all genes in the database and link them to their associated pathways;
3. Match the genes associated with each phenotype to their associated pathways;
4. Connect PTs with overlapping pathways with an undirected edge, setting the edge weight as the number of overlapping pathways.
Figure 2 Model of how the PHPN was built. Phenotype-to-gene associations were obtained using the NHGRI GWAS catalog, while gene-to-pathway associations were obtained using Reactome. Edges were drawn between phenotypes with overlapping pathways. Edge weights represent the number of overlapping pathways. PT indicates Phenotype, PW indicates Pathway.
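The four construction steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the pipeline used in the study: the two dictionaries stand in for the NHGRI catalog and Reactome, and all phenotype, gene, and pathway identifiers are hypothetical.

```python
from itertools import combinations

# Hypothetical stand-ins for the NHGRI catalog and Reactome.
pt_to_genes = {
    "PT1": {"G1", "G2"},
    "PT2": {"G2", "G3"},
    "PT3": {"G3", "G4"},
    "PT4": set(),  # no mapped gene: omitted in step 1
}
gene_to_pathways = {
    "G1": {"PW_A"},
    "G2": {"PW_A", "PW_B"},
    "G3": {"PW_B", "PW_C"},
    "G4": {"PW_C", "PW_D"},
}

def build_network(pt_to_genes, gene_to_pathways):
    # Steps 1-3: keep PTs with mapped genes and collect each PT's pathways.
    pt_to_pw = {}
    for pt, genes in pt_to_genes.items():
        if not genes:
            continue
        pathways = set()
        for gene in genes:
            pathways |= gene_to_pathways.get(gene, set())
        pt_to_pw[pt] = pathways
    # Step 4: undirected edges weighted by the number of shared pathways.
    edges = {}
    for a, b in combinations(sorted(pt_to_pw), 2):
        shared = pt_to_pw[a] & pt_to_pw[b]
        if shared:
            edges[(a, b)] = len(shared)
    return pt_to_pw, edges

pt_to_pw, edges = build_network(pt_to_genes, gene_to_pathways)
```

In this toy input, PT1 and PT3 share no gene but are still linked with weight 1 through PW_B, which is the kind of connection a purely gene-based network would miss.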
We filter out isolated PTs with no connections to the rest of the network. We are only interested in PTs that have been associated with a gene, and their possible shared biology. The original NHGRI database contains over 800 PTs; after removing the isolated nodes, the PHPN contains 408 nodes, each connected to at least one other node.
This flexible process of building phenotype-gene-pathway associations also allowed us to examine the network from multiple configurations. Specifically, we were also able to construct a pathway network following the same logic as the HDN (Section "Human disease networks"): connecting pathways based on shared phenotypes, as well as a bipartite graph with links between PTs and pathways.

The bipartite network
The bipartite network consists of 1,523 vertices (408 PTs, 1,115 PWs) and over 10,000 edges, with an average degree k ≈ 7 (Figure 3). We do not show the intermediate stage of the genes, as this makes the network difficult to interpret. Indeed, highly connected PTs are connected to 40+ PWs, and highly connected PWs to 100-300 PTs. Height is clearly associated with most pathways, forming a major hub in the PHPN. However, it is safe to suppose that the size of the height hub reflects a bias, because height is recorded in most studies. It is unclear what the implications of this and other data biases are.

The unfiltered PHPN
Because we focused this study on phenotypic connections, we projected the bipartite network presented above onto the "phenotype space" (Section "Networks" and Figure 1). The vertices in this network are only the PTs. We draw an undirected edge e_ij between two PTs i and j if they are associated with at least one common pathway. The weight ω_ij of an edge e_ij is simply the number of pathways the phenotypes have in common. The result of the projection is the unfiltered Human Phenotype Network (Figure 4). It has 814 nodes and over 40,000 edges. Once the 406 isolated nodes are removed, the remaining 408 PTs and 41K edges form a single connected component with an average degree k ≈ 200. Figure 4 illustrates the sheer density of the unfiltered network and how difficult it becomes to decipher the results. Even when zoomed in (Figure 4b), the network is too dense to provide any easily usable information. The degree distribution in Figure 4c does not give adequate insight into the internal structure of the network. From the results in this section, it was clear that more work had to be done on the "raw" PHPN in order for it to reveal key clinical information, both from a visual and statistical perspective. Below we describe the filtering method used and the new PHPN resulting from this filtering.

Extracting the information backbone
Biological networks, in their raw form, are in general extremely dense. This "hairball effect" makes interpretation nearly impossible, especially from a visual perspective. Networks are, however, first and foremost visual tools. It is their relative intuitiveness and simplicity that makes them attractive for presenting data to a large audience. To make the PHPN usable and streamline our analysis, we need to extract the most significant links from the dense network: the backbone of the PHPN. Because of the scale-free nature of the PHPN, using a global weight (GW) threshold to eliminate edges is inappropriate. Instead, we use the multi-scale filtering algorithm outlined by Serrano et al. [6] to extract the PHPN's backbone. In place of a global threshold, the algorithm takes advantage of local fluctuations in edge weight to prune edges, while preserving the network's essential structure and global properties. Specifically, we apply a disparity filter (DF) to the network: an algorithm whose null hypothesis is that the normalized weights of the edges incident to a given node with degree k are produced by a random assignment from a uniform distribution. For each edge, we calculate the probability α_ij that the edge weight is compatible with the null hypothesis, which is given by:
\[ \alpha_{ij} = 1 - (k - 1) \int_0^{p_{ij}} (1 - x)^{k-2} \, dx \]
where k is the degree of the node to which the edge under consideration is attached, and p_ij is the normalized edge weight, given by:
\[ p_{ij} = \frac{\omega_{ij}}{s_i} \]
where ω_ij is the edge weight and s_i is the strength of the node under consideration (i.e. the sum of all weights of edges incident to the node). Edges are then preserved based on an imposed significance level α; in other words, for each edge, if α_ij < α, then the edge is preserved. It should be noted that for each edge the DF produces two independent values, α_ij and α_ji, one for each of the two nodes connected by the edge.
To resolve this, Serrano et al. propose two alternatives: under the OR rule, an edge is preserved if (α_ij < α OR α_ji < α); under the AND rule, both conditions must hold (α_ij < α AND α_ji < α) for the edge to be preserved. After experimenting with both rules, we found that the more restrictive AND rule best conserves the original network properties. This is due to the sheer density of the unfiltered network.
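The disparity filter with its OR and AND rules can be sketched as below. Evaluating the filter's integral under the uniform null model gives the closed form α_ij = (1 − p_ij)^(k − 1), which this sketch uses directly. The function name, the edge-dictionary representation, and the toy weights are assumptions made for illustration; this is not the implementation used in the study.

```python
def disparity_backbone(weights, alpha, rule="and"):
    """Filter a weighted undirected graph given as {(i, j): weight}."""
    strength, degree = {}, {}
    for (i, j), w in weights.items():
        for node in (i, j):
            strength[node] = strength.get(node, 0.0) + w
            degree[node] = degree.get(node, 0) + 1

    def alpha_from(node, w):
        # Closed form (1 - p)^(k - 1). For a degree-1 node this is 1.0,
        # so such a node never marks its only edge as significant.
        p = w / strength[node]
        return (1.0 - p) ** (degree[node] - 1)

    kept = {}
    for (i, j), w in weights.items():
        sig_i = alpha_from(i, w) < alpha
        sig_j = alpha_from(j, w) < alpha
        keep = (sig_i and sig_j) if rule == "and" else (sig_i or sig_j)
        if keep:
            kept[(i, j)] = w
    return kept

# Hypothetical toy network: "h" and "a" are hubs with one strong mutual link.
toy = {
    ("h", "a"): 10, ("h", "b"): 1, ("h", "c"): 1, ("h", "d"): 1,
    ("a", "b"): 8, ("a", "c"): 1, ("a", "d"): 1,
}
and_edges = disparity_backbone(toy, alpha=0.2, rule="and")
or_edges = disparity_backbone(toy, alpha=0.2, rule="or")
```

With α = 0.2, the edge h-a is significant from both endpoints and survives either rule, whereas a-b is significant only from the point of view of b and therefore survives the OR rule but not the AND rule.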
Serrano et al. [6] have shown, in a variety of data sets, that the backbone analysis is more successful in extracting meaningful links from dense networks than more conventional reduction algorithms, such as global thresholds, though these data sets did not include biological data. Specifically, the algorithm reduces the number of edges while maintaining a large fraction of the nodes and weights in the unfiltered network, thus preserving many features of the network at all scales. By charting the changes in the number of edges, nodes, total weight and CC as α is adjusted (Figure 5), we not only demonstrate how these features are preserved in our filtered Human Phenotype Network, but also provide a rationale for which significance level cutoff to use. Indeed, Serrano et al. [6] have shown that these metrics give sufficient information about the network over varying values of the threshold α in order to ensure an adequate filtering of the network while keeping the backbone intact.
In Figure 5a and the close-up in 5b, we quantitatively identify α ≈ 0.25 as the value that conserves a CC close to that of the original network, along with ∼90% of the PTs, ∼36% of the weights, and only ∼8% of the edges. The resulting backbone of the PHPN is presented in Figure 6. To understand the advantages of the DF over a straightforward GW cutoff, we determine the cutoff value in Figure 5c that will result in a global cutoff network that also retains ∼8% of the edges. We compare the statistical differences between the graphs resulting from these two filtering methods (Figure 7).
Results in Figure 7 clearly demonstrate the advantages of the disparity filter compared to a global weight cutoff for an identical number of edges. The DF conserves over 90% of the phenotypes versus ∼50% for GW. In conclusion, the backbone keeps more phenotypes than the GW filtering for the same number of connections, making the network less dense. Moreover, the relatively low average degree, the heavy-tailed degree distribution of the PHPN backbone resulting from the DF filtering, and the high clustering coefficient and short average path length indicate an interesting module structure.
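To make the global weight side of this comparison concrete, the sketch below keeps a target fraction of the heaviest edges and reports the fraction of nodes that survive. It is a hypothetical illustration: the function names, the tie-handling choice, and the toy weights are assumptions, and the numbers it produces are unrelated to those in Figure 7.

```python
def global_weight_filter(weights, edge_fraction):
    """Keep the heaviest edges so that about `edge_fraction` of them remain."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    n_keep = max(1, round(edge_fraction * len(ranked)))
    cutoff = ranked[n_keep - 1][1]
    # Edges tied with the cutoff weight are all kept.
    return {edge: w for edge, w in ranked if w >= cutoff}

def node_fraction(kept, original):
    """Fraction of the original nodes still touched by a kept edge."""
    def nodes(edge_map):
        return {n for edge in edge_map for n in edge}
    return len(nodes(kept)) / len(nodes(original))

# Hypothetical toy network: one heavy triangle and one light triangle.
toy = {
    ("a", "b"): 9, ("a", "c"): 8, ("b", "c"): 7,
    ("d", "e"): 2, ("e", "f"): 1, ("d", "f"): 1,
}
kept = global_weight_filter(toy, edge_fraction=0.5)
retained = node_fraction(kept, toy)  # 0.5: the light triangle is lost entirely
```

A single global cutoff removes every node whose edges are all light, regardless of how important those edges are locally, which is the behaviour the disparity filter is designed to avoid.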

Modules detection
In the medical literature, diseases are grouped in disorder classes according to an ontology of the biomedical domain [13]. In their gene-based HDN, Goh et al. label the diseases according to their disorder class [1]. Classes make "bins" in which all diseases are sorted, according to their "natural class". Therefore, all cancers are grouped together, all cardiovascular diseases together, all gastrointestinal disorders together, and so on. We see two major drawbacks to this classification method. First, the classes are semi-arbitrary, based solely on qualitative clinical observations and not on the quantitative nature of the disorder and its underlying biology. Second, manual classification is extremely tedious and subjective. We argue that, in this case, we can achieve interesting results by applying a community detection algorithm to the filtered PHPN. This method sorts the phenotypes into classes with shared biology, rather than shared clinical presentation. Communities, or modules, of nodes within the network can be identified by maximizing the modularity, a measure of the strength of division of a network into modules [14]. A community is identified when a group of nodes has more connections among its members than would be expected by random chance, often due to some shared properties (in our case, shared biology) between the nodes in the community. The clustering coefficient (CC) measures the degree to which nodes tend to form closely knit communities with a higher than average connectivity, while a high modularity score indicates the interconnectedness, and thus the strength, of the communities. The Louvain method of community detection [15] uses a greedy optimization method to maximize the modularity and determine the most favorable division of the network into communities.
It is a widely accepted algorithm for building communities (or modules) within a network without expert knowledge, although other methods, such as Infomap, are also widely used. Refer to the comparative analysis of Lancichinetti et al. [16] for more details. We run the module detection algorithm on the backbone of the PHPN, extracting the modules detected (Figure 6).
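The quantity that the Louvain method maximizes can be computed directly for any candidate partition. The sketch below evaluates the standard weighted modularity Q = Σ_c [ Σ_in(c) / 2m − ( Σ_tot(c) / 2m )² ], where Σ_in(c) is twice the weight of edges inside community c, Σ_tot(c) is the total strength of its nodes, and m is the total edge weight. It only scores a given partition and does not implement the Louvain search itself; the function name and the toy graph are hypothetical.

```python
def modularity(weights, community):
    """Modularity Q of a partition of a weighted undirected graph.

    weights:   {(i, j): weight}, no self-loops
    community: {node: community label}
    """
    two_m = 2.0 * sum(weights.values())
    internal, total = {}, {}
    for (i, j), w in weights.items():
        ci, cj = community[i], community[j]
        total[ci] = total.get(ci, 0.0) + w
        total[cj] = total.get(cj, 0.0) + w
        if ci == cj:
            internal[ci] = internal.get(ci, 0.0) + 2.0 * w
    return sum(
        internal.get(c, 0.0) / two_m - (total[c] / two_m) ** 2
        for c in total
    )

# Hypothetical toy graph: two triangles joined by a single bridge (c-d).
toy = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 1,
}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

q_split = modularity(toy, split)    # 5/14, about 0.357
q_merged = modularity(toy, merged)  # 0.0
```

Splitting the graph along the bridge scores higher than placing all nodes in a single module, which is the signal a modularity-maximizing method follows.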
The module detection algorithm identified 11 modules, of which 6 are part of the largest connected component, and 5 are small satellite groups of a few phenotypes. Table 1 gives the phenotypes in each group with the highest weighted degree, that is, the strongest connections to other PTs in the network.
Table 1 note: The colors correspond to those in Figure 6. The number in parentheses is the number of PTs in the module.
By applying a community detection algorithm to the filtered network, we are able to classify traits and diseases by quantifying their shared genetic mechanisms. This classification allows us to identify non-intuitive relationships between diseases and traits, elucidating the shared etiology for certain phenotypes.

Clinical and biomedical implications
The appropriateness of the PHPN was assessed by examining specific edges within communities (Figure 6). Specifically, we interrogated pairwise connections within the community shown in blue and asked (1) whether any constitute links between phenotypes previously known to share biological connections and (2) where no relationship is known, whether we can understand how the phenotypes may be indirectly linked based on the primary literature, thereby providing novel insights that are not only reasonable but easily visualized using our method.

HDL cholesterol (HDL) and Alzheimer's disease (AD)
The apolipoprotein E (APOE) gene is the gene most significantly associated with AD [17,18] and is also highly associated with multiple lipid traits [19-22]. The existence of an edge between HDL and AD in our network provides clear proof of principle that PHPN can detect relationships between two PTs known to be associated through a validated biological mechanism. PHPN successfully identified four common genes and six common pathways between HDL and AD (Table 2). The common pathways identified by PHPN that connect HDL and AD also support existing hypotheses about the lipid, inflammatory, and amyloid mechanisms involved in AD pathogenesis [23-26]. It is important to note that while PHPN used four common genes to detect the six common pathways between HDL and AD, these pathways harbor numerous potential candidate genes that could be used to further interrogate the genetic architecture of both AD and HDL. The promiscuous nature of the gene-to-pathway assignment employed by PHPN ensures that the method is robust to missingness of the genes mapped in the NHGRI catalog.

Iron status biomarkers (IB) and cognitive performance (CP)
There is substantial evidence that iron is essential for dendritic growth, synaptogenesis, and myelination, and several studies indicate that early iron deficiency can lead to life-long cognitive impairment [27-29]. Importantly, upon review of the related literature, we were unable to find any single gene associated with both iron biomarkers and cognitive performance. However, given the clinical relevance of iron levels to neurocognitive function [30-32], we asked whether PHPN could illuminate any unknown connections between IB and CP. PHPN, as predicted, did not identify any common associated genes between IB and CP; interestingly, however, the algorithm identified five enriched biological pathways shared between the two traits (Figure 8 and Table 3). The identification of enriched biological pathways shared between IB and CP, in the absence of any common associated genes, indicates that the connection between these two traits may be explained in part by genes located in the identified pathways that have yet to be adequately interrogated by investigators. The discovery of these shared biological pathways underscores the strength of PHPN in identifying connections between two traits that may not share any direct genic connections. This demonstrates that while PHPN utilizes the information gained from GWAS to identify phenotypic connections, it is still able to identify important relationships between PTs even in the absence of explicit genic connections.

von Willebrand factor and FVIII levels (vWF) and hippocampal atrophy (HA)
Two traits that were connected in the PHPN but did not share any common associated genes or any clear-cut biological relationship were vWF and HA. vWF promotes platelet adhesion to subendothelial tissues at the site of vascular injury and is the carrier protein for coagulation factor VIII (FVIII); FVIII acts as a co-factor in the coagulation cascade, accelerating the activation of factor X by factor IX [33,34]. Together, vWF and FVIII levels are important hemostatic factors involved in the pathophysiology of various blood [35,36] and cardiovascular [37,38] conditions. Additionally, circulating vWF is used as a biomarker for inflammation [39]. Hippocampal atrophy (HA) is characterized by decreased hippocampal volume. Because the hippocampus is the region of the brain that is essential for memory formation, abnormalities in this region have been seen in various neurodegenerative disorders such as dementia and AD [40,41]. PHPN identified a connection between vWF and HA, with the unifying factor being a single shared pathway (Table 4). Because the relationship between these two PTs was not expected, we examined possible biological connections between the two via literature review. Upon review, we discovered a recently published study that interrogated inflammatory biomarkers for association with hippocampal volume; it is important to note, however, that the biomarkers assessed in this study did not include vWF [42]. Further research revealed strong associations between atrial fibrillation and both phenotypes: increased levels of vWF associate with incidence of atrial fibrillation [43], and incidence of atrial fibrillation associates with increased hippocampal atrophy [44,45] (Figure 9). Through this analysis, PHPN exposed the atrial fibrillation phenotype as a key connector between vWF and HA: none of these three PTs share any common genetic risk factors, as reported in the GWAS catalog, but all three phenotypes share a common biological pathway (Table 4).
Therefore, the PHPN was able to identify a possible, and plausible, indirect relationship between vWF and HA through the unifying, but independent, phenotype of atrial fibrillation. Thus, PHPN provides a novel means to identify inter-relationships between hemostatic, cardiovascular, and neurological conditions that may otherwise have gone unnoticed. It is also interesting to note that the single overlapping pathway between vWF and HA, the KEGG aggregate Metabolic Pathway ([1100]), is a comprehensive pathway consisting of all the metabolic pathways.

Discussion
PHPN provides a means of integrating the accumulating wealth of genomic and phenotypic data and computationally identifies significant links between traits, attributes and diseases. This model has tremendous potential as a clinical tool in identifying risk factors for certain diseases, or common drug targets. By constructing a network based on pathways, we were able to associate phenotypes based on the shared biological processes involving common genetic components and pleiotropic effects. Our network of human traits based on ∼2,300 genes, ∼1,200 biological pathways and 800+ phenotypes is more comprehensive than that of previous studies. We combine GWAS data, which associates PTs to genes, with the data from Reactome, which links genes to pathways. We extract the backbone of the PHPN using the disparity filter, retaining the significant connections. Our statistical analysis of the network properties places the PHPN in the scale-free family, showing once more how ubiquitous network structures with heavy-tailed degree distributions really are in biological, social, and natural networks. The automatic classification of phenotypes into "phenotype classes", using the network's topological modularity and a standard community detection algorithm, showed very promising results. Indeed, in contrast to what was achieved in previous studies and manual classification, we are able to highlight modules of phenotypes with potentially interesting shared biology, rather than arbitrary disease types. Despite its apparent simplicity, PHPN is an adaptable network algorithm that can elucidate both intuitive and previously undiscovered biological connections between PTs, deftly characterizing the shared genetic mechanisms in the former and identifying unifying genetic traits in the latter.
The ability to recognize biological connections, quantified by shared genes and their associated biological pathways, between seemingly disparate phenotypes provides researchers with a unique view of the pleiotropic biological environment that underlies the human condition. Discovering additional, novel connections between phenotypes known to share certain biological traits provides additional information that could be exploited in future hypothesis-based studies. Recognizing the connections between different traits/phenotypes is an integral first step in understanding the dynamic, and highly inter-related, genetic architecture underlying most complex diseases; once these connections are illuminated they may provide the necessary framework for the generation of novel and innovative therapeutic interventions. For future work, we are interested in integrating more datasets on gene interactions into our network, such as SNPs and protein-protein interactions. Furthermore, we are currently working on three angles: (1) comparing the PHPN to the HDN and other previous work on phenotype networks, (2) running statistical significance tests, such as data set randomization, and (3) refining our statistical methods, comparing algorithms for pruning our network and identifying communities that may produce optimal results in extracting the significant interactions in the PHPN.