Mango: combining and analyzing heterogeneous biological networks

Background Heterogeneous biological data such as sequence matches, gene expression correlations, protein-protein interactions, and biochemical pathways can be merged and analyzed via graphs, or networks. Existing software for network analysis has limited scalability to large data sets or is only accessible to software developers as libraries. In addition, the polymorphic nature of the data sets requires a more standardized method for integration and exploration. Results Mango facilitates large network analyses with its Graph Exploration Language, automatic graph attribute handling, and real-time 3-dimensional visualization. On a personal computer Mango can load, merge, and analyze networks with millions of links and can connect to online databases to fetch and merge biological pathways. Conclusions Mango is written in C++ and runs on Mac OS, Windows, and Linux. The stand-alone distributions, including the Graph Exploration Language integrated development environment, are freely available for download from http://www.complex.iastate.edu/download/Mango. The Mango User Guide listing all features can be found at http://www.gitbook.com/book/j23414/mango-user-guide.


Background
In the present Big Data era, one of the great challenges is to be able to compare or integrate diverse data types. Modern biological research produces large and heterogeneous data sets, and there are many ways to categorize or display each type of data. The 2014 Nucleic Acids Research Database Special Issue counted 1552 online biological databases [1]. It is often illuminating, even essential, to examine important biological problems using different types of data. For example, new discoveries often emerge when a biologist is able to interrogate gene expressions in the context of biological pathways [2]. A common method to analyze related data relies on graphs, or networks, where data of various types are linked and key network features or subsets are identified [3][4][5].
Many graph analysis solutions have been written in Java, most notably Cytoscape [6]. Started in 2002, Cytoscape has an impressive array of features. However, like other Java programs, the software slows to non-operational levels when handling large (>1 M link) biological networks due to Java Virtual Machine limitations [7]. Non-Java graph tools either do not provide analysis functions, or provide only libraries which users must incorporate into their own software solutions. Overall, many graph tools focus solely on one functionality, i.e., either analysis or visualization, and require users to integrate two or more tools for one project. Multi-graph comparison and integration are further complicated by differing graph attributes from heterogeneous data sets. Many tools ignore or limit the number of attributes associated with a graph. A comparison of currently available graph analysis and visualization software [6][8][9][10] is given in Table 1.
To address these limitations, we have developed a stand-alone graph analysis and visualization software environment called Mango to help biologists and other researchers efficiently integrate and explore heterogeneous networks larger than previously possible. A 4 million link network can be loaded into Mango in 30 seconds on a Mid 2010 Mac mini computer with a 2.4 GHz (gigahertz) Intel Core 2 Duo processor and 8 GB RAM (random access memory). As a comparison, Cytoscape took 6 minutes to load that same network file on the same computer using its default configuration. Mango possesses the scalability to handle larger networks, the expressive power of a new Graph Exploration Language (Gel), and the convenience of unlimited graph attributes with automatic graph attribute merging and promotion. Within the integrated development environment, Gel commands can be edited, run line-by-line, or saved as scripts to reproduce results. Script files enhance the speed and reproducibility of analysis [11]. Mango provides both comprehensive graph analyses and real-time 3-dimensional (3D) visualization. Mango is a cross-platform C++ program that runs on Mac OS X 10.9 or later, Windows 7 or later, and many Linux variants. It is freely available from our website (http://www.complex.iastate.edu/download/Mango) and the Mango User Guide is hosted at GitBook (http://www.gitbook.com/book/j23414/mango-user-guide).

The Mango user interface
Mango updates its display in real-time at each stage of analysis to facilitate the integration and modification of multiple large networks. Mango contains a primary window divided into four areas (Fig. 1). The graph canvas area is fully interactive, responding to mouse and keyboard actions to zoom, move, rotate, and auto-layout the displayed graphs. By dragging and rearranging tabs, multiple graphs can be viewed simultaneously, easing multi-network comparison. Mango functions are mostly carried out through its command console or Gel code editor. The Gel code editor allows commands to be run line-by-line, edited, and saved as Gel script files. Gel script files can then be shared among researchers, reproducing a 3D layout or network analysis pipeline. Finally, the data area lists currently loaded graphs, their sizes and attributes. Interactive real-time network visualization in Mango helps hone and refine each step of analyses. Mango is built on multiple layers of implementation that are seamlessly combined to form an integrated solution for graph analysis (Fig. 2).

The Graph Exploration Language (Gel)
Table 1 Comparison of graph analysis and visualization software (fragments of three rows). A graph database queried with Cypher: Cypher graph query language; queries are based on a combination of topology and attributes; the whole database is one huge graph; 2D layouts only; a node or link must be clicked to see its attributes on a separate panel; nodes are only labeled by numbers; [...]port rather than for visualization. Tulip (C++, v. 4.6.1): a set of C++ libraries for graph analysis; can also be run as a stand-alone program; plug-ins can be created in Python; 2D visualization; 3D is available through plug-in or Python; had some 3D layout algorithms; more useful to users who program C++ directly; more analysis than visualization features. NetworkX (Python): Python module for graph analysis; must export to other software or [...]; useful only as an analysis tool.

A graph is defined as a set of nodes (V) and links (E), where a node represents some entity and a link represents a relationship between a pair of entities. In practice, graphs also have added annotations called attributes. Currently, Gel provides four basic data primitives (string, int, float, and double) as well as the aggregate data types node (V_attr), link (E_attr), and graph. Each node and link type can have any number of attributes of the four primitive types in any order, and each attribute has a distinct name and a specified data type. The first attribute in a node type must be a string to denote the node name, and a link is identified by a pair of node names. All node and link attributes have default values, which are usually zero for numeric types or the empty string, but users can define other default values during node and link type declarations. Graphs are defined based on a pair of node and link types. For example, Fig. 3a shows Gel code that defines and initializes two graphs, G_A and G_B.
Node and link types are defined with their attributes inside parentheses and brackets; the brackets denote non-directional link types, whereas arrows <> denote directional link types. For example, G_A is declared with ntA and ltA and is initialized by the graph literals enclosed within the braces. Besides defining a graph in the native Graph Exploration Language, Mango can read graph data in tabular or CSV (comma-separated values) format using the import command. A properly formatted graph file lists nodes with their attributes and then links with their attributes. A single line containing a hyphen separates the node list from the link list. The full description of the import command is in the Mango User Guide.

Fig. 3 Given two graphs A and B, the dotted addition A .+ B combines nodes and links from graph A and graph B. The non-dotted addition A + B combines graph A with links of graph B whose end nodes are already contained in graph A. Graph subtraction works similarly. Graph mathematic results depend on operand order; attribute merging and promotion are handled automatically as described in the main text but are not shown in this figure.

Mango system-defined graph attributes are appended to user-defined attributes. The system-defined attributes relate to the 3D visualization of a network and define properties such as node position, node color, or link width. Therefore, generating any 3D visualization is a matter of mapping user-defined information attributes to system-defined visualization attributes [12]. By dynamically changing these mappings, animations and simulations can be accomplished in Mango. A full listing of the visualization attributes is in the Mango User Guide.
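The import layout described earlier in this section (a node list, a single hyphen line, then a link list) can be illustrated with a small parser. This is a sketch in Python rather than Gel. It assumes comma-separated fields, with the node name first on node lines and the two end-node names first on link lines; the function name parse_graph_text and the sample data are hypothetical, and the authoritative format is the one given in the Mango User Guide.

```python
def parse_graph_text(text, sep=","):
    """Split a node-list / hyphen / link-list file into nodes and links.

    Assumed layout (hypothetical, not taken from the Mango User Guide):
    node lines are `name,attr1,...`, a line holding only `-` ends the
    node list, and link lines are `source,target,attr1,...`.
    """
    nodes, links = {}, {}
    in_links = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line == "-":
            in_links = True  # everything after the hyphen is a link
            continue
        fields = [field.strip() for field in line.split(sep)]
        if in_links:
            links[(fields[0], fields[1])] = fields[2:]
        else:
            nodes[fields[0]] = fields[1:]
    return nodes, links


example = """a,3
b,5
-
a,b,0.4
"""
nodes, links = parse_graph_text(example)
```

Attribute values are kept as strings here; converting them to the declared attribute types would be a separate step.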

Standards for combining heterogeneous graphs
When combining two or more graphs, much of the confusion stems from what will happen to the nodes and links. Since a graph contains both node and link sets, our formally defined dotted and non-dotted graph mathematic operators allow users to specify node-centric or link-centric operations precisely. Recall the two graphs G_A and G_B.
Merging nodes and links is represented by the dotted addition.
However, suppose that the user is only concerned with the nodes in G_A, such as a set of important genes, and merely wants to combine the new links between those genes from G_B. The non-dotted addition merges links from G_B only between nodes already in G_A.
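The difference between the two additions can be made concrete with plain sets. The following is a minimal Python sketch of the semantics as described in the text, not Mango's implementation; the function names and the two small example graphs are hypothetical, and attributes are ignored.

```python
def dotted_add(a, b):
    """A .+ B: union of the node sets and of the link sets."""
    nodes_a, links_a = a
    nodes_b, links_b = b
    return nodes_a | nodes_b, links_a | links_b


def add(a, b):
    """A + B: keep A's nodes, add B's links whose end nodes are both in A."""
    nodes_a, links_a = a
    _, links_b = b
    kept = {link for link in links_b if link <= nodes_a}
    return set(nodes_a), links_a | kept


# Undirected links are modelled as frozensets of two node names.
A = ({"a", "b", "d"}, {frozenset({"a", "b"})})
B = ({"b", "c", "d"}, {frozenset({"b", "d"}), frozenset({"c", "d"})})

union_nodes, union_links = dotted_add(A, B)  # gains node c and both links
same_nodes, extra_links = add(A, B)          # gains only the b-d link
```

In the non-dotted case the c-d link is dropped because node c is not in the left operand.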
In a similar fashion, dotted and non-dotted subtraction between two graphs are defined as follows.
Other operations such as producing intersections and bipartite graphs are defined as follows.
The above mathematics can be extended across multiple graphs to create unions. The graph operations can be mixed and matched to produce more complex results. Figure 3b demonstrates a few of the graph mathematics visually. When graphs are combined in mathematical operations, attributes from the two graphs might conflict. For example, the link between nodes b and d in G_A may have a weight attribute of 0.4 while the link between nodes b and d in G_B may have a weight attribute of 0.3. Gel handles attribute conflicts by giving preference to the left operand. During the operation G_A .+ G_B, the left operand G_A takes precedence and the resulting graph will have weight value 0.4. An exception to this rule is when the conflicting attributes in G_A are at their default values (default values can be defined by users). In those cases, the attributes of graph G_B are copied. This automatically merges useful non-default information from G_B into the resulting graph.
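The conflict rule reduces to a single comparison per attribute. A minimal Python sketch of the rule as stated above follows; merge_attr is a hypothetical name, and the default of 0.0 is assumed for the weight attribute.

```python
def merge_attr(left, right, default):
    """Keep the left operand's value unless it is still at its default."""
    return right if left == default else left


# Weight example from the text: 0.4 in G_A and 0.3 in G_B.
merged = merge_attr(0.4, 0.3, default=0.0)  # left operand wins
filled = merge_attr(0.0, 0.3, default=0.0)  # left is at default, right is copied
```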
When heterogeneous graphs are combined, their unique attributes can be selectively preserved. Recall that the nodes in G_A have attributes id and count while nodes in G_B have attributes id and tag.
Because nodes in G_B only share the id attribute with G_A, when G_B is added to G_A as in G_A .+ G_B, the count attribute of nodes copied from G_B is automatically set to the default value 0 but their tag attribute is ignored. To preserve both G_A and G_B attributes, users can define a new node type that includes all attributes. This is called attribute promotion. In our example, a new node type containing id, count and tag attributes is defined and used by the new G_C to receive all attributes from G_A and G_B.
However, simply writing G_C = G_A .+ G_B will not work, because the tag attribute from G_B is already lost when G_B is added to G_A, before the result is assigned to G_C. The correct steps to preserve graph attributes during heterogeneous graph mathematics are demonstrated below (Fig. 3a):

    node(string id, int count, string tag) ntC;
    link[float weight] ltC;
    graph(ntC, ltC) C=A; // copy id and count attributes from graph A
    C.+=B; // then merge with tag attributes from graph B

Flexible node and link type definitions, coupled with an intuitive set of attribute promotion and merging rules, ease the combination of heterogeneous graphs in Gel. Thus users can focus on graph level operations instead of attribute level selection, sorting, and merging.
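Why the order of operations matters can be seen in a small model of typed node records. The Python sketch below is an illustration under stated assumptions, not Mango's implementation: node types are modelled as attribute-name lists, the helper names and example values are hypothetical, and the defaults are the empty string and 0 mentioned in the text.

```python
DEFAULTS = {"id": "", "count": 0, "tag": ""}


def project(node, attrs):
    """Fit a node record to a node type: missing attributes get defaults,
    and attributes the type does not declare are dropped."""
    return {name: node.get(name, DEFAULTS[name]) for name in attrs}


def dotted_add_nodes(left, right, attrs):
    """Node part of `left .+ right`, where the result has node type `attrs`."""
    result = {nid: project(node, attrs) for nid, node in left.items()}
    for nid, node in right.items():
        incoming = project(node, attrs)
        if nid not in result:
            result[nid] = incoming
            continue
        for name in attrs:
            # The left value wins unless it is still at its default.
            if result[nid][name] == DEFAULTS[name]:
                result[nid][name] = incoming[name]
    return result


nt_a = ["id", "count"]         # node type of graph A
nt_c = ["id", "count", "tag"]  # promoted node type of graph C
A = {"a": {"id": "a", "count": 2}}
B = {"a": {"id": "a", "tag": "kinase"}, "c": {"id": "c", "tag": "tf"}}

lossy = dotted_add_nodes(A, B, nt_a)  # typed like A: tag is dropped
C = dotted_add_nodes(A, B, nt_c)      # like C=A; C.+=B: tag is kept
```

The promoted type has to be in place before the addition happens, which is what the two-step Gel sequence achieves.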
Many graph analyses require traversing all nodes and links to perform a calculation based on graph attributes or topology. Gel provides the select command to pull out a subgraph based on user-defined conditions. These conditions can be related to stored attribute values or topology properties. Gel also allows mapping or computing new attribute values across a graph on a per-node or per-link basis with the foreach command, which efficiently applies a set of user-defined calculations across all nodes or links that optionally meet certain conditions. The same command can also be used to tally attribute values across all nodes and links. The following demonstrates the two types of Gel commands:

    graph(nt,lt) hubs = select node from A where in+out>3;
    graph(nt,lt) thresh = select link from A where weight>0.2;
    foreach link in thresh where weight>1.0 set weight=1.0;
    foreach link in thresh set _r=weight, _g=weight, _b=weight;
    foreach node in hubs where type=="gene" set _radius=0.2+(in+out)/2.0,count++;

In addition to the data types, graph mathematics, automatic attribute handling, and traversal commands, Gel also provides commands for object modification, data examination, input and output, code execution, graph construction, and simulation. A growing set of built-in functions for mathematics, visualization control, graph layouts, and statistical reporting is also provided. To explore all Gel commands and functions, type the help command in Mango or consult the online User Guide.
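For readers more familiar with general-purpose languages, the select and foreach commands correspond roughly to filtering and mapping over link and node collections. The Python sketch below mirrors the first three Gel commands on a hypothetical four-link graph; it illustrates the idea only and does not reproduce how Mango evaluates these commands.

```python
# Hypothetical weighted links: (node, node) -> weight
links = {("a", "b"): 0.1, ("a", "c"): 0.5, ("a", "d"): 1.7, ("a", "e"): 0.9}

# select link ... where weight>0.2
thresh = {pair: w for pair, w in links.items() if w > 0.2}

# foreach link ... where weight>1.0 set weight=1.0
thresh = {pair: min(w, 1.0) for pair, w in thresh.items()}

# Node connectivity (in+out), counted over all links
degree = {}
for u, v in links:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

# select node ... where in+out>3
hubs = {name for name, d in degree.items() if d > 3}
```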
The Mango system and its Graph Exploration Language are data agnostic, meaning that any type of network can be loaded and analyzed; users have total control of node and link attribute definitions and their associations within Mango. Our goal is to make this software widely available to all researchers and promote its use in solving ever more complex biological research problems.

KEGG connect
The KEGG Connect dialog demonstrates how Mango can fetch network data directly from online biological databases. KEGG Connect queries the KEGG (Kyoto Encyclopedia of Genes and Genomes) database (http://www.genome.jp/kegg) and selectively downloads pathways grouped by organisms. Within the downloaded pathway, nodes maintain their 2-dimensional (2D) coordinates from the KEGG visualization. The nodes are colored red, blue, green and yellow representing pathway maps, compounds, genes, and orthologs respectively (Fig. 4). Multiple pathways can be downloaded either as individual networks or as one merged network. If multiple networks are merged, each pathway will be given a different z coordinate value, so the pathways are layered in 3D space. We intend to connect Mango to more biological databases soon.

Results and discussion
We present a few network analysis examples to illustrate the use of Mango in this section. Examples of comparing different types of biological networks and the scalability of Mango to large networks are provided.

Network data collection
Four large E. coli network data sets were collected. The corr network, with 4 M links, was computed using the WGCNA (weighted gene coexpression network analysis) package in R [13] on microarray data measuring the expression of 4454 E. coli genes in cells grown under 10 different conditions (GSE61736, [14]). For the path network, biological pathways of E. coli were downloaded from the KEGG database (http://www.genome.jp/kegg) and combined into a single pathway network. The go network was constructed using E. coli GO (gene ontology) information retrieved from the gene ontology website (http://geneontology.org/page/download-annotations); E. coli genes that share at least one GO term are linked. Finally, the protein-protein interaction (ppi) network was retrieved from the supplementary materials of a 2014 paper [15]. Sizes and attributes for the four large networks are summarized in Table 2.
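The rule used for the go network, linking genes that share at least one GO term, can be sketched as follows. This is a Python illustration of the rule only; the original construction script is not given in the text, and the gene names and term identifiers below are placeholders rather than real annotations.

```python
from itertools import combinations


def go_links(annotations):
    """Link every pair of genes that share at least one GO term."""
    genes_by_term = {}
    for gene, terms in annotations.items():
        for term in terms:
            genes_by_term.setdefault(term, set()).add(gene)
    links = set()
    for genes in genes_by_term.values():
        # Sorting gives each undirected pair a single canonical order.
        for u, v in combinations(sorted(genes), 2):
            links.add((u, v))
    return links


# Placeholder annotations (hypothetical genes and term identifiers)
annotations = {
    "geneA": {"GO:1", "GO:2"},
    "geneB": {"GO:2"},
    "geneC": {"GO:3"},
    "geneD": {"GO:1", "GO:2"},
}
links = go_links(annotations)
```

Grouping genes by term first avoids comparing every gene pair directly, although very common terms still produce many links.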

Large heterogeneous network comparison
For all networks, nodes are identified by gene names with no additional attributes, thus the following node type declaration can be shared among the networks:

    node(string name) nt;

All networks have undirected links but differ in their link attributes (the path network does not contain any link attributes), thus four link type declarations (corr_lt, path_lt, go_lt, and ppi_lt) are used to load the different networks. After the node and link type declarations, the corr, path, go, and ppi networks can be imported into Mango for all-to-all network comparisons:

    graph(nt,corr_lt) corr = import("wgcna.csv");
    graph(nt,path_lt) path = import("kegg.csv");
    graph(nt,go_lt) go = import("go.csv");
    graph(nt,ppi_lt) ppi = import("ppi.csv");

For the integration of the networks, a common link type including all available link attributes is declared. Once the networks are loaded into Mango, Gel mathematics allow network integration and comparisons. For example, the comparison of the corr and path networks is visualized in the top two panels in the left column of Fig. 1. The top middle panel in Fig. 1 is the result of a Gel intersect operation on these two networks. The corr-path intersection network contains 961 links with 1020 nodes. The all-to-all comparisons of these four networks were completed in Mango, and the common links among the networks are summarized in Fig. 5. All possible intersections among the four E. coli networks can be worked out with a few lines of Gel code each. Benchmarked times for different types of Gel mathematics between the large corr and path networks are listed in Table 3.
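Conceptually, a link intersection between two undirected networks keeps the links whose end-node pairs occur in both, regardless of the order in which the two names are written. A minimal Python sketch with hypothetical gene names follows; it illustrates the set operation and is not the Gel intersect command itself.

```python
def undirected(links):
    """Normalize (u, v) pairs so that (u, v) and (v, u) compare equal."""
    return {frozenset(pair) for pair in links}


# Hypothetical link lists standing in for two of the networks
corr = undirected([("g1", "g2"), ("g2", "g3"), ("g3", "g4")])
path = undirected([("g2", "g1"), ("g3", "g4"), ("g4", "g5")])

common = corr & path                 # links present in both networks
common_nodes = set().union(*common)  # end nodes of the shared links
```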

Flexible real-time network exploration and visualization
Over-plotting of nodes and links becomes more of a challenge as network sizes get bigger. For example, the corr and path networks and their combination can be visualized in Mango but provide limited biological interpretation (the left column of panels in Fig. 1). In this example, we continue to explore the intersection of the two networks by querying certain node and link attributes, imposing thresholds to reveal important features, and mapping these features to network visualization.
First we arrange all nodes in the intersection network along a circle in the x-y plane and map the node connectivity to their z-axis coordinates. Nodes are assigned random colors, and the colors of higher z-axis nodes are bled down the links to emphasize hubs. Nodes above a threshold are emphasized by increasing their radius and labeling them with gene names and connectivity. The resulting network layout, called a crown-plot, is shown in the top panel of the middle column of Fig. 1. The hub genes and their links can be pulled into a new sub-network. The sub-network, called hubs, is then flattened and spread out using a force-directed layout, which is built into the graph panel and started by right-clicking on the panel. The hub genes are raised one level. Genes that are not themselves hubs but connect two or more hubs are raised to a third level. The following Gel code accomplishes all of these steps except the force-directed layout:

    auto hubs = select link from intersect where in._radius>0.3 || out._radius>0.3;
    foreach node in hubs set _x=rand(-5,5),_y=rand(-5,5),_z=0;
    /* right click on graph to start and stop force-directed algorithm */
    foreach node in hubs where _radius>0.3 set _z=3;
    foreach node in hubs where _radius<0.3 && (in+out)>1 set _z=6;

The 3-layer hubs network is shown in the lower panel in the middle column of Fig. 1; it contains other genes on the bottom layer, hub genes on the middle layer, and in-betweener genes on the top layer.

Fig. 5 Biological network comparisons. Link intersections among the corr, path, go and ppi networks. The intersections were worked out using Gel commands. WGCNA is the gene-to-gene correlation network corr computed from E. coli microarray data. PPI is the protein-protein interaction network ppi of E. coli. GO is the network go that connects any two E. coli genes sharing at least one gene ontology term. KEGG is the entire KEGG biological pathway network path of E. coli.
It is worth mentioning that the in-betweener genes on layer 3 would have been obscured by other genes in a simple list of genes ordered by connectivity. We can further pull out the hubs and in-betweeners into another sub-network for closer inspection with the following Gel code:

    auto bipartite=select node from hubs where (in+out)>1;
    int i=-20;
    foreach node in bipartite where _radius>0.3 set _x=-10,_y=i, i++;
    i=-50;
    foreach node in bipartite where _radius<=0.3 set _x=10,_y=i, i++;
    foreach node in bipartite set _text=name;

This sub-network is laid out as a bipartite graph shown on the right panel in Fig. 1, with hubs on the left and the in-betweeners on the right. This example shows how to map informational attributes of a graph to its visual attributes using Mango. The resulting visual displays help the user decide threshold values, extract sub-networks of interest, and further explore the data.
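The crown-plot layout described at the start of this example places nodes evenly on a circle and uses connectivity as height. The coordinate computation can be sketched in Python as below. The function name, the circle radius, and the alphabetical node ordering are assumptions made for illustration; the text does not give the Gel code used for this step.

```python
import math


def crown_layout(degree, radius=10.0):
    """Place nodes evenly on a circle in the x-y plane, with z = connectivity."""
    names = sorted(degree)  # assumed ordering; any fixed order works
    count = len(names)
    coords = {}
    for k, name in enumerate(names):
        angle = 2.0 * math.pi * k / count
        coords[name] = (
            radius * math.cos(angle),
            radius * math.sin(angle),
            float(degree[name]),
        )
    return coords


# Hypothetical connectivity values
coords = crown_layout({"g1": 1, "g2": 4, "g3": 2, "g4": 1})
```

Highly connected nodes rise above the ring, which is what makes hubs stand out in this view.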

Microarray expression combined with KEGG biological pathways
E. coli gene expression under control and multiple treatment conditions was measured by microarrays (GSE61736, [14]). A subset of the data containing expression values for one control and one treatment condition was loaded into Mango and overlaid onto downloaded E. coli KEGG biological pathways. The expression data, E. coli KEGG pathways, and Gel script are available for download from https://github.com/j23414/Mango_Workshop.
The results of the visualization can be seen in Fig. 6. Genes are colored green or red when their expression levels are up or down relative to the control condition. KEGG pathway components that do not have mapped gene expression values are colored gray. Compounds are colored blue and are largely ignored, although they could be used to incorporate metabolomic concentration values. The Gel commands to color gene nodes are given below:

    foreach node in sum where tr2==control && type=="gene" set _r=0.2,_g=0.2,_b=0.2;
    foreach node in sum where tr2>control && type=="gene" set _r=0,_g=1,_b=0;
    foreach node in sum where tr2<control && type=="gene" set _r=1,_g=0,_b=0;

Fig. 6 Gene expression combined with KEGG. A 3D KEGG network visualization comparing the E. coli gene expression values obtained under a treatment condition and a control condition. In addition to coloring and resizing the genes (i.e., nodes) of the network based on expression changes relative to the control, pathway links are also highlighted in green or red depending on the up- or down-expressed genes they connect in a pathway. The highlighted links allow a whole pathway to be easily discerned as up- or down-regulated.

Beyond coloring nodes in a network, we are able to color the links and thereby highlight entire pathways that are up- or down-regulated. This is possible because KEGG pathways also contain gene-to-gene links, not just gene-to-compound links.
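The coloring logic can be summarized in a short Python sketch. The node rule follows the Gel commands above. The link rule is an assumption made for illustration, since the text does not state exactly how link colors are derived: here a link is highlighted only when both of its end genes change in the same direction. All names and expression values are hypothetical.

```python
GREEN, RED, GRAY = (0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.2, 0.2, 0.2)


def gene_color(treatment, control):
    """Green if up relative to control, red if down, gray if unchanged."""
    if treatment > control:
        return GREEN
    if treatment < control:
        return RED
    return GRAY


def link_color(color_u, color_v):
    """Assumed rule: highlight a link only when both end genes agree."""
    if color_u == color_v and color_u != GRAY:
        return color_u
    return GRAY


# Hypothetical expression values: gene -> (treatment, control)
expression = {"g1": (2.0, 1.0), "g2": (3.5, 1.5), "g3": (0.4, 1.0)}
node_colors = {g: gene_color(t, c) for g, (t, c) in expression.items()}

up_link = link_color(node_colors["g1"], node_colors["g2"])     # both up
mixed_link = link_color(node_colors["g1"], node_colors["g3"])  # up and down
```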
Mango networks are saved natively as Gel commands, so running the saved code recreates the original graphs in Mango:

    save "sum.txt",sum;
    clear; // clears all data objects
    run "sum.txt"; // reloads the sum network

In addition, the networks can be exported to tabular data using the export command. The tabular data can then be read by many other software programs, e.g., Excel, R, Matlab, Cytoscape, and other graph software or databases. Full descriptions of the interoperability and other features of Mango are available in the User Guide.

Conclusion
We have developed a powerful new program Mango for multi-network analysis and visualization. Mango enables scientists to test hypotheses on large heterogeneous networks, identify crucial features, and extract analysis results all within its integrated environment. Compared with existing programs, Mango extends the capability and convenience of large heterogeneous data analysis on a personal computer.
The Mango system was designed to be data agnostic, meaning that any type of network data can be loaded and analyzed; users have total control over node and link attribute definitions and their associations within Mango. Mango can load networks with millions of links, integrate and explore large amounts of data following Gel commands, and help users deduce predictions or outcomes that can be validated in labs. It is our goal to make this software widely available to all researchers to promote its use in solving ever more complex biological research problems. As Mango developers, we will continue to provide support and further develop the software according to user needs.

Availability and requirements
• Project name: Mango 1.24.
• Any restriction to use by non-academics: Specific restrictions included with each distribution and license agreement.