- Software article
- Open Access
- Open Peer Review
PathCORE-T: identifying and visualizing globally co-occurring pathways in large transcriptomic compendia
BioData Mining volume 11, Article number: 14 (2018)
Investigators often interpret genome-wide data by analyzing the expression levels of genes within pathways. While this within-pathway analysis is routine, the products of any one pathway can affect the activity of other pathways. Past efforts to identify relationships between biological processes have evaluated overlap in knowledge bases or evaluated changes that occur after specific treatments. Individual experiments can highlight condition-specific pathway-pathway relationships; however, constructing a complete network of such relationships across many conditions requires analyzing results from many studies.
We developed PathCORE-T framework by implementing existing methods to identify pathway-pathway transcriptional relationships evident across a broad data compendium. PathCORE-T is applied to the output of feature construction algorithms; it identifies pairs of pathways observed in features more than expected by chance as functionally co-occurring. We demonstrate PathCORE-T by analyzing an existing eADAGE model of a microbial compendium and building and analyzing NMF features from the TCGA dataset of 33 cancer types. The PathCORE-T framework includes a demonstration web interface, with source code, that users can launch to (1) visualize the network and (2) review the expression levels of associated genes in the original data. PathCORE-T creates and displays the network of globally co-occurring pathways based on features observed in a machine learning analysis of gene expression data.
The PathCORE-T framework identifies transcriptionally co-occurring pathways from the results of unsupervised analysis of gene expression data and visualizes the relationships between pathways as a network. PathCORE-T recapitulated previously described pathway-pathway relationships and suggested experimentally testable additional hypotheses that remain to be explored.
The number of publicly available genome-wide datasets is growing rapidly . High-throughput sequencing technologies that measure gene expression quickly with high accuracy and low cost continue to enable this growth . Expanding public data repositories [3, 4] have laid the foundation for computational methods that consider entire compendia of gene expression data to extract biological patterns . These patterns may be difficult to detect in measurements from a single experiment. Unsupervised approaches, which identify important signals in the data without being constrained to previously-described patterns, may discover new expression modules and thus will complement supervised methods, particularly for exploratory analyses [6, 7].
Feature extraction methods are a class of unsupervised algorithms that can reveal unannotated biological processes from genomic data . Each feature can be defined by a subset of influential genes, and these genes suggest the biological or technical pattern captured by the feature. These features, like pathways, are often considered individually [7, 8]. When examined in the context of knowledgebases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) , most features are significantly enriched for more than one biological gene set . In this work, we refer to such a gene set by the colloquial term, pathway. It follows then that such features can be described by sets of functionally related pathways. We introduce the PathCORE-T (identifying pathway co-occurrence relationships in transcriptomic data) software, which implements existing methods that jointly consider features and gene sets to map pathways with shared transcriptional responses.
PathCORE-T offers a data-driven approach for identifying and visualizing transcriptional pathway-pathway relationships. In this case, relationships are drawn based on the sets of pathways, annotated in a resource of gene sets, occurring within constructed features. Because PathCORE-T starts from a feature extraction model, the number of samples in the compendium used for model generation and the fraction of samples needed to observe a specific biological or technical pattern is expected to vary by feature extraction method. Pathways must be perturbed in a sufficient fraction of experiments in the data compendium to be captured by any such method. To avoid discovering relationships between pathways that share many genes—which could more easily be discovered by directly comparing pathway membership—we implement an optional pre-processing step that corrects for genes shared between gene sets, which Donato et al. refer to as pathway crosstalk . Donato et al.’s correction method, maximum impact estimation, has not previously been implemented in open source software. We have released our implementation of maximum impact estimation as its own Python package (PyPI package name: crosstalk-correction) so that it can be used independently of PathCORE-T. Applying this correction in PathCORE-T software allows a user to examine relationships between gene sets based on how genes are expressed as opposed to which genes are shared.
We apply PathCORE-T to a microbial and a cancer expression dataset, each analyzed using different feature extraction methods, to demonstrate its broad applicability. For the microbial analysis, we created a network of KEGG pathways from recently described ensemble Analysis using Denoising Autoencoders for Gene Expression (eADAGE) models trained on a compendium of Pseudomonas aeruginosa (P. aeruginosa) gene expression data (doi:https://doi.org/10.5281/zenodo.583694) . We provide a live demo of the PathCORE-T web application for this network: users can click on edges in the network to review the expression levels of associated genes in the original compendium (https://pathcore-demo.herokuapp.com/PAO1). To show its use outside of the microbial space, we also demonstrate PathCORE-T analysis of Pathway Interaction Database (PID)-annotated  non-negative matrix factorization (NMF) features [12, 13] extracted from The Cancer Genome Atlas’s (TCGA) pan-cancer dataset of 33 different tumor types (doi:https://doi.org/10.5281/zenodo.56735) .
In addition to visualizing the results of these two applications, the PathCORE-T web interface (https://pathcore-demo.herokuapp.com/) links to the documentation and source code for our implementation and example usage of PathCORE-T. Methods implemented in PathCORE-T are written in Python and pip-installable (PyPI package name: PathCORE-T). Examples of how to use these methods are provided in the PathCORE-T analysis repository (https://github.com/greenelab/PathCORE-T-analysis). In addition to scripts that reproduce the eADAGE and NMF analyses described in this paper, the PathCORE-T-analysis repository includes a Jupyter notebook (https://goo.gl/VuzN12) with step-by-step descriptions for the complete PathCORE-T framework.
Our approach diverges from other algorithms that we identified in the literature in its intent: PathCORE-T finds pathway pairs within a biological system that are overrepresented in features constructed from diverse transcriptomic data. This complements other work that developed models specific to a single condition or disease. Approaches designed to capture pathway-pathway interactions from gene expression experiments for disease-specific, case-control studies have been published [15, 16]. For example, Pham et al. developed Latent Pathway Identification Analysis to find pathways that exert latent influences on transcriptionally altered genes . Under this approach, the transcriptional response profiles for a binary condition (disease/normal), in conjunction with pathways specified in the KEGG and functions in Gene Ontology (GO) , are used to construct a pathway-pathway network where key pathways are identified by their network centrality scores . Similarly, Pan et al. measured the betweenness centrality of pathways in disease-specific genetic interaction and coexpression networks to identify those most likely to be associated with bladder cancer risk . These methods captured pathway relationships associated with a particular disease state.
Global networks identify relationships between pathways that are not disease- or condition-specific. One such network, detailed by Li et al., relied on publicly available protein interaction data to determine pathway-pathway interactions . Two pathways were connected in the network if the number of protein interactions between the pair was significant with respect to the computed background distribution. Such approaches rely on databases of interactions, though the interactions identified can be subsequently used for pathway-centric analyses of transcriptomic data [20, 21]. Pita-Juárez et al. created the Pathway Coexpression Network (PCxN) as a tool to discover pathways correlated with a pathway of interest . They estimated correlations between pathways based on the expression of their underlying genes (as annotated in MSigDB) across a curated compendium of microarray data . Software like PathCORE-T that generates global networks of pathway relationships from unsupervised feature analysis models built using transcriptomics data has not yet been published.
The intention of PathCORE-T is to work from transcriptomic data in ways that do not give undue preference to combinations of pathways that share genes. Other methods have sought to consider shared genes between gene sets, protein-protein interactions, or other curated knowledgebases to define pathway-pathway interactions [20, 21, 23,24,25]. For example, Glass and Girvan described another network structure that relates functional terms in GO based on shared gene annotations . In contrast with this approach, PathCORE-T specifically removes gene overlap in pathway definitions before they are used to build a network. Our software reports pathway-pathway connections overrepresented in gene expression patterns extracted from a large transcriptomic compendium while controlling for the fact that some pathways share genes.
PathCORE-T identifies functional links between known pathways from the output of feature construction methods applied to gene expression data (Fig. 1a, b). The result is a network of pathway co-occurrence relationships that represents the grouping of biological gene sets within those features. We correct for gene overlap in the pathway annotations to avoid identifying co-occurrence relationships driven by shared genes. Additionally, PathCORE-T implements a permutation test for evaluating and removing edges—pathway-pathway relationships—in the resulting network that cannot be distinguished from a null model of random associations. Though we refer to the relationships in the network as co-occurrences, it is important to note that the final network displays co-occurrences that have been filtered based on this permutation test (Fig. 1c).
Our software is written in Python and pip-installable (PyPI package name: PathCORE-T), and examples of how to use the methods in PathCORE-T are provided in the PathCORE-T-analysis repository (https://github.com/greenelab/PathCORE-T-analysis). We recommend that those interested in using the PathCORE-T software consult the documentation and scripts in PathCORE-T-analysis. Each of the functions in PathCORE-T that we describe here can be used independently; however, we expect most users to employ the complete approach for interpreting pathways shared in extracted features (Fig. 1).
PathCORE-T requires the following inputs:
A weight matrix that connects each gene to each feature. We expect that this results from the application of a feature construction algorithm to a compendium of gene expression data. The primary requirements are that features must contain the full set of genes in the compendium and genes must have been assigned weights that quantify their contribution to a given feature. Accordingly, a weight matrix will have the dimensions n x k, where n is the number of genes in the compendium and k is the number of features constructed. In principal component analysis (PCA), this is the loadings matrix ; in independent component analysis (ICA), it is the unmixing matrix ; in ADAGE or eADAGE it is termed the weight matrix [5, 7]; in NMF it is the matrix W, where the NMF approximation of the input dataset A is A ~ WH . In addition to the scripts we provide for the eADAGE and NMF examples in the PathCORE-T analysis repository, we include a Jupyter notebook (https://goo.gl/VuzN12) that demonstrates how a weight matrix can be constructed by applying ICA to the P. aeruginosa gene compendium.
Gene signature rule(s). To construct a pathway co-occurrence network, the weight matrix must be processed into gene signatures by applying threshold(s) to the gene weights in each feature—we refer to these as gene signature rules. Subsequent pathway overrepresentation will be determined by the set of genes that makes up a feature’s gene signature. These are often the weights at the extremes of the distribution. How gene weights are distributed will depend on the user’s selected feature construction algorithm; because of this, a user must specify criterion for including a gene in a gene signature. PathCORE-T permits rules for a single gene signature or both a positive and a negative gene signature. The use of 2 signatures may be appropriate when the feature construction algorithm produces positive and negative weights, the extremes of which both characterize a feature (e.g. PCA, ICA, ADAGE or eADAGE). Because a feature can have more than one gene signature, we maintain a distinction between a feature and a feature’s gene signature(s).
A list of pathway definitions, where each pathway is defined by a set of genes (e.g. KEGG pathways, PID pathways, GO biological processes). We provide the files for the P. aeruginosa KEGG pathway definitions and the Nature-NCI PID pathway definitions in the PathCORE-T analysis repository (https://github.com/greenelab/PathCORE-T-analysis/tree/master/data).
Weight matrix construction and signature definition
In practice, users can obtain a weight matrix from many different methods. For the purposes of this paper, we demonstrate generality by constructing weight matrices via eADAGE and NMF.
eADAGE is an unsupervised feature construction algorithm developed by Tan et al.  that uses an ensemble of neural networks (an ensemble of ADAGE models) to capture biological patterns embedded in the expression compendium. We use models from Tan et al. . In that work, Tan et al. evaluated multiple eADAGE model sizes to identify that k = 300 features was an appropriate size for the current P. aeruginosa compendium. The authors also compared eADAGE to two other commonly used feature construction approaches, PCA and ICA . Tan et al. produced 10 eADAGE models that each extracted k = 300 features from the compendium of genome-scale P. aeruginosa data. Because PathCORE-T supports the aggregation of co-occurrence networks created from different models on the same input data, we use all 10 of these models in the PathCORE-T analysis of eADAGE models (doi:https://doi.org/10.5281/zenodo.583172).
Tan et al. refers to the features constructed by eADAGE as nodes. They are represented as a weight matrix of size n x k, where n genes in the compendium are assigned positive or negative gene weights, according to a standard normal distribution, for each of the k features. Tan et al. determined that the gene sets contributing the highest positive or highest negative weights (+/− 2.5 standard deviations) to a feature described gene expression patterns across the compendium, and thus referred to the gene sets as signatures. Because a feature’s positive and negative gene signatures did not necessarily correspond to the same biological processes or functions, Tan et al. analyzed each of these sets separately . Tan et al.’s gene signature rules are specified as an input to the PathCORE-T analysis as well.
We also constructed an NMF model for the TCGA pan-cancer dataset. Given an NMF approximation of A ~ WH , where A is the input expression dataset of size n x s (n genes by s samples), NMF aims to find the optimal reconstruction of A by WH such that W clusters on samples (size n x k) and H clusters on genes (size k x s). In order to match the number of features constructed in each eADAGE model by Tan et al., we set k, the desired number of features, to be 300 and used W as the input weight matrix for the PathCORE-T software. We found that the gene weight distribution of an NMF feature is right-skewed and (as the name suggests) non-negative (Additional file 1 Figure S1). In this case, we defined a single gene signature rule: an NMF feature’s gene signature is the set of genes with weights 2.0 standard deviations above the mean weight of the feature.
The selection of k = 300 for the NMF model allowed us to make the eADAGE-based and NMF-based case studies roughly parallel. We verified that 300 components was appropriate by evaluating the percentage of variance explained by PCA applied to the TCGA dataset. In general, the principal components explained very little variance—the first principal component only explained 11% of the variance. At 300 components, the proportion of variance explained was 81%.
As an additional analysis, we determined the number of components (k = 24) where each additional component explained less than 0.5% of the variance. We found that using a very small number of constructed features resulted in a substantial loss of power: PathCORE-T analysis with a single k = 24 model yielded no significant edges after permutation test. However, PathCORE-T can be applied over multiple models as long as the feature construction method produces different solutions depending on random seed initialization. We performed 10 factorizations to generate an aggregate of 10 k = 24-feature NMF models and found that the resulting co-occurrence network was denser (364 edges) than our k = 300 factor network (119 edges). 65 edges were found in both networks. These shared edges had higher weights, on average, in both networks compared to edges unique to each network (https://goo.gl/vnDVNA).
Construction of a pathway co-occurrence network
We employ a Fisher’s exact test  to determine the pathways significantly associated with each gene signature. When considering significance of a particular pathway, the two categories of gene classification are as follows: (1) presence or absence of the gene in the gene signature and (2) presence or absence of the gene in the pathway definition. For each pathway in the input list of pathway definitions, we specify a contingency table and calculate its p-value, which is corrected using the Benjamini—Hochberg  procedure to produce a feature-wise false discovery rate (FDR). A pathway with an adjusted p-value that is less than the user-settable FDR significance cutoff, alpha (default: 0.05), is considered significantly enriched in a given gene signature. This cutoff value should be selected to most aid user interpretation of the model. The next step of PathCORE-T is to convert pathway-node relationships into pathway-pathway relationships. For this, we apply a subsequent permutation test over pathway-pathway edge weights that accounts for the frequency at which pathways are observed as associated with features. This permutation produces a p-value for each edge. Two pathways co-occur, or share an edge in the pathway co-occurrence network, if they are both overrepresented in a gene signature. The weight of each edge in the pathway-pathway graph corresponds to number of times such a pathway pair is present over all gene signatures in a model (Fig. 2a).
The network that results from the preceding method is densely connected, and many edges may be spurious. To remove correlations that cannot be distinguished from random pathway associations, we define a statistical test that determines whether a pathway-pathway relationship appearing x times in a k-feature model is unexpected under the null hypothesis—the null hypothesis being that the relationship does not appear more often than it would in a random network. We create N weighted null networks, where each null network is constructed by permuting overrepresented pathways across the model’s gene signatures while preserving the number of pathways for which each gene signature is enriched (Fig. 2b). N is a user-settable parameter: the example PathCORE-T analyses we provide specify an N of 10,000. Increasing the value of N leads to more precise p-values, particularly for low p-values, but comes at the expense of additional computation time.
In the case where we have positive and negative gene signatures, overrepresentation can be positive or negative. Because certain pathways may display bias toward one side—for example, a pathway may be overrepresented more often in features’ positive gene signatures—we perform the permutation separately for each side. The N random networks produce the background weight distribution for every observed edge; significance can then be assessed by comparing the true (observed) edge weight against the null. The p-value for each edge e is calculated by summing the number of times a random network contained e at a weight greater than or equal to its observed edge weight and dividing this value by N. Following Benjamini—Hochberg FDR correction by the number of edges in the observed network, pathway-pathway relationships with adjusted p-values above alpha (user-settable default: 0.05) are removed from the network of co-occurring pathways (Fig. 2c). The threshold alpha value is a configurable parameter, and the user should select an FDR that best balances the costs and consequences of false positives. For highly exploratory analyses in which it may be helpful to have more speculative edges, this value can be raised. For analyses that require particularly stringent control, it can be lowered.
Because the expected weight of every edge can be determined from the N random networks (by taking the sum of the background weight distribution for an edge and dividing it by N), we can divide each observed edge weight by its expected weight (dividing by 1 if the expected edge weight is 0 based on the N permutations) to get the edge’s odds ratio. Edges in the final network are weighted by their odds ratios.
Gene overlap correction
Pathways can co-occur because of shared genes (Fig. 3a, b, d). Though some approaches use the overlap of genes to identify connected pathways, we sought to capture pairs of pathways that persisted even when this overlap was removed. The phenomenon of observing enrichment of multiple pathways due to gene overlap has been previously termed as “crosstalk,” and Donato et al. have developed a method to correct for it . Due to confusion around the term, we refer to this as overlapping genes in this work, except where specifically referencing Donato et al. Their approach, called maximum impact estimation, begins with a membership matrix indicating the original assignment of multiple genes to multiple pathways. It uses expectation maximization to estimate the pathway in which a gene contributes its greatest predicted impact (its maximum impact) and assigns the gene only to this pathway . This provides a set of new pathway definitions that no longer share genes (Fig. 3c, e).
There was no existing open source implementation of this algorithm, so we implemented Donato et al.’s maximum impact estimation as a Python package (PyPI package name: crosstalk-correction). This software is separate from PathCORE-T because we expect that it may be useful in its own right for other analytical workflows, such as differential expression analysis. The procedure is written using NumPy functions and data structures, which allows for efficient implementation of array and matrix operations in Python .
In PathCORE-T, we used this software to resolve overlapping genes before pathway overrepresentation analysis. Overlap correction is applied to each feature of the model independently. This most closely matches the setting evaluated by the original authors of the method. Work on methods that resolves overlap by using information shared across features may provide opportunities for future enhancements but was deemed to be out of the scope of a software contribution.
With this step, the pathway co-occurrence network identifies relationships that are not driven simply by the same genes being annotated to multiple pathways. Without this correction step, it is difficult to determine whether a co-occurrence relationship can be attributed to the features extracted from expression data or gene overlap in the two pathway annotations. We incorporate this correction into the PathCORE-T workflow by default; however, users interested in using PathCORE-T to find connections between overlapping gene sets can choose to disable the correction step.
PathCORE-T network visualization and support for experimental follow-up
The PathCORE-T analysis workflow outputs a list of pathway-pathway relationships, or edges in a network visualization, as a text file. An example of the KEGG P. aeruginosa edges file is available for download on the demo application: http://pathcore-demo.herokuapp.com/quickview. While we chose to represent pathway-pathway relationships as a network, users can use this file output to visualize the identified relationships as an adjacency matrix or in any other format they choose.
As an optional step, users can set up a Flask application for each PathCORE-T network. Metadata gathered from the analysis are saved to TSV files, and we use a script to populate collections in a MongoDB database with this information. The co-occurrence network is rendered using the D3.js force-directed graph layout . Users can select a pathway-pathway relationship in the network to view a new page containing details about the genes annotated to one or both pathways (Fig. 4a).
We created a web interface for deeper examination of relationships present in the pathway co-occurrence network. The details we included in an edge-specific page (1) highlight up to twenty genes—annotated to either of the two pathways in the edge—contained in features that also contain this edge, after controlling for the total number of features that contain each gene, and (2) display the expression levels of these genes in each of the fifteen samples where they were most and least expressed. The quantity of information (twenty genes, thirty samples total) we choose to include in an edge page is intentionally limited so that users can review it in a reasonable amount of time.
To implement the functionality in (1), we computed an odds ratio for every gene annotated to one or both pathways in the edge. The odds ratio measures how often we observe a feature enriched for both the given gene and the edge of interest relative to how often we would expect to see this occurrence. We calculate the proportion of observed cases and divide by the expected proportion--equivalent to the frequency of the edge appearing in the model’s features.
Let k be the number of features from which the PathCORE-T network was built. kG is the number of features that contain gene G (i.e. G is in kG features’ gene signatures), kE the number of features that contain edge E (i.e. the two pathways connected by E are overrepresented in kE features), and kG & E the number of features that contain both gene G and edge E. The odds ratio is computed as follows:
An odds ratio above 1 suggests that the gene is more likely to appear in features enriched for this pair of pathways. In the web interface, we sort the genes by their odds ratio to highlight genes most observed with the co-occurrence relationship.
The information specified in (2) requires an “expression score” for every sample. A sample expression score is calculated using the genes we selected in goal (1): it is the average of the normalized gene expression values weighted by the normalized gene odds ratio. Selection of the most and least expressed samples is based on these scores. We use two heatmaps to show the (maximum of twenty) genes’ expression values in each of the fifteen most and least expressed samples (Fig. 4b).
For each sample in an edge page, a user can examine how the expression values of the edge’s twenty genes in that sample compare to those recorded for all other samples in the dataset that are from the same experiment (Fig. 4c). Genes that display distinct expression patterns under a specific setting may be good candidates for follow-up studies.
Unsupervised methods can identify previously undiscovered patterns in large collections of data. PathCORE-T overlays curated knowledge after feature construction to help researchers interpret constructed features in the context of existing knowledgebases. Specifically, PathCORE-T aims to clarify how expert-annotated gene sets work together from a gene expression perspective. PathCORE-T starts from an unsupervised feature construction model. Before applying the software, users should evaluate models to make sure that they capture biological features in their dataset. Model evaluation can be performed in numerous ways depending on the setting and potential assessments include consistency across biological replicates, reconstruction error given a fixed dimensionality, and independent validation experiments. Tan et al. described several ways that models could be evaluated . Datasets will vary in terms of their amenability to analysis by different model-building strategies, and researchers may wish to consult a recent review for more discussion of feature construction methods .
We implemented the methods contained in the PathCORE-T software in Python. The implementations of the primary steps are pip-installable (PyPI package name: PathCORE-T), and examples of how to use the methods in PathCORE-T are provided in the PathCORE-T-analysis repository (https://github.com/greenelab/PathCORE-T-analysis). We also implemented an optional step, which corrects for overlapping genes between pathway definitions, described by Donato et al. . Though the algorithm had been described, no publicly available implementation existed. We provide this overlap correction algorithm as a Python package (PyPI package name: crosstalk-correction) available under the BSD 3-clause license. Each component of PathCORE-T can be used independently of each other (Fig. 1c).
Here, we present analyses that can be produced by applying the full PathCORE-T pipeline to models created from a transcriptomic compendium by an unsupervised feature construction algorithm. Input pathway definitions are “overlap-corrected” for each feature before enrichment analysis. An overlap-corrected, weighted pathway co-occurrence network is built by connecting the pairs of pathways that are overrepresented in features of the model. Finally, we remove edges that cannot be distinguished from a null model of random associations based on the results of a permutation test.
Case study: P. aeruginosa eADAGE models annotated with KEGG pathways
We used PathCORE-T to create a network of co-occurring pathways out of the expression signatures extracted by eADAGE from a P. aeruginosa compendium . For every feature, overlap correction was applied to the P. aeruginosa KEGG pathway annotations and overlap-corrected annotations were used in the overrepresentation analysis. PathCORE-T aggregates multiple networks by taking the union of the edges across all networks and summing the weights of common pathway-pathway connections. We do this to emphasize the co-occurrence relationships that are more stable —that is, the relationships that appear across multiple models. Finally, we removed edges in the aggregate network that were not significant after FDR correction when compared to the background distributions generated from 10,000 permutations of the network. Used in this way, PathCORE-T software allowed for exploratory analysis of an existing well-validated model.
The eADAGE co-occurrence network that resulted from our exploratory analysis contained a number of pathway-pathway relationships that have been previously characterized by other means (Fig. 5). Three glucose catabolism processes co-occur in the network: glycolysis, pentose phosphate, and the Entner-Doudoroff pathway (Fig. 5a). We also found a cluster relating organophosphate and inorganic phosphate transport- and metabolism-related processes (Fig. 5b). Notably, phosphate uptake and acquisition genes were directly connected to the hxc genes that encode a type II secretion system. Studies in P. aeruginosa clinical isolates demonstrated that the Hxc secretion system was responsible for the secretion of alkaline phosphatases, which are phosphate scavenging enzymes [35, 36] and the phosphate binding DING protein . Furthermore, alkaline phosphatases, DING and the hxc genes are regulated by the transcription factor PhoB which is most active in response to phosphate limitation. The identification of this relationship by PathCORE-T as a global pattern suggested the role of type II secretion and phosphate limitation seen in a limited number of isolates may be generalizable to broader P. aeruginosa biology. As shown in Fig. 5c, we also identified linkages between two pathways involved in the catabolism of sulfur-containing molecules, taurine and methionine, and the general sulfur metabolism process. Other connections between pathways involved in the transport of iron (ferrienterobactin binding)  and zinc (the znu uptake system ) were identified (Fig. 5d). Interestingly, genes identified in the edge between the zinc transport and MacAB-TolC pathways included the pvd genes involved in pyoverdine biosynthesis and regulation, a putative periplasmic metal binding protein, as well as other components of an ABC transporter (genes PA2407, PA2408, and PA2409 at https://goo.gl/bfqOk8) . PathCORE-T suggested a relationship between zinc and iron pathways in P. aeruginosa transcriptional data though such a relationship has not yet been described. Structural analysis of the iron-responsive regulator Fur found that it also productively binds zinc in E. coli and Bacillus subtilis providing a mechanism by which these pathways may be linked [41, 42].
The network constructed using the PathCORE-T framework had 203 edges between 89 pathways. For comparison, we constructed a KEGG pathway-pathway network where edges were drawn between pathways with significant gene overlap (FDR-corrected hypergeometric test < 0.05). The overlap-based network had 406 edges between 158 pathways. Only 35 of the edges in the PathCORE-T network were between pathways that shared genes, with an average Jaccard Index of only 0.035. The network constructed using PathCORE-T (with overlap-correction applied by default) captured pathway co-occurrences not driven by shared genes between pathways.
Case study: TCGA’s pan-cancer compendium analyzed by NMF with PID pathways
PathCORE-T is not specific to a certain dataset, organism, or feature construction method. We constructed a 300-feature NMF model of TCGA pan-cancer gene expression data, which is comprised of 33 different cancer-types from various organ sites and applied the PathCORE-T software to those features. We chose NMF because it has been used in previous studies to identify biologically relevant patterns in transcriptomic data  and by many studies to derive molecular subtypes [43,44,45]. The 300 NMF features were analyzed using overlap-corrected PID pathways, a collection of 196 human cell signaling pathways with a particular focus on processes relevant to cancer .
PathCORE-T detected modules of co-occurring pathways that were consistent with our current understanding of cancer-related interactions (Fig. 6). Because cancer-relevant pathways were used, it was not surprising that cancer-relevant pathways appeared. However, the edges between those pathways were also encouraging. For example, a module composed of a FoxM1 transcription factor network, an E2F transcription factor network, Aurora B kinase signaling, ATR signaling, PLK1 signaling, and members of the Fanconi anemia DNA damage response pathway was densely connected (Fig. 6a). The connections in this module recapitulated well known cancer hallmarks including cellular proliferation pathways and markers of genome instability, such as the activation of DNA damage response pathways . We found that several pairwise pathway co-occurrences corresponded with previously reported pathway-pathway interactions [47,48,49]. We also observed a hub of pathways interacting with Wnt signaling, among them the regulation of nuclear Beta-catenin signaling, FGF signaling, and BMP signaling (Fig. 6b). The Wnt and BMP pathways are functionally integrated in biological processes that contribute to cancer progression . Additionally, Wnt/Beta-catenin signaling is a well-studied regulatory system, and the effects of mutations in Wnt pathway components on this system have been linked to tumorigenesis . Wnt/Beta-catenin and FGF together influence the directional migration of cancer cell clusters .
Two modules in the network related to angiogenesis, or the formation of new blood vessels (Fig. 6c, d). Tumors transmit signals that stimulate angiogenesis because a blood supply provides the necessary oxygen and nutrients for their growth and proliferation. One module relates angiogenesis factors to cell proliferation. This module connected the following pathways: PDGFR-beta signaling, FAK-mediated signaling events, VEGFR1 and VEGFR2-mediated signaling events, nuclear SMAD2/3 signaling regulation, and RB1 regulation (Fig. 6c). These functional connections are known to be involved in tumor proliferation [53,54,55]. The other module indicated a relationship by which the VEGF pathway interacts with the S1P3 pathway through Beta3 integrins (Fig. 6d). S1P3 is a known regulator of angiogenesis , and has been demonstrated to be associated with treatment-resistant breast cancer and poor survival . Moreover, this interaction module has been observed to promote endothelial cell migration in human umbilical veins . Taken together, this independent module may suggest a distinct angiogenesis process activated in more aggressive and metastatic tumors that is disrupted and regulated by alternative mechanisms .
Finally, PathCORE-T revealed a large, densely connected module of immune related pathways (Fig. 6e). We found that this module contains many co-occurrence relationships that align with immune system processes. One such example is the well characterized interaction cycle formed by T-Cell Receptor (TCR) signaling in naïve CD4+ T cells and IL-12/IL-4 mediated signaling events [60,61,62]. At the same time, PathCORE-T identifies additional immune-related relationships. We observed a cycle between the three transcription factor networks: ATF-2, AP-1, and CaN-regulated NFAT-dependent transcription. These pathways can take on different, often opposing, functions depending on the tissue and subcellular context. For example, ATF-2 can be an oncogene in one context (e.g. melanoma) and a tumor suppressor in another (e.g. breast cancer) . AP-1, comprised of Jun/Fos proteins, is associated with both tumorigenesis and tumor suppression due to its roles in cell survival, proliferation, and cell death . Moreover, NFAT in complex with AP-1 regulates immune cell differentiation, but dysregulation of NFAT signaling can lead to malignant growth and tumor metastasis . The functional association observed between the ATF-2, AP-1, and NFAT cycle together within the immunity module might suggest that dysregulation within this cycle has profound consequences for immune cell processes and may trigger variable oncogenic processes.
Just as we did for the eADAGE-based P. aeruginosa KEGG pathways case study, we constructed a network only from PID pathways with significant gene overlap. The network constructed using PathCORE-T and NMF features had 119 edges between 57 pathways. The overlap-based network was much denser: it had 3826 edges between 196 pathways. This suggested a high degree of overlap between PID pathways. For the PathCORE-T NMF co-occurrence network, 96 of the edges were between pathways that shared genes. However, the average Jaccard Index for these pathway pairs remained low, at 0.058.
Unsupervised analyses of genome-scale datasets that summarize key patterns in the data have the potential to improve our understanding of how a biological system operates via complex interactions between molecular processes. Feature construction algorithms capture coordinated changes in the expression of many genes as features. The genes that contribute most to each feature co-vary. However, interpreting the features generated by unsupervised approaches remains challenging. PathCORE-T creates a network of globally co-occurring pathways based on features created from the analysis of a compendium of gene expression data. Networks modeling the relationships between curated processes in a biological system offer a means for developing new hypotheses about which pathways influence each other and when. Our framework provides a data-driven characterization of the biological system at the pathway-level by identifying pairs of pathways that are overrepresented across many features.
PathCORE-T connects the features extracted from data to curated resources. It is important to note that PathCORE-T will only be able to identify pathways that occur in features of the underlying model, which means these pathways must be transcriptionally perturbed in at least some subset of the compendium. Models should be evaluated before analysis with PathCORE-T. The network resulting from PathCORE-T can help to identify global pathway-pathway relationships—a baseline network—that complements existing work to identify interactions between pathways in the context of a specific disease. The specific niche that PathCORE-T framework aims to fill is in revealing to researchers which gene sets are most closely related to each other in machine learning-based models of gene expression, which genes play a role in this co-occurrence, and which conditions drive this relationship.
Availability and requirements
Project name: PathCORE-T
Project home page: https://pathcore-demo.herokuapp.com
Archived version: https://github.com/greenelab/PathCORE-T-analysis/releases/tag/v1.2.0 (links to download .zip and .tar.gz files are provided here)
Operating system: Platform independent
Programming language: Python
Other requirements: Python 3 or higher
License: BSD 3-clause
Greene CS, Foster JA, Stanton BA, Hogan DA, Bromberg Y. Computational approaches to study microbes and microbiomes. Pac Symp Biocomput. 2016;21:557.
Tatlow PJ, Piccolo SR. a cloud-based workflow to quantify transcript-expression levels in public cancer compendia. Sci Rep. 2016;6
Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2012;41:D991–5.
Kolesnikov N, Hastings E, Keays M, Melnichuk O, Tang YA, Williams E, et al. ArrayExpress update—simplifying data submissions. Nucleic Acids Res. 2014:gku1057.
Tan J, Hammond JH, Hogan DA, Greene CS. Adage-based integration of publicly available Pseudomonas aeruginosa gene expression data with denoising autoencoders illuminates microbe-host interactions. mSystems. 2016;1:25.
Stein-O’Brien G, Carey J, Lee W, Considine M, Favorov A, Flam E, et al. PatternMarkers and Genome-Wide CoGAPS Analysis in Parallel Sets (GWCoGAPS) for data-driven detection of novel biomarkers via whole transcriptome Non-negative matrix factorization (NMF). bioRxiv. 2016:083717.
Tan J, Doing G, Lewis KA, Price CE, Chen KM, Cady KC, et al. Unsupervised extraction of stable expression signatures from public compendia with an Ensemble of Neural Networks. Cell Syst. 2017;5:63–71.e6.
Engreitz JM, Daigle BJ, Marshall JJ, Altman RB. Independent component analysis: mining microarray data for fundamental human gene expression modules. J Biomed Inform. 2010;43:932–44.
Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
Donato M, Xu Z, Tomoiaga A, Granneman JG, MacKenzie RG, Bao R, et al. Analysis and correction of crosstalk effects in pathway analysis. Genome Res. 2013;23:1885–93.
Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, et al. PID: the pathway interaction database. Nucleic Acids Res. 2009;37:D679.
Brunet J-P, Tamayo P, Golub TR, Mesirov JP. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci. 2004;101:4164–9.
Kim PM, Tidor B. Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res. 2003;13:1706–18.
Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The Cancer genome atlas Pan-Cancer analysis project. Nat Genet. 2013;45:1113–20.
Visakh R, Nazeer KA. Identifying epigenetically dysregulated pathways from pathway–pathway interaction networks. Comput Biol Med. 2016;76:160–7.
Yang J-B, Luo R, Yan Y, Chen Y. Differential pathway network analysis used to identify key pathways associated with pediatric pneumonia. Microb Pathog. 2016;101:50–5.
Pham L, Christadore L, Schaus S, Kolaczyk ED. Network-based prediction for sources of transcriptional dysregulation using latent pathway identification analysis. Proc Natl Acad Sci. 2011;108:13347–52.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.
Pan Q, Hu T, Andrew AS, Karagas MR, Moore JH. Bladder cancer specific pathway interaction networks. ECAL Citeseer. 2013:94–101.
Li Y, Agarwal P, Rajagopalan D. A global pathway crosstalk network. Bioinformatics. 2008;24:1442–7.
de A-JG, Meja-Pedroza RA, Espinal-Enrquez J, Hernndez-Lemus E. Crosstalk events in the estrogen signaling pathway may affect tamoxifen efficacy in breast cancer molecular subtypes. Comput Biol Chem. 2015;59:42–54.
Pita-Juárez Y, Altschuler G, Kariotis S, Wei W, Koler K, Green C, et al. The pathway Coexpression network: revealing pathway relationships. PLoS Comput Biol. 2018;14:e1006042.
Bell L, Chowdhary R, Liu JS, Niu X, Zhang J. Integrated bio-entity network: a system for biological knowledge discovery. PLoS One. 2011;6:e21474.
Chowbina SR, Wu X, Zhang F, Li PM, Pandey R, Kasamsetty HN, et al. HPD: an online integrated human pathway database enabling systems biology studies. BMC Bioinformatics. 2009;10:S5.
Wu X, Chen JY. Molecular interaction networks: topological and functional characterizations. Autom Proteomics Genomics Eng Case-Based Approach. 2009;145
Glass K, Girvan M. Finding new order in biological functions from the network structure of gene annotations. PLoS Comput Biol. 2015;11:e1004565.
Abdi H, Williams LJ. Principal component analysis. Wiley Interdiscip Rev Comput Stat. 2010;2:433–59.
Stone JV. Independent component analysis. Wiley Online Library. 2004;
Upton GJG. Fisher's exact test. J. Royal Stat. Soc. 1992;155:395–402.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc. 1995;57:289–300.
van der WS, Colbert SC, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Comput Sci Eng. 2011;13:22–30.
Bostock M. D3js Data Driven Doc. 2012:492.
Stein-O’Brien GL, Arora R, Culhane AC, Favorov A, Greene C, Goff LA, et al. Enter the matrix: Interpreting unsupervised feature learning with matrix decomposition to discover hidden knowledge in high-throughput omics data. bioRxiv. 2017:196915.
Stability YB. Bernoulli. 2013;19:1484–500.
Liu X, Long D, You H, Yang D, Zhou S, Zhang S, et al. Phosphatidylcholine affects the secretion of the alkaline phosphatase PhoA in Pseudomonas strains. Microbiol Res. 2016;192:21–9.
Ball G, ric D, Lazdunski A, Filloux A. A novel type II secretion system in Pseudomonas aeruginosa. Mol Microbiol. 2002;43:475–85.
Ball G, Viarre V, Garvis S, Voulhoux R, Filloux A. Type II-dependent secretion of a Pseudomonas aeruginosa DING protein. Res Microbiol. 2012;163:457–69.
Stephens DL, Choe MD, Earhart CF. Escherichia coli periplasmic protein FepB binds ferrienterobactin. Microbiology. 1995;141:1647–54.
Ellison ML, IIII JMF, Parrish W, Danell AS, Pesci EC. The transcriptional regulator Np20 is the zinc uptake regulator in Pseudomonas aeruginosa. PLoS One. 2013;8:e75389.
Winsor GL, Lam DK, Fleming L, Lo R, Whiteside MD, Nancy YY, et al. Pseudomonas genome database: improved comparative analysis and population genomics capability for Pseudomonas genomes. Nucleic Acids Res. 2010:gkq869.
Althaus EW, Outten CE, Olson KE, Cao H, O'Halloran TV. The ferric uptake regulation (Fur) repressor is a zinc metalloprotein. Biochemistry. 1999;38:6559–69.
Ma Z, Faulkner MJ, Helmann JD. Origins of specificity and cross-talk in metal ion sensing by Bacillus subtilis Fur. Mol Microbiol. 2012;86:1144–55.
Bailey P, Chang DK, Nones K, Johns AL, Patch A-M, Gingras M-C, et al. Genomic analyses identify molecular subtypes of pancreatic cancer. Nature. 2016;531:47–52.
Network CGAR. Integrated genomic analyses of ovarian carcinoma. Nature 2011;474:609–615.
Yuan Y, Allen EMV, Omberg L, Wagle N, Amin-Mansour A, Sokolov A, et al. Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat Biotechnol. 2014;32:644–52.
Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. 2011;144:646–74.
Moldovan G-L, D’Andrea AD. How the Fanconi Anemia pathway guards the genome. Annu Rev Genet. 2009;43:223–49.
Sadasivam S, DeCaprio JA. The DREAM complex: master coordinator of cell cycle-dependent gene expression. Nat Rev Cancer. 2013;13:585–95.
Tomida J, Itaya A, Shigechi T, Unno J, Uchida E, Ikura M, et al. A novel interplay between the Fanconi Anemia core complex and ATR-ATRIP kinase during DNA cross-link repair. Nucleic Acids Res. 2013;41:6930–41.
Itasaki N, Hoppler S. Crosstalk between Wnt and bone morphogenic protein signaling: a turbulent relationship. Dev Dyn. 2010;239:16–33.
MacDonald BT, Tamai K, He X. Wnt/β-catenin signaling: components, mechanisms, and diseases. Dev Cell. 2009;17:9–26.
Aman A, Piotrowski T. Wnt/β-catenin and Fgf signaling control collective cell migration by restricting chemokine receptor expression. Dev Cell. 2008;15:749–61.
Petersen M, Pardali E, Horst GVD, Cheung H, Hoogen CVD, Pluijm GVD, et al. Smad2 and Smad3 have opposing roles in breast cancer bone metastasis by differentially affecting tumor angiogenesis. Oncogene. 2010;29:1351–61.
Yoon H, Dehart JP, Murphy JM, S-TS L. Understanding the roles of FAK in cancer: inhibitors, genetic models, and new insights. J Histochem Cytochem. 2015;63:114–28.
Chinnam M, Goodrich DW. RB1, development. and cancer Curr Top Dev Biol. 2011;94:129.
Takuwa Y, Du W, Qi X, Okamoto Y, Takuwa N, Yoshioka K. Roles of sphingosine-1-phosphate signaling in angiogenesis. World J Biol Chem. 2010;1:298–306.
Watson C, Long JS, Orange C, Tannahill CL, Mallon E, McGlynn LM, et al. High expression of sphingosine 1-phosphate receptors, S1P 1 and S1P 3, sphingosine kinase 1, and extracellular signal-regulated kinase-1/2 is associated with development of tamoxifen resistance in estrogen receptor-positive breast cancer patients. Am J Pathol. 2010;177:2205–15.
Paik JH, Chae S, Lee M-J, Thangada S, Hla T. Sphingosine 1-phosphate-induced endothelial cell migration requires the expression of EDG-1 and EDG-3 receptors and rho-dependent activation of αvβ3-and β1-containing integrins. J Biol Chem. 2001;276:11830–7.
Serini G, Valdembri D, Bussolino F. Integrins and angiogenesis: a sticky business. Exp Cell Res. 2006;312:651–8.
Brogdon JL, Leitenberg D, Bottomly K. The potency of TCR signaling differentially regulates NFATc/p activity and early IL-4 transcription in naive CD4 T cells. J Immunol. 2002;168:3825–32.
Hsieh C-S, Macatonia SE, Tripp CS, Wolf SF, O’Garra A, Murphy KM. Development of TH1 CD4 T cells through IL-12. Science. 1993;260:547.
Vacaflores A, Chapman NM, Harty JT, Richer MJ, Houtman JC. Exposure of human CD4 T cells to IL-12 results in enhanced TCR-induced cytokine production, altered TCR signaling, and increased oxidative metabolism. PLoS One. 2016;11:e0157175.
Lau E, Ze’ev AR. ATF2–at the crossroad of nuclear and cytosolic functions. J Cell Sci. 2012;125:2815–24.
Shaulian E, Karin M. AP-1 as a regulator of cell life and death. Nat Cell Biol. 2002;4:E136.
Mller MR, Rao A. NFAT, immunity and cancer: a transcription factor comes of age. Nat Rev Immunol. 2010;10:645–56.
McKinney W. Data structures for statistical computing in python. Proc. 9th Python Sci. Conf. van der Voort S, Millman J; 2010. p. 51–6.
Jones E, Oliphant T, Peterson P. SciPy: open source scientific tools for Python. 2014;
Seabold S, Perktold J. Statsmodels: Econometric and statistical modeling with python. Proc. 9th Python Sci. Conf. 2010:61.
The authors would also like to thank Daniel Himmelstein, Dongbo Hu, Kurt Wheeler, and René Zelaya for helping to review the source code.
KMC was funded by an undergraduate research grant from the Penn Institute for Biomedical Informatics. CSG was funded in part by a grant from the Gordon and Betty Moore Foundation (GBMF 4552). JT, GD, DAH, and CSG were funded in part by a pilot grant from the Cystic Fibrosis Foundation (STANTO15R0). DAH was funded in part by R01-AI091702.
Availability of data and materials
P. aeruginosa eADAGE models: https://doi.org/10.5281/zenodo.583172
P. aeruginosa gene expression compendium: doi:https://doi.org/10.5281/zenodo.583694
The normalized gene expression compendium provided in this Zenodo record contains datasets on the GPL84 from ArrayExpress as of 31 July 2015. It includes 1051 samples grown in 78 distinct medium conditions, 128 distinct strains and isolates, and dozens of different environmental parameters . Tan et al. compiled this dataset and used it to construct the 10 eADAGE models from which we generate the eADAGE-based P. aeruginosa KEGG network . We use this same data compendium to generate the NMF P. aeruginosa model. The script used to download and process those datasets into the compendium is available at https://goo.gl/YjOEQl.
TCGA pan-cancer dataset: https://doi.org/10.5281/zenodo.56735
The pan-cancer dataset was compiled using data from all 33 TCGA cohorts. It was generated by the TCGA Research Network: http://cancergenome.nih.gov/ . RNA-seq data was downloaded on 8 March 2016 from UCSC Xena (https://xenabrowser.net/datapages/). Gene expression was measured using the Illumina HiSeq technology. More information and the latest version of the dataset can be found at https://xenabrowser.net/datapages/?dataset=TCGA.PANCAN.sampleMap/HiSeqV2&host=https://tcga.xenahubs.net.
PathCORE-T analysis: (https://github.com/greenelab/PathCORE-T-analysis/tree/v1.2.0) This repository contains all the scripts to reproduce the analyses described in this paper. The Python scripts here should be used as a starting point for new PathCORE-T analyses. Instructions for setting up a web application for a user’s specific PathCORE-T analysis are provided in this repository’s README.
Overlap correction: (https://github.com/kathyxchen/crosstalk-correction/tree/v1.0.4) Donato et al.’s procedure for overlap correction  is a pip-installable Python package ‘crosstalk-correction’ that is separate from, but listed as a dependency in, PathCORE-T. It is implemented using NumPy .
PathCORE-T methods: (https://github.com/greenelab/PathCORE-T/tree/v1.0.2) The methods included in the PathCORE-T analysis workflow (Fig. 4c) are provided as a pip-installable Python package ‘pathcore’. It is implemented using Pandas , SciPy (specifically, scipy.stats) , StatsModels , and the crosstalk-correction package.
PathCORE-T demo application: (https://github.com/kathyxchen/PathCORE-T-demo/tree/v2.1.0) The project home page, https://pathcore-demo.herokuapp.com provides links to
The web application for the eADAGE-based KEGG P. aeruginosa described in the first case study.
A view of the NMF-based PID pathway co-occurrence network described in the second case study.
A quick view page where users can temporarily load and visualize their own network file (generated from the PathCORE-T analysis).
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Figure S1. The weight distributions of genes to constructed features for NMF models of various sizes. (EPS 494 kb)
PathCORE-T network constructed from a Pseudomonas aeruginosa compendium using 10 eADAGE models analyzed with KEGG pathway annotations. (TSV 16 kb)
PathCORE-T network constructed from TCGA data using NMF models analyzed with PID pathway annotations. (TSV 5 kb)
About this article
Cite this article
Chen, K.M., Tan, J., Way, G.P. et al. PathCORE-T: identifying and visualizing globally co-occurring pathways in large transcriptomic compendia. BioData Mining 11, 14 (2018) doi:10.1186/s13040-018-0175-7
- Gene expression
- Unsupervised feature construction
- Pathway interactions