- Software article
- Open Access
- Open Peer Review
This article has Open Peer Review reports available.
PathCORE-T: identifying and visualizing globally co-occurring pathways in large transcriptomic compendia
© The Author(s). 2018
Received: 16 April 2018
Accepted: 18 June 2018
Published: 3 July 2018
Investigators often interpret genome-wide data by analyzing the expression levels of genes within pathways. While this within-pathway analysis is routine, the products of any one pathway can affect the activity of other pathways. Past efforts to identify relationships between biological processes have evaluated overlap in knowledge bases or evaluated changes that occur after specific treatments. Individual experiments can highlight condition-specific pathway-pathway relationships; however, constructing a complete network of such relationships across many conditions requires analyzing results from many studies.
We developed PathCORE-T framework by implementing existing methods to identify pathway-pathway transcriptional relationships evident across a broad data compendium. PathCORE-T is applied to the output of feature construction algorithms; it identifies pairs of pathways observed in features more than expected by chance as functionally co-occurring. We demonstrate PathCORE-T by analyzing an existing eADAGE model of a microbial compendium and building and analyzing NMF features from the TCGA dataset of 33 cancer types. The PathCORE-T framework includes a demonstration web interface, with source code, that users can launch to (1) visualize the network and (2) review the expression levels of associated genes in the original data. PathCORE-T creates and displays the network of globally co-occurring pathways based on features observed in a machine learning analysis of gene expression data.
The PathCORE-T framework identifies transcriptionally co-occurring pathways from the results of unsupervised analysis of gene expression data and visualizes the relationships between pathways as a network. PathCORE-T recapitulated previously described pathway-pathway relationships and suggested experimentally testable additional hypotheses that remain to be explored.
The number of publicly available genome-wide datasets is growing rapidly . High-throughput sequencing technologies that measure gene expression quickly with high accuracy and low cost continue to enable this growth . Expanding public data repositories [3, 4] have laid the foundation for computational methods that consider entire compendia of gene expression data to extract biological patterns . These patterns may be difficult to detect in measurements from a single experiment. Unsupervised approaches, which identify important signals in the data without being constrained to previously-described patterns, may discover new expression modules and thus will complement supervised methods, particularly for exploratory analyses [6, 7].
Feature extraction methods are a class of unsupervised algorithms that can reveal unannotated biological processes from genomic data . Each feature can be defined by a subset of influential genes, and these genes suggest the biological or technical pattern captured by the feature. These features, like pathways, are often considered individually [7, 8]. When examined in the context of knowledgebases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) , most features are significantly enriched for more than one biological gene set . In this work, we refer to such a gene set by the colloquial term, pathway. It follows then that such features can be described by sets of functionally related pathways. We introduce the PathCORE-T (identifying pathway co-occurrence relationships in transcriptomic data) software, which implements existing methods that jointly consider features and gene sets to map pathways with shared transcriptional responses.
PathCORE-T offers a data-driven approach for identifying and visualizing transcriptional pathway-pathway relationships. In this case, relationships are drawn based on the sets of pathways, annotated in a resource of gene sets, occurring within constructed features. Because PathCORE-T starts from a feature extraction model, the number of samples in the compendium used for model generation and the fraction of samples needed to observe a specific biological or technical pattern is expected to vary by feature extraction method. Pathways must be perturbed in a sufficient fraction of experiments in the data compendium to be captured by any such method. To avoid discovering relationships between pathways that share many genes—which could more easily be discovered by directly comparing pathway membership—we implement an optional pre-processing step that corrects for genes shared between gene sets, which Donato et al. refer to as pathway crosstalk . Donato et al.’s correction method, maximum impact estimation, has not previously been implemented in open source software. We have released our implementation of maximum impact estimation as its own Python package (PyPI package name: crosstalk-correction) so that it can be used independently of PathCORE-T. Applying this correction in PathCORE-T software allows a user to examine relationships between gene sets based on how genes are expressed as opposed to which genes are shared.
We apply PathCORE-T to a microbial and a cancer expression dataset, each analyzed using different feature extraction methods, to demonstrate its broad applicability. For the microbial analysis, we created a network of KEGG pathways from recently described ensemble Analysis using Denoising Autoencoders for Gene Expression (eADAGE) models trained on a compendium of Pseudomonas aeruginosa (P. aeruginosa) gene expression data (doi:https://doi.org/10.5281/zenodo.583694) . We provide a live demo of the PathCORE-T web application for this network: users can click on edges in the network to review the expression levels of associated genes in the original compendium (https://pathcore-demo.herokuapp.com/PAO1). To show its use outside of the microbial space, we also demonstrate PathCORE-T analysis of Pathway Interaction Database (PID)-annotated  non-negative matrix factorization (NMF) features [12, 13] extracted from The Cancer Genome Atlas’s (TCGA) pan-cancer dataset of 33 different tumor types (doi:https://doi.org/10.5281/zenodo.56735) .
In addition to visualizing the results of these two applications, the PathCORE-T web interface (https://pathcore-demo.herokuapp.com/) links to the documentation and source code for our implementation and example usage of PathCORE-T. Methods implemented in PathCORE-T are written in Python and pip-installable (PyPI package name: PathCORE-T). Examples of how to use these methods are provided in the PathCORE-T analysis repository (https://github.com/greenelab/PathCORE-T-analysis). In addition to scripts that reproduce the eADAGE and NMF analyses described in this paper, the PathCORE-T-analysis repository includes a Jupyter notebook (https://goo.gl/VuzN12) with step-by-step descriptions for the complete PathCORE-T framework.
Our approach diverges from other algorithms that we identified in the literature in its intent: PathCORE-T finds pathway pairs within a biological system that are overrepresented in features constructed from diverse transcriptomic data. This complements other work that developed models specific to a single condition or disease. Approaches designed to capture pathway-pathway interactions from gene expression experiments for disease-specific, case-control studies have been published [15, 16]. For example, Pham et al. developed Latent Pathway Identification Analysis to find pathways that exert latent influences on transcriptionally altered genes . Under this approach, the transcriptional response profiles for a binary condition (disease/normal), in conjunction with pathways specified in the KEGG and functions in Gene Ontology (GO) , are used to construct a pathway-pathway network where key pathways are identified by their network centrality scores . Similarly, Pan et al. measured the betweenness centrality of pathways in disease-specific genetic interaction and coexpression networks to identify those most likely to be associated with bladder cancer risk . These methods captured pathway relationships associated with a particular disease state.
Global networks identify relationships between pathways that are not disease- or condition-specific. One such network, detailed by Li et al., relied on publicly available protein interaction data to determine pathway-pathway interactions . Two pathways were connected in the network if the number of protein interactions between the pair was significant with respect to the computed background distribution. Such approaches rely on databases of interactions, though the interactions identified can be subsequently used for pathway-centric analyses of transcriptomic data [20, 21]. Pita-Juárez et al. created the Pathway Coexpression Network (PCxN) as a tool to discover pathways correlated with a pathway of interest . They estimated correlations between pathways based on the expression of their underlying genes (as annotated in MSigDB) across a curated compendium of microarray data . Software like PathCORE-T that generates global networks of pathway relationships from unsupervised feature analysis models built using transcriptomics data has not yet been published.
The intention of PathCORE-T is to work from transcriptomic data in ways that do not give undue preference to combinations of pathways that share genes. Other methods have sought to consider shared genes between gene sets, protein-protein interactions, or other curated knowledgebases to define pathway-pathway interactions [20, 21, 23–25]. For example, Glass and Girvan described another network structure that relates functional terms in GO based on shared gene annotations . In contrast with this approach, PathCORE-T specifically removes gene overlap in pathway definitions before they are used to build a network. Our software reports pathway-pathway connections overrepresented in gene expression patterns extracted from a large transcriptomic compendium while controlling for the fact that some pathways share genes.
Our software is written in Python and pip-installable (PyPI package name: PathCORE-T), and examples of how to use the methods in PathCORE-T are provided in the PathCORE-T-analysis repository (https://github.com/greenelab/PathCORE-T-analysis). We recommend that those interested in using the PathCORE-T software consult the documentation and scripts in PathCORE-T-analysis. Each of the functions in PathCORE-T that we describe here can be used independently; however, we expect most users to employ the complete approach for interpreting pathways shared in extracted features (Fig. 1).
A weight matrix that connects each gene to each feature. We expect that this results from the application of a feature construction algorithm to a compendium of gene expression data. The primary requirements are that features must contain the full set of genes in the compendium and genes must have been assigned weights that quantify their contribution to a given feature. Accordingly, a weight matrix will have the dimensions n x k, where n is the number of genes in the compendium and k is the number of features constructed. In principal component analysis (PCA), this is the loadings matrix ; in independent component analysis (ICA), it is the unmixing matrix ; in ADAGE or eADAGE it is termed the weight matrix [5, 7]; in NMF it is the matrix W, where the NMF approximation of the input dataset A is A ~ WH . In addition to the scripts we provide for the eADAGE and NMF examples in the PathCORE-T analysis repository, we include a Jupyter notebook (https://goo.gl/VuzN12) that demonstrates how a weight matrix can be constructed by applying ICA to the P. aeruginosa gene compendium.
Gene signature rule(s). To construct a pathway co-occurrence network, the weight matrix must be processed into gene signatures by applying threshold(s) to the gene weights in each feature—we refer to these as gene signature rules. Subsequent pathway overrepresentation will be determined by the set of genes that makes up a feature’s gene signature. These are often the weights at the extremes of the distribution. How gene weights are distributed will depend on the user’s selected feature construction algorithm; because of this, a user must specify criterion for including a gene in a gene signature. PathCORE-T permits rules for a single gene signature or both a positive and a negative gene signature. The use of 2 signatures may be appropriate when the feature construction algorithm produces positive and negative weights, the extremes of which both characterize a feature (e.g. PCA, ICA, ADAGE or eADAGE). Because a feature can have more than one gene signature, we maintain a distinction between a feature and a feature’s gene signature(s).
A list of pathway definitions, where each pathway is defined by a set of genes (e.g. KEGG pathways, PID pathways, GO biological processes). We provide the files for the P. aeruginosa KEGG pathway definitions and the Nature-NCI PID pathway definitions in the PathCORE-T analysis repository (https://github.com/greenelab/PathCORE-T-analysis/tree/master/data).
Weight matrix construction and signature definition
In practice, users can obtain a weight matrix from many different methods. For the purposes of this paper, we demonstrate generality by constructing weight matrices via eADAGE and NMF.
eADAGE is an unsupervised feature construction algorithm developed by Tan et al.  that uses an ensemble of neural networks (an ensemble of ADAGE models) to capture biological patterns embedded in the expression compendium. We use models from Tan et al. . In that work, Tan et al. evaluated multiple eADAGE model sizes to identify that k = 300 features was an appropriate size for the current P. aeruginosa compendium. The authors also compared eADAGE to two other commonly used feature construction approaches, PCA and ICA . Tan et al. produced 10 eADAGE models that each extracted k = 300 features from the compendium of genome-scale P. aeruginosa data. Because PathCORE-T supports the aggregation of co-occurrence networks created from different models on the same input data, we use all 10 of these models in the PathCORE-T analysis of eADAGE models (doi:https://doi.org/10.5281/zenodo.583172).
Tan et al. refers to the features constructed by eADAGE as nodes. They are represented as a weight matrix of size n x k, where n genes in the compendium are assigned positive or negative gene weights, according to a standard normal distribution, for each of the k features. Tan et al. determined that the gene sets contributing the highest positive or highest negative weights (+/− 2.5 standard deviations) to a feature described gene expression patterns across the compendium, and thus referred to the gene sets as signatures. Because a feature’s positive and negative gene signatures did not necessarily correspond to the same biological processes or functions, Tan et al. analyzed each of these sets separately . Tan et al.’s gene signature rules are specified as an input to the PathCORE-T analysis as well.
We also constructed an NMF model for the TCGA pan-cancer dataset. Given an NMF approximation of A ~ WH , where A is the input expression dataset of size n x s (n genes by s samples), NMF aims to find the optimal reconstruction of A by WH such that W clusters on samples (size n x k) and H clusters on genes (size k x s). In order to match the number of features constructed in each eADAGE model by Tan et al., we set k, the desired number of features, to be 300 and used W as the input weight matrix for the PathCORE-T software. We found that the gene weight distribution of an NMF feature is right-skewed and (as the name suggests) non-negative (Additional file 1 Figure S1). In this case, we defined a single gene signature rule: an NMF feature’s gene signature is the set of genes with weights 2.0 standard deviations above the mean weight of the feature.
The selection of k = 300 for the NMF model allowed us to make the eADAGE-based and NMF-based case studies roughly parallel. We verified that 300 components was appropriate by evaluating the percentage of variance explained by PCA applied to the TCGA dataset. In general, the principal components explained very little variance—the first principal component only explained 11% of the variance. At 300 components, the proportion of variance explained was 81%.
As an additional analysis, we determined the number of components (k = 24) where each additional component explained less than 0.5% of the variance. We found that using a very small number of constructed features resulted in a substantial loss of power: PathCORE-T analysis with a single k = 24 model yielded no significant edges after permutation test. However, PathCORE-T can be applied over multiple models as long as the feature construction method produces different solutions depending on random seed initialization. We performed 10 factorizations to generate an aggregate of 10 k = 24-feature NMF models and found that the resulting co-occurrence network was denser (364 edges) than our k = 300 factor network (119 edges). 65 edges were found in both networks. These shared edges had higher weights, on average, in both networks compared to edges unique to each network (https://goo.gl/vnDVNA).
Construction of a pathway co-occurrence network
The network that results from the preceding method is densely connected, and many edges may be spurious. To remove correlations that cannot be distinguished from random pathway associations, we define a statistical test that determines whether a pathway-pathway relationship appearing x times in a k-feature model is unexpected under the null hypothesis—the null hypothesis being that the relationship does not appear more often than it would in a random network. We create N weighted null networks, where each null network is constructed by permuting overrepresented pathways across the model’s gene signatures while preserving the number of pathways for which each gene signature is enriched (Fig. 2b). N is a user-settable parameter: the example PathCORE-T analyses we provide specify an N of 10,000. Increasing the value of N leads to more precise p-values, particularly for low p-values, but comes at the expense of additional computation time.
In the case where we have positive and negative gene signatures, overrepresentation can be positive or negative. Because certain pathways may display bias toward one side—for example, a pathway may be overrepresented more often in features’ positive gene signatures—we perform the permutation separately for each side. The N random networks produce the background weight distribution for every observed edge; significance can then be assessed by comparing the true (observed) edge weight against the null. The p-value for each edge e is calculated by summing the number of times a random network contained e at a weight greater than or equal to its observed edge weight and dividing this value by N. Following Benjamini—Hochberg FDR correction by the number of edges in the observed network, pathway-pathway relationships with adjusted p-values above alpha (user-settable default: 0.05) are removed from the network of co-occurring pathways (Fig. 2c). The threshold alpha value is a configurable parameter, and the user should select an FDR that best balances the costs and consequences of false positives. For highly exploratory analyses in which it may be helpful to have more speculative edges, this value can be raised. For analyses that require particularly stringent control, it can be lowered.
Because the expected weight of every edge can be determined from the N random networks (by taking the sum of the background weight distribution for an edge and dividing it by N), we can divide each observed edge weight by its expected weight (dividing by 1 if the expected edge weight is 0 based on the N permutations) to get the edge’s odds ratio. Edges in the final network are weighted by their odds ratios.
Gene overlap correction
There was no existing open source implementation of this algorithm, so we implemented Donato et al.’s maximum impact estimation as a Python package (PyPI package name: crosstalk-correction). This software is separate from PathCORE-T because we expect that it may be useful in its own right for other analytical workflows, such as differential expression analysis. The procedure is written using NumPy functions and data structures, which allows for efficient implementation of array and matrix operations in Python .
In PathCORE-T, we used this software to resolve overlapping genes before pathway overrepresentation analysis. Overlap correction is applied to each feature of the model independently. This most closely matches the setting evaluated by the original authors of the method. Work on methods that resolves overlap by using information shared across features may provide opportunities for future enhancements but was deemed to be out of the scope of a software contribution.
With this step, the pathway co-occurrence network identifies relationships that are not driven simply by the same genes being annotated to multiple pathways. Without this correction step, it is difficult to determine whether a co-occurrence relationship can be attributed to the features extracted from expression data or gene overlap in the two pathway annotations. We incorporate this correction into the PathCORE-T workflow by default; however, users interested in using PathCORE-T to find connections between overlapping gene sets can choose to disable the correction step.
PathCORE-T network visualization and support for experimental follow-up
The PathCORE-T analysis workflow outputs a list of pathway-pathway relationships, or edges in a network visualization, as a text file. An example of the KEGG P. aeruginosa edges file is available for download on the demo application: http://pathcore-demo.herokuapp.com/quickview. While we chose to represent pathway-pathway relationships as a network, users can use this file output to visualize the identified relationships as an adjacency matrix or in any other format they choose.
We created a web interface for deeper examination of relationships present in the pathway co-occurrence network. The details we included in an edge-specific page (1) highlight up to twenty genes—annotated to either of the two pathways in the edge—contained in features that also contain this edge, after controlling for the total number of features that contain each gene, and (2) display the expression levels of these genes in each of the fifteen samples where they were most and least expressed. The quantity of information (twenty genes, thirty samples total) we choose to include in an edge page is intentionally limited so that users can review it in a reasonable amount of time.
To implement the functionality in (1), we computed an odds ratio for every gene annotated to one or both pathways in the edge. The odds ratio measures how often we observe a feature enriched for both the given gene and the edge of interest relative to how often we would expect to see this occurrence. We calculate the proportion of observed cases and divide by the expected proportion--equivalent to the frequency of the edge appearing in the model’s features.
Let k be the number of features from which the PathCORE-T network was built. kG is the number of features that contain gene G (i.e. G is in kG features’ gene signatures), kE the number of features that contain edge E (i.e. the two pathways connected by E are overrepresented in kE features), and kG & E the number of features that contain both gene G and edge E. The odds ratio is computed as follows:
An odds ratio above 1 suggests that the gene is more likely to appear in features enriched for this pair of pathways. In the web interface, we sort the genes by their odds ratio to highlight genes most observed with the co-occurrence relationship.
The information specified in (2) requires an “expression score” for every sample. A sample expression score is calculated using the genes we selected in goal (1): it is the average of the normalized gene expression values weighted by the normalized gene odds ratio. Selection of the most and least expressed samples is based on these scores. We use two heatmaps to show the (maximum of twenty) genes’ expression values in each of the fifteen most and least expressed samples (Fig. 4b).
For each sample in an edge page, a user can examine how the expression values of the edge’s twenty genes in that sample compare to those recorded for all other samples in the dataset that are from the same experiment (Fig. 4c). Genes that display distinct expression patterns under a specific setting may be good candidates for follow-up studies.
Unsupervised methods can identify previously undiscovered patterns in large collections of data. PathCORE-T overlays curated knowledge after feature construction to help researchers interpret constructed features in the context of existing knowledgebases. Specifically, PathCORE-T aims to clarify how expert-annotated gene sets work together from a gene expression perspective. PathCORE-T starts from an unsupervised feature construction model. Before applying the software, users should evaluate models to make sure that they capture biological features in their dataset. Model evaluation can be performed in numerous ways depending on the setting and potential assessments include consistency across biological replicates, reconstruction error given a fixed dimensionality, and independent validation experiments. Tan et al. described several ways that models could be evaluated . Datasets will vary in terms of their amenability to analysis by different model-building strategies, and researchers may wish to consult a recent review for more discussion of feature construction methods .
We implemented the methods contained in the PathCORE-T software in Python. The implementations of the primary steps are pip-installable (PyPI package name: PathCORE-T), and examples of how to use the methods in PathCORE-T are provided in the PathCORE-T-analysis repository (https://github.com/greenelab/PathCORE-T-analysis). We also implemented an optional step, which corrects for overlapping genes between pathway definitions, described by Donato et al. . Though the algorithm had been described, no publicly available implementation existed. We provide this overlap correction algorithm as a Python package (PyPI package name: crosstalk-correction) available under the BSD 3-clause license. Each component of PathCORE-T can be used independently of each other (Fig. 1c).
Here, we present analyses that can be produced by applying the full PathCORE-T pipeline to models created from a transcriptomic compendium by an unsupervised feature construction algorithm. Input pathway definitions are “overlap-corrected” for each feature before enrichment analysis. An overlap-corrected, weighted pathway co-occurrence network is built by connecting the pairs of pathways that are overrepresented in features of the model. Finally, we remove edges that cannot be distinguished from a null model of random associations based on the results of a permutation test.
Case study: P. aeruginosa eADAGE models annotated with KEGG pathways
We used PathCORE-T to create a network of co-occurring pathways out of the expression signatures extracted by eADAGE from a P. aeruginosa compendium . For every feature, overlap correction was applied to the P. aeruginosa KEGG pathway annotations and overlap-corrected annotations were used in the overrepresentation analysis. PathCORE-T aggregates multiple networks by taking the union of the edges across all networks and summing the weights of common pathway-pathway connections. We do this to emphasize the co-occurrence relationships that are more stable —that is, the relationships that appear across multiple models. Finally, we removed edges in the aggregate network that were not significant after FDR correction when compared to the background distributions generated from 10,000 permutations of the network. Used in this way, PathCORE-T software allowed for exploratory analysis of an existing well-validated model.
The network constructed using the PathCORE-T framework had 203 edges between 89 pathways. For comparison, we constructed a KEGG pathway-pathway network where edges were drawn between pathways with significant gene overlap (FDR-corrected hypergeometric test < 0.05). The overlap-based network had 406 edges between 158 pathways. Only 35 of the edges in the PathCORE-T network were between pathways that shared genes, with an average Jaccard Index of only 0.035. The network constructed using PathCORE-T (with overlap-correction applied by default) captured pathway co-occurrences not driven by shared genes between pathways.
Case study: TCGA’s pan-cancer compendium analyzed by NMF with PID pathways
PathCORE-T is not specific to a certain dataset, organism, or feature construction method. We constructed a 300-feature NMF model of TCGA pan-cancer gene expression data, which is comprised of 33 different cancer-types from various organ sites and applied the PathCORE-T software to those features. We chose NMF because it has been used in previous studies to identify biologically relevant patterns in transcriptomic data  and by many studies to derive molecular subtypes [43–45]. The 300 NMF features were analyzed using overlap-corrected PID pathways, a collection of 196 human cell signaling pathways with a particular focus on processes relevant to cancer .
Two modules in the network related to angiogenesis, or the formation of new blood vessels (Fig. 6c, d). Tumors transmit signals that stimulate angiogenesis because a blood supply provides the necessary oxygen and nutrients for their growth and proliferation. One module relates angiogenesis factors to cell proliferation. This module connected the following pathways: PDGFR-beta signaling, FAK-mediated signaling events, VEGFR1 and VEGFR2-mediated signaling events, nuclear SMAD2/3 signaling regulation, and RB1 regulation (Fig. 6c). These functional connections are known to be involved in tumor proliferation [53–55]. The other module indicated a relationship by which the VEGF pathway interacts with the S1P3 pathway through Beta3 integrins (Fig. 6d). S1P3 is a known regulator of angiogenesis , and has been demonstrated to be associated with treatment-resistant breast cancer and poor survival . Moreover, this interaction module has been observed to promote endothelial cell migration in human umbilical veins . Taken together, this independent module may suggest a distinct angiogenesis process activated in more aggressive and metastatic tumors that is disrupted and regulated by alternative mechanisms .
Finally, PathCORE-T revealed a large, densely connected module of immune related pathways (Fig. 6e). We found that this module contains many co-occurrence relationships that align with immune system processes. One such example is the well characterized interaction cycle formed by T-Cell Receptor (TCR) signaling in naïve CD4+ T cells and IL-12/IL-4 mediated signaling events [60–62]. At the same time, PathCORE-T identifies additional immune-related relationships. We observed a cycle between the three transcription factor networks: ATF-2, AP-1, and CaN-regulated NFAT-dependent transcription. These pathways can take on different, often opposing, functions depending on the tissue and subcellular context. For example, ATF-2 can be an oncogene in one context (e.g. melanoma) and a tumor suppressor in another (e.g. breast cancer) . AP-1, comprised of Jun/Fos proteins, is associated with both tumorigenesis and tumor suppression due to its roles in cell survival, proliferation, and cell death . Moreover, NFAT in complex with AP-1 regulates immune cell differentiation, but dysregulation of NFAT signaling can lead to malignant growth and tumor metastasis . The functional association observed between the ATF-2, AP-1, and NFAT cycle together within the immunity module might suggest that dysregulation within this cycle has profound consequences for immune cell processes and may trigger variable oncogenic processes.
Just as we did for the eADAGE-based P. aeruginosa KEGG pathways case study, we constructed a network only from PID pathways with significant gene overlap. The network constructed using PathCORE-T and NMF features had 119 edges between 57 pathways. The overlap-based network was much denser: it had 3826 edges between 196 pathways. This suggested a high degree of overlap between PID pathways. For the PathCORE-T NMF co-occurrence network, 96 of the edges were between pathways that shared genes. However, the average Jaccard Index for these pathway pairs remained low, at 0.058.
Unsupervised analyses of genome-scale datasets that summarize key patterns in the data have the potential to improve our understanding of how a biological system operates via complex interactions between molecular processes. Feature construction algorithms capture coordinated changes in the expression of many genes as features. The genes that contribute most to each feature co-vary. However, interpreting the features generated by unsupervised approaches remains challenging. PathCORE-T creates a network of globally co-occurring pathways based on features created from the analysis of a compendium of gene expression data. Networks modeling the relationships between curated processes in a biological system offer a means for developing new hypotheses about which pathways influence each other and when. Our framework provides a data-driven characterization of the biological system at the pathway-level by identifying pairs of pathways that are overrepresented across many features.
PathCORE-T connects the features extracted from data to curated resources. It is important to note that PathCORE-T will only be able to identify pathways that occur in features of the underlying model, which means these pathways must be transcriptionally perturbed in at least some subset of the compendium. Models should be evaluated before analysis with PathCORE-T. The network resulting from PathCORE-T can help to identify global pathway-pathway relationships—a baseline network—that complements existing work to identify interactions between pathways in the context of a specific disease. The specific niche that PathCORE-T framework aims to fill is in revealing to researchers which gene sets are most closely related to each other in machine learning-based models of gene expression, which genes play a role in this co-occurrence, and which conditions drive this relationship.
Availability and requirements
Project name: PathCORE-T
Project home page: https://pathcore-demo.herokuapp.com
Archived version: https://github.com/greenelab/PathCORE-T-analysis/releases/tag/v1.2.0 (links to download .zip and .tar.gz files are provided here)
Operating system: Platform independent
Programming language: Python
Other requirements: Python 3 or higher
License: BSD 3-clause
The authors would also like to thank Daniel Himmelstein, Dongbo Hu, Kurt Wheeler, and René Zelaya for helping to review the source code.
KMC was funded by an undergraduate research grant from the Penn Institute for Biomedical Informatics. CSG was funded in part by a grant from the Gordon and Betty Moore Foundation (GBMF 4552). JT, GD, DAH, and CSG were funded in part by a pilot grant from the Cystic Fibrosis Foundation (STANTO15R0). DAH was funded in part by R01-AI091702.
Availability of data and materials
P. aeruginosa eADAGE models: https://doi.org/10.5281/zenodo.583172
P. aeruginosa gene expression compendium: doi:https://doi.org/10.5281/zenodo.583694
The normalized gene expression compendium provided in this Zenodo record contains datasets on the GPL84 from ArrayExpress as of 31 July 2015. It includes 1051 samples grown in 78 distinct medium conditions, 128 distinct strains and isolates, and dozens of different environmental parameters . Tan et al. compiled this dataset and used it to construct the 10 eADAGE models from which we generate the eADAGE-based P. aeruginosa KEGG network . We use this same data compendium to generate the NMF P. aeruginosa model. The script used to download and process those datasets into the compendium is available at https://goo.gl/YjOEQl.
TCGA pan-cancer dataset: https://doi.org/10.5281/zenodo.56735
The pan-cancer dataset was compiled using data from all 33 TCGA cohorts. It was generated by the TCGA Research Network: http://cancergenome.nih.gov/ . RNA-seq data was downloaded on 8 March 2016 from UCSC Xena (https://xenabrowser.net/datapages/). Gene expression was measured using the Illumina HiSeq technology. More information and the latest version of the dataset can be found at https://xenabrowser.net/datapages/?dataset=TCGA.PANCAN.sampleMap/HiSeqV2&host=https://tcga.xenahubs.net.
PathCORE-T analysis: (https://github.com/greenelab/PathCORE-T-analysis/tree/v1.2.0) This repository contains all the scripts to reproduce the analyses described in this paper. The Python scripts here should be used as a starting point for new PathCORE-T analyses. Instructions for setting up a web application for a user’s specific PathCORE-T analysis are provided in this repository’s README.
Overlap correction: (https://github.com/kathyxchen/crosstalk-correction/tree/v1.0.4) Donato et al.’s procedure for overlap correction  is a pip-installable Python package ‘crosstalk-correction’ that is separate from, but listed as a dependency in, PathCORE-T. It is implemented using NumPy .
PathCORE-T methods: (https://github.com/greenelab/PathCORE-T/tree/v1.0.2) The methods included in the PathCORE-T analysis workflow (Fig. 4c) are provided as a pip-installable Python package ‘pathcore’. It is implemented using Pandas , SciPy (specifically, scipy.stats) , StatsModels , and the crosstalk-correction package.
The web application for the eADAGE-based KEGG P. aeruginosa described in the first case study.
A view of the NMF-based PID pathway co-occurrence network described in the second case study.
A quick view page where users can temporarily load and visualize their own network file (generated from the PathCORE-T analysis).
KMC implemented the software, performed the analyses, and drafted the manuscript. JT and GPW contributed computational reagents. KMC, DAH, and CSG designed the project. KMC, JT, GPW, GD, DAH, and CSG interpreted the results. JT, GPW, GD, DAH, and CSG provided critical feedback and revisions on the manuscript. JT and GPW reviewed source code. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Greene CS, Foster JA, Stanton BA, Hogan DA, Bromberg Y. Computational approaches to study microbes and microbiomes. Pac Symp Biocomput. 2016;21:557.PubMedPubMed CentralGoogle Scholar
- Tatlow PJ, Piccolo SR. a cloud-based workflow to quantify transcript-expression levels in public cancer compendia. Sci Rep. 2016;6Google Scholar
- Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2012;41:D991–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Kolesnikov N, Hastings E, Keays M, Melnichuk O, Tang YA, Williams E, et al. ArrayExpress update—simplifying data submissions. Nucleic Acids Res. 2014:gku1057.Google Scholar
- Tan J, Hammond JH, Hogan DA, Greene CS. Adage-based integration of publicly available Pseudomonas aeruginosa gene expression data with denoising autoencoders illuminates microbe-host interactions. mSystems. 2016;1:25.View ArticleGoogle Scholar
- Stein-O’Brien G, Carey J, Lee W, Considine M, Favorov A, Flam E, et al. PatternMarkers and Genome-Wide CoGAPS Analysis in Parallel Sets (GWCoGAPS) for data-driven detection of novel biomarkers via whole transcriptome Non-negative matrix factorization (NMF). bioRxiv. 2016:083717.Google Scholar
- Tan J, Doing G, Lewis KA, Price CE, Chen KM, Cady KC, et al. Unsupervised extraction of stable expression signatures from public compendia with an Ensemble of Neural Networks. Cell Syst. 2017;5:63–71.e6.View ArticlePubMedGoogle Scholar
- Engreitz JM, Daigle BJ, Marshall JJ, Altman RB. Independent component analysis: mining microarray data for fundamental human gene expression modules. J Biomed Inform. 2010;43:932–44.View ArticlePubMedPubMed CentralGoogle Scholar
- Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.View ArticlePubMedPubMed CentralGoogle Scholar
- Donato M, Xu Z, Tomoiaga A, Granneman JG, MacKenzie RG, Bao R, et al. Analysis and correction of crosstalk effects in pathway analysis. Genome Res. 2013;23:1885–93.View ArticlePubMedPubMed CentralGoogle Scholar
- Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, et al. PID: the pathway interaction database. Nucleic Acids Res. 2009;37:D679.View ArticleGoogle Scholar
- Brunet J-P, Tamayo P, Golub TR, Mesirov JP. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci. 2004;101:4164–9.View ArticlePubMedGoogle Scholar
- Kim PM, Tidor B. Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Res. 2003;13:1706–18.View ArticlePubMedPubMed CentralGoogle Scholar
- Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The Cancer genome atlas Pan-Cancer analysis project. Nat Genet. 2013;45:1113–20.View ArticlePubMedPubMed CentralGoogle Scholar
- Visakh R, Nazeer KA. Identifying epigenetically dysregulated pathways from pathway–pathway interaction networks. Comput Biol Med. 2016;76:160–7.View ArticlePubMedGoogle Scholar
- Yang J-B, Luo R, Yan Y, Chen Y. Differential pathway network analysis used to identify key pathways associated with pediatric pneumonia. Microb Pathog. 2016;101:50–5.View ArticlePubMedGoogle Scholar
- Pham L, Christadore L, Schaus S, Kolaczyk ED. Network-based prediction for sources of transcriptional dysregulation using latent pathway identification analysis. Proc Natl Acad Sci. 2011;108:13347–52.View ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Pan Q, Hu T, Andrew AS, Karagas MR, Moore JH. Bladder cancer specific pathway interaction networks. ECAL Citeseer. 2013:94–101.Google Scholar
- Li Y, Agarwal P, Rajagopalan D. A global pathway crosstalk network. Bioinformatics. 2008;24:1442–7.View ArticlePubMedGoogle Scholar
- de A-JG, Meja-Pedroza RA, Espinal-Enrquez J, Hernndez-Lemus E. Crosstalk events in the estrogen signaling pathway may affect tamoxifen efficacy in breast cancer molecular subtypes. Comput Biol Chem. 2015;59:42–54.View ArticleGoogle Scholar
- Pita-Juárez Y, Altschuler G, Kariotis S, Wei W, Koler K, Green C, et al. The pathway Coexpression network: revealing pathway relationships. PLoS Comput Biol. 2018;14:e1006042.View ArticlePubMedPubMed CentralGoogle Scholar
- Bell L, Chowdhary R, Liu JS, Niu X, Zhang J. Integrated bio-entity network: a system for biological knowledge discovery. PLoS One. 2011;6:e21474.View ArticlePubMedPubMed CentralGoogle Scholar
- Chowbina SR, Wu X, Zhang F, Li PM, Pandey R, Kasamsetty HN, et al. HPD: an online integrated human pathway database enabling systems biology studies. BMC Bioinformatics. 2009;10:S5.View ArticlePubMedPubMed CentralGoogle Scholar
- Wu X, Chen JY. Molecular interaction networks: topological and functional characterizations. Autom Proteomics Genomics Eng Case-Based Approach. 2009;145Google Scholar
- Glass K, Girvan M. Finding new order in biological functions from the network structure of gene annotations. PLoS Comput Biol. 2015;11:e1004565.View ArticlePubMedPubMed CentralGoogle Scholar
- Abdi H, Williams LJ. Principal component analysis. Wiley Interdiscip Rev Comput Stat. 2010;2:433–59.View ArticleGoogle Scholar
- Stone JV. Independent component analysis. Wiley Online Library. 2004;Google Scholar
- Upton GJG. Fisher's exact test. J. Royal Stat. Soc. 1992;155:395–402.View ArticleGoogle Scholar
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Royal Stat Soc. 1995;57:289–300.Google Scholar
- van der WS, Colbert SC, Varoquaux G. The NumPy array: a structure for efficient numerical computation. Comput Sci Eng. 2011;13:22–30.View ArticleGoogle Scholar
- Bostock M. D3js Data Driven Doc. 2012:492.Google Scholar
- Stein-O’Brien GL, Arora R, Culhane AC, Favorov A, Greene C, Goff LA, et al. Enter the matrix: Interpreting unsupervised feature learning with matrix decomposition to discover hidden knowledge in high-throughput omics data. bioRxiv. 2017:196915.Google Scholar
- Stability YB. Bernoulli. 2013;19:1484–500.View ArticleGoogle Scholar
- Liu X, Long D, You H, Yang D, Zhou S, Zhang S, et al. Phosphatidylcholine affects the secretion of the alkaline phosphatase PhoA in Pseudomonas strains. Microbiol Res. 2016;192:21–9.View ArticlePubMedGoogle Scholar
- Ball G, ric D, Lazdunski A, Filloux A. A novel type II secretion system in Pseudomonas aeruginosa. Mol Microbiol. 2002;43:475–85.View ArticlePubMedGoogle Scholar
- Ball G, Viarre V, Garvis S, Voulhoux R, Filloux A. Type II-dependent secretion of a Pseudomonas aeruginosa DING protein. Res Microbiol. 2012;163:457–69.View ArticlePubMedGoogle Scholar
- Stephens DL, Choe MD, Earhart CF. Escherichia coli periplasmic protein FepB binds ferrienterobactin. Microbiology. 1995;141:1647–54.View ArticlePubMedGoogle Scholar
- Ellison ML, IIII JMF, Parrish W, Danell AS, Pesci EC. The transcriptional regulator Np20 is the zinc uptake regulator in Pseudomonas aeruginosa. PLoS One. 2013;8:e75389.View ArticlePubMedPubMed CentralGoogle Scholar
- Winsor GL, Lam DK, Fleming L, Lo R, Whiteside MD, Nancy YY, et al. Pseudomonas genome database: improved comparative analysis and population genomics capability for Pseudomonas genomes. Nucleic Acids Res. 2010:gkq869.Google Scholar
- Althaus EW, Outten CE, Olson KE, Cao H, O'Halloran TV. The ferric uptake regulation (Fur) repressor is a zinc metalloprotein. Biochemistry. 1999;38:6559–69.View ArticlePubMedGoogle Scholar
- Ma Z, Faulkner MJ, Helmann JD. Origins of specificity and cross-talk in metal ion sensing by Bacillus subtilis Fur. Mol Microbiol. 2012;86:1144–55.View ArticlePubMedPubMed CentralGoogle Scholar
- Bailey P, Chang DK, Nones K, Johns AL, Patch A-M, Gingras M-C, et al. Genomic analyses identify molecular subtypes of pancreatic cancer. Nature. 2016;531:47–52.View ArticlePubMedGoogle Scholar
- Network CGAR. Integrated genomic analyses of ovarian carcinoma. Nature 2011;474:609–615.Google Scholar
- Yuan Y, Allen EMV, Omberg L, Wagle N, Amin-Mansour A, Sokolov A, et al. Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat Biotechnol. 2014;32:644–52.View ArticlePubMedPubMed CentralGoogle Scholar
- Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. 2011;144:646–74.View ArticlePubMedGoogle Scholar
- Moldovan G-L, D’Andrea AD. How the Fanconi Anemia pathway guards the genome. Annu Rev Genet. 2009;43:223–49.View ArticlePubMedPubMed CentralGoogle Scholar
- Sadasivam S, DeCaprio JA. The DREAM complex: master coordinator of cell cycle-dependent gene expression. Nat Rev Cancer. 2013;13:585–95.View ArticlePubMedPubMed CentralGoogle Scholar
- Tomida J, Itaya A, Shigechi T, Unno J, Uchida E, Ikura M, et al. A novel interplay between the Fanconi Anemia core complex and ATR-ATRIP kinase during DNA cross-link repair. Nucleic Acids Res. 2013;41:6930–41.View ArticlePubMedPubMed CentralGoogle Scholar
- Itasaki N, Hoppler S. Crosstalk between Wnt and bone morphogenic protein signaling: a turbulent relationship. Dev Dyn. 2010;239:16–33.PubMedGoogle Scholar
- MacDonald BT, Tamai K, He X. Wnt/β-catenin signaling: components, mechanisms, and diseases. Dev Cell. 2009;17:9–26.View ArticlePubMedPubMed CentralGoogle Scholar
- Aman A, Piotrowski T. Wnt/β-catenin and Fgf signaling control collective cell migration by restricting chemokine receptor expression. Dev Cell. 2008;15:749–61.View ArticlePubMedGoogle Scholar
- Petersen M, Pardali E, Horst GVD, Cheung H, Hoogen CVD, Pluijm GVD, et al. Smad2 and Smad3 have opposing roles in breast cancer bone metastasis by differentially affecting tumor angiogenesis. Oncogene. 2010;29:1351–61.View ArticlePubMedGoogle Scholar
- Yoon H, Dehart JP, Murphy JM, S-TS L. Understanding the roles of FAK in cancer: inhibitors, genetic models, and new insights. J Histochem Cytochem. 2015;63:114–28.View ArticlePubMedGoogle Scholar
- Chinnam M, Goodrich DW. RB1, development. and cancer Curr Top Dev Biol. 2011;94:129.View ArticlePubMedGoogle Scholar
- Takuwa Y, Du W, Qi X, Okamoto Y, Takuwa N, Yoshioka K. Roles of sphingosine-1-phosphate signaling in angiogenesis. World J Biol Chem. 2010;1:298–306.View ArticlePubMedPubMed CentralGoogle Scholar
- Watson C, Long JS, Orange C, Tannahill CL, Mallon E, McGlynn LM, et al. High expression of sphingosine 1-phosphate receptors, S1P 1 and S1P 3, sphingosine kinase 1, and extracellular signal-regulated kinase-1/2 is associated with development of tamoxifen resistance in estrogen receptor-positive breast cancer patients. Am J Pathol. 2010;177:2205–15.View ArticlePubMedPubMed CentralGoogle Scholar
- Paik JH, Chae S, Lee M-J, Thangada S, Hla T. Sphingosine 1-phosphate-induced endothelial cell migration requires the expression of EDG-1 and EDG-3 receptors and rho-dependent activation of αvβ3-and β1-containing integrins. J Biol Chem. 2001;276:11830–7.View ArticlePubMedGoogle Scholar
- Serini G, Valdembri D, Bussolino F. Integrins and angiogenesis: a sticky business. Exp Cell Res. 2006;312:651–8.View ArticlePubMedGoogle Scholar
- Brogdon JL, Leitenberg D, Bottomly K. The potency of TCR signaling differentially regulates NFATc/p activity and early IL-4 transcription in naive CD4 T cells. J Immunol. 2002;168:3825–32.View ArticlePubMedGoogle Scholar
- Hsieh C-S, Macatonia SE, Tripp CS, Wolf SF, O’Garra A, Murphy KM. Development of TH1 CD4 T cells through IL-12. Science. 1993;260:547.View ArticlePubMedGoogle Scholar
- Vacaflores A, Chapman NM, Harty JT, Richer MJ, Houtman JC. Exposure of human CD4 T cells to IL-12 results in enhanced TCR-induced cytokine production, altered TCR signaling, and increased oxidative metabolism. PLoS One. 2016;11:e0157175.View ArticlePubMedPubMed CentralGoogle Scholar
- Lau E, Ze’ev AR. ATF2–at the crossroad of nuclear and cytosolic functions. J Cell Sci. 2012;125:2815–24.View ArticlePubMedPubMed CentralGoogle Scholar
- Shaulian E, Karin M. AP-1 as a regulator of cell life and death. Nat Cell Biol. 2002;4:E136.View ArticleGoogle Scholar
- Mller MR, Rao A. NFAT, immunity and cancer: a transcription factor comes of age. Nat Rev Immunol. 2010;10:645–56.View ArticleGoogle Scholar
- McKinney W. Data structures for statistical computing in python. Proc. 9th Python Sci. Conf. van der Voort S, Millman J; 2010. p. 51–6.Google Scholar
- Jones E, Oliphant T, Peterson P. SciPy: open source scientific tools for Python. 2014;Google Scholar
- Seabold S, Perktold J. Statsmodels: Econometric and statistical modeling with python. Proc. 9th Python Sci. Conf. 2010:61.Google Scholar