Overview
In the medical domain, gene expression can be used as a biomarker to measure the functional state of a cell. One way in which drugs mediate their therapeutic or toxic effects is by altering gene expression. However, the assays needed to test how gene expression changes in response to a drug can be expensive and time consuming. Imputation has the potential to accelerate research by “recommending” novel gene-drug responses. Random walk methods can combine sparse heterogeneous graphs based on the principle of “guilt-by-association” [12]. Figure 1 provides an abstracted schematic of the proposed framework. Figure 2 provides a visualization of the input and output for the random walk with restart (RWR) method. Figure 3 presents a bird’s-eye view of the data processing, validation, and application steps performed in this study.
Data acquisition
The gene expression data come from two primary sources.
First, we acquired single-cell RNA-Seq (scRNA-Seq) expression data for 5 glioblastoma multiforme tumors [34] using the recount2 package for the R programming language [35] (ID: SRP042161). Since scRNA-Seq data are incredibly sparse, and since the random walk with restart algorithm is computationally expensive, we elected to remove genes that had zero values in more than 25% of cells. After pre-processing, our data contain 3022 gene features and 676 single cells. These cells belong to 5 patients, with 192, 97, 97, 193, and 97 cells per patient, respectively. Finally, we randomly split the cells into 5-folds per patient so that we could estimate the variability of our downstream analyses.
Second, we acquired gene expression data from the Library of Integrated Network-Based Cellular Signatures (LINCS) [36] using the Gene Expression Omnibus (GEO) [37] (ID: GSE70138). We split these LINCS data into smaller data sets based on the cell line ID under study. We a priori included the A375 (skin; malignant melanoma), HA1E (kidney; embryonic), HT29 (colon; adenocarcinoma), MCF7 (breast; adenocarcinoma), and PC3 (prostate; adenocarcinoma) cell lines because they were treated with the largest number of drugs.
Defining the gene co-expression network graphs
Although correlation is a popular choice for measuring gene co-expression, correlations can yield spurious results for next-generation sequencing data [38]. Instead, we calculate the proportionality between genes using the ϕs metric from the propr package for the R programming language [39]. Although this does not offer a perfect solution [40], studying gene-gene proportionality has a strong theoretical justification [38] and empirically outperforms other metrics of association for scRNA-Seq [41].
The proportionality metric describes the dissimilarity between any two genes, and ranges from [0,∞), where 0 indicates a perfect association. We converted this to a similarity measure ϕi that ranges from [0,1] by max-scaling ϕi=(max(ϕs)−ϕs)/max(ϕs), such that ϕi=1 when ϕs=0. A gene-gene matrix of ϕi scores is analogous to a gene-gene matrix of correlation coefficients, and constitutes our gene co-expression network. We calculated the ϕi co-expression network for the entire scRNA-Seq data set (1 network), for 5 folds of 5 patients (25 networks total), and for each of the 5 LINCS drug-free cell lines (5 networks total). All co-expression networks are available from https://zenodo.org/record/3522494.
Defining the bipartite graphs
Consider a graph G with V=1...N vertices, with positive and negative edges. The graphs used for our analyses are composed for two parts: a (general knowledge) bipartite graph and a (specific knowledge) fully-connected gene co-expression graph. For a bipartite graph, the vertex set V can be separated into two distinct sets, V1 and V2, such that no edges exist within either set. For a fully-connected (or complete) graph, there exists an edge between every pair of vertices within one set. For the graph G, the bipartite and fully-connected graphs are joined via the common vertex set V1 that contains genes and V2 contains annotations or drugs.
We constructed two types of bipartite graphs: the gene-annotation graph and the gene-drug graph. First, we made the gene-annotation graph from the Gene Ontology Biological Process database [9] via the AnnotationDbi and org.Hs.eg.db Bioconductor packages. An edge exists whenever a gene is associated with an annotation. Second, we made the gene-drug graphs using the LINCS data. For each cell line, we computed a gene-drug graph by calculating the log-fold change between the median of the drug-treated cell’s expression and the median of the drug-naive cell’s expression. This results in a fully-connected and weighted bipartite graph, where a large positive value means that the drug causes the gene to up-regulate (and vice versa). All bipartite graphs are available from https://zenodo.org/record/3522494.
Dual-channel random walk with restart (RWR)
Traditional RWR methods can only perform a random walk on graphs with positive edge weights [13]. Since the response of a gene to a drug is directional (up-regulated or down-regulated), we chose to use a modified RWR method, proposed by [42], that handles graphs with both positive and negative edge weights. Random walk requires transition probability matrices to decide the next step in the walk. The Chen et al. transition probability matrices can be computed based on the following equations:
$$ P\left(x_{j}|x_{i}\right) = \frac{|e_{ij}|}{\sum_{l\in N(x_{i})}|e_{il}|} $$
(1)
$$ P\left(x^{+}_{j}|x^{+}_{i}\right) = P\left(x^{-}_{j}|x^{-}_{i}\right) = \left\{\begin{array}{ll} P\left(x_{j}|x_{i}\right),& \text{if}\ e_{ij} >= 0\\ 0, & \text{otherwise} \end{array}\right. $$
$$ P\left(x^{-}_{j}|x^{+}_{i}\right) = P\left(x^{+}_{j}|x^{-}_{i}\right) = \left\{\begin{array}{ll} P\left(x_{j}|x_{i}\right),& \text{if}\ e_{ij} < 0\\ 0, & \text{otherwise} \end{array}\right. $$
These equations separate out the positive (and negative) transitions, and are used to calculate the total positive (and negative) information flow for each node. They are fixed for all steps.
Though the transition probabilities are computed separately, the information accumulated in a node depends on both the positive and negative information which flows through the node. For example, the positive information in a node depends on the negative information of any neighboring node connected by a negative edge weight (think: negative times negative is positive). Likewise, negative information in a node depends on the positive information in a neighboring node connected by a negative edge weight, and vice versa (think: negative times positive is negative). Figure 4 illustrates the information flow to a node xj from two neighbors.
The flow of information between the positive “plane” of the graph to the negative “plane” of the graph can be formulated with the equations:
$$ {}P\left(x^{+}_{j}\right)_{k} \,=\, \left[\!\sum_{x_{i}\in N(x_{i}) \& e_{ij}\geq0}P\!\left(x^{+}_{i}\right)_{k-1}P\!\left(x^{+}_{j}\mid x^{+}_{i}\right)\right] \!+ \!\left[\sum_{x_{i}\in N(x_{i}) \& e_{ij}<0}P\left(x^{-}_{i}\right)_{k-1}P\left(x^{+}_{j}\mid x^{-}_{i}\right)\right] $$
(2)
$$ {} P(x^{-}_{j})_{k} \,=\, \left[\sum_{x_{i}\in N(x_{i}) \& e_{ij}\geq0}P\left(x^{-}_{i}\right)_{k-1}P\left(x^{-}_{j}\mid x^{-}_{i}\right)\right] + \left[\sum_{x_{i}\in N(x_{i}) \& e_{ij}<0}P\left(x^{+}_{i}\right)_{k-1}P\left(x^{-}_{j}\mid x^{+}_{i}\right)\!\right] $$
(3)
where the probability \(P\left (x^{+}_{j}\right)_{k}\) is updated at each step k=2...10000.
RWR always considers a probability α to return back to the original nearest neighboring nodes at each step in the random walk. This is used to weigh the importance of node-specific information with respect to the whole graph, including for long walks:
$$ P_{rst}\left(x^{+}_{j}\right)_{k} = \left(1-\alpha\right)\times P\left(x^{+}_{j}\right)_{k-1} + \alpha\times P\left(x^{+}_{j}\right)_{2} $$
(4)
$$ P_{rst}\left(x^{-}_{j}\right)_{k} = \left(1-\alpha\right)\times P\left(x^{-}_{j}\right)_{k-1} + \alpha\times P\left(x^{-}_{j}\right)_{2} $$
(5)
where the restart probability \(P_{rst}\left (x^{+}_{j}\right)_{k}\) is updated at each step k=2...10000, and \(P\left (x^{+}_{j}\right)_{2}\) is the probability after the first update. These equations find the positive and negative restart information with respect to the node xj. Each \(P_{rst}\left (x^{+}_{j}\right)_{k}\) is a vector of probabilities that together sum to 1. This probability has two parts: the global information and the local information. The local information is the initial probability with respect to the nearest neighbors of node xj, and is denoted by \(P\left (x^{+}_{j}\right)_{2}\) [or \(P\left (x^{-}_{j}\right)_{2}\)] (i.e., the probability after the first update). The restart probability α is chosen from the range [0,1], where a higher value weighs the local information more than the global information. We chose α=0.1 to place a larger emphasis on the global information. This is also the value used by [42]. When applied to our synthetic data (see Additional file 1: Appendix), it produced expected results.
Analysis of random walk with restart (RWR) scores
For each gene, the RWR algorithm returns a vector of probabilities that together sum to 1. According to the guilt-by-association assumption, we interpret these probabilities to indicate the strength of the connection between the reference gene and each target. Since we are only interested in gene-annotation and gene-drug relationships, we exclude all gene-gene probabilities. Viewing the probability distribution as a composition (c.f., [43]), we perform a centered log-ratio transformation of each probability vector subset. This transformation normalizes the probability distributions so that we can compare them between samples [44, 45]. We define the RWR score \(r^{+}_{ga}\) (or \(r^{-}_{ga}\)) for each gene-annotation connection as the transform of its RWR probability:
$$ r^{+}_{ga} = \log \frac{p^{+}_{ga}}{\sqrt[A]{\prod_{i}^{A} p^{+}_{gi}}} $$
(6)
$$ r^{-}_{ga} = \log \frac{p^{-}_{ga}}{\sqrt[A]{\prod_{i}^{A} p^{-}_{gi}}} $$
(7)
for a bipartite graph describing g=1...G genes and a=1...A annotations (or A drugs), where \(\mathbf {p}^{+}_{g} = P_{rst}\left (x^{+}_{g}\right)=\left [p^{+}_{g1},..., p^{+}_{gA}\right ]\) (i.e., from the final step). These transformed RWR scores can be used for univariate statistical analyses, such as an analysis of variance (ANOVA) (c.f., analysis of compositional data [46, 47]).
Benchmark validation
Validation of gene-annotation prediction
Our strategy to validate RWR for gene-annotation prediction involves “hiding” known functional associations and seeing whether the RWR algorithm can re-discover them. This is done by turning 1s into 0s in the bipartite graph, a process we call “sparsification”. Our sparsification procedure works in 4 steps. First, we combine the original GO BP (or MF) bipartite graph with the master single-cell co-expression graph. Second, we subset the graph to include 25% of the gene annotations and 25% of the genes (this is done to reduce the computational overhead). Third, we randomly hide [10,25,50] percent of the gene-annotation connections from the bipartite sub-graph. Since this random selection could cause a feature to lose all connections, we use a constrained sampling strategy: the subsampled graph must contain at least one non-zero entry for each feature. Fourth, we apply the RWR algorithm to the sparsified and non-sparsified graphs, separately. We repeat this process 25 times, using a different random graph each time. By comparing the RWR scores between the hidden and unknown connections, we can determine whether our method rediscovers hidden connections.
Validation of gene-drug prediction
We use a different strategy to validate RWR for drug-response prediction. Since we have the gene-drug and gene-gene interaction data for 5 cell lines (A375, HA1E, HT29, MCF7 and PC3), we can set aside the known gene-drug responses for 1 cell line (PC3) as a “ground truth” test set. Then, we can use a composite of the remaining 4 gene-drug graphs to predict the gene-drug responses for the withheld cell line.
This is done in two steps. First, we use the averaged gene-drug data for 4 cell lines (a general drug graph) and the gene-gene data for PC3 (a specific gene graph) to impute the gene-drug response for PC3 (a specific drug graph). In the second step, we use the gene-drug data for PC3 (a specific drug graph) and its corresponding gene-gene data (a specific gene graph) to calculate the “ground truth” RWR scores for PC (a specific drug graph). The “ground truth” is the RWR scores when all PC3 drug-response experiments have been performed. With these two outputs, we can calculate the agreement between the imputed and “ground truth” RWR scores (using Spearman’s correlation, MSE, and accuracy).
Exploratory application of gene-drug prediction
Having demonstrated that RWR can perform well for single-cell co-expression networks, and can make meaningful drug-response predictions from composite LINCS data, we combine these heterogeneous data sources to make personalized drug-response predictions for individual single-cell networks. This requires some data munging. First, we transform the ENGS features used by the single-cell data into the HGNC features used by LINCS (only including genes with a 1-to-1 mapping, resulting in 181 genes). Second, we build an HGNC co-expression network with ϕi (for 5 folds of 5 patients, yielding 25 networks total). Third, we combine the composite LINCS gene-drug bipartite graph with each of the 25 HGNC single-cell networks. Fourth, we use our RWR algorithm to predict how 181 genes would respond to 1732 drugs for each patient fold. As above, we perform an analysis of variance (ANOVA) to detect inter-patient differences.