Personalized single-cell networks: a framework to predict the response of any gene to any drug for any patient

Background The last decade has seen a major increase in the availability of genomic data. This includes expert-curated databases that describe the biological activity of genes, as well as high-throughput assays that measure gene expression in bulk tissue and single cells. Integrating these heterogeneous data sources can generate new hypotheses about biological systems. Our primary objective is to combine population-level drug-response data with patient-level single-cell expression data to predict how any gene will respond to any drug for any patient. Methods We take 2 approaches to benchmarking a “dual-channel” random walk with restart (RWR) for data integration. First, we evaluate how well RWR can predict known gene functions from single-cell gene co-expression networks. Second, we evaluate how well RWR can predict known drug responses from individual cell networks. We then present two exploratory applications. In the first application, we combine the Gene Ontology database with glioblastoma single cells from 5 individual patients to identify genes whose functions differ between cancers. In the second application, we combine the LINCS drug-response database with the same glioblastoma data to identify genes that may exhibit patient-specific drug responses. Conclusions Our manuscript introduces two innovations to the integration of heterogeneous biological data. First, we use a “dual-channel” method to predict up-regulation and down-regulation separately. Second, we use individualized single-cell gene co-expression networks to make personalized predictions. These innovations let us predict gene function and drug response for individual patients. Taken together, our work shows promise that single-cell co-expression data could be combined in heterogeneous information networks to facilitate precision medicine. Supplementary Information The online version contains supplementary material available at (10.1186/s13040-021-00263-w).


Introduction
genes [31], and functional similarities between genes [32]. Bi-random walk, another random walk variant, has been used to rank disease genes from a protein-protein interaction network [33].
In contrast to the previous works, which make use of population-level graphs, we apply RWR to patient-level graphs, allowing us to make predictions about gene behavior that are personalized to each patient. We take 2 approaches to benchmarking "dual-channel" RWR for data integration. First, we evaluate how well RWR can predict known gene functions from single-cell gene co-expression networks. Second, we evaluate how well RWR can predict known drug responses from individual cell networks. We then present two exploratory applications. In the first application, we combine the Gene Ontology database with glioblastoma single cells from 5 individual patients to identify genes whose functions differ between cancers. In the second application, we combine the LINCS drugresponse database with the same glioblastoma data to identify genes that may exhibit patient-specific drug responses. Taken together, our work shows promise that singlecell co-expression data could be combined in heterogeneous information networks to facilitate precision medicine.

Overview
In the medical domain, gene expression can be used as a biomarker to measure the functional state of a cell. One way in which drugs mediate their therapeutic or toxic effects is by altering gene expression. However, the assays needed to test how gene expression changes in response to a drug can be expensive and time consuming. Imputation has the potential to accelerate research by "recommending" novel gene-drug responses. Random walk methods can combine sparse heterogeneous graphs based on the principle of "guilt-byassociation" [12]. Figure 1 provides an abstracted schematic of the proposed framework. Figure 2 provides a visualization of the input and output for the random walk with restart (RWR) method. Figure 3 presents a bird's-eye view of the data processing, validation, and application steps performed in this study.

Data acquisition
The gene expression data come from two primary sources.
First, we acquired single-cell RNA-Seq (scRNA-Seq) expression data for 5 glioblastoma multiforme tumors [34] using the recount2 package for the R programming language [35] Fig. 1 An abstracted schematic of the proposed framework. Expert-curated databases like Gene Ontology (GO) and the Library of Integrated Network-Based Cellular Signatures (LINCS) can provide some general knowledge about biological activity. High-throughput single-cell sequencing assays can provide specific knowledge for an individual patient. Random walk with restart (RWR) can combine these heterogeneous data sources to provide specific knowledge about biological activity for an individual patient. This framework allows us to predict how any gene will respond to any drug for any patient Fig. 2 Our goal is to combine a generic gene-based bipartite graph with the auxiliary knowledge of a fully-connected personalized graph. RWR will impute the missing links (and update the existing links) by "walking through" the auxiliary information. The left panel shows a (sparsely-connected) gene-drug graph combined with a (fully-connected) gene-gene graph, where d i represents the drugs and g i represents the genes. The example gene-drug network has missing links. The right panel shows the output of RWR: a complete network of newly predicted gene-drug interactions. Here, the missing link between any drug d i and any gene g i is replaced with a new link. The method works based on the principle of "guilt-by-association": the value of the new d i − g i links will be large if g i is strongly connected to genes that are also connected to d i . When the gene-drug graph is fully-connected, RWR will instead "update" the importance of each connection. Note that, with respect to this figure, the gene-annotation bipartite graph is conceptually equivalent to the gene-drug bipartite graph, except that many edges will have zero weight (ID: SRP042161). Since scRNA-Seq data are incredibly sparse, and since the random walk with restart algorithm is computationally expensive, we elected to remove genes that had zero values in more than 25% of cells. After pre-processing, our data contain 3022 gene features and 676 single cells. These cells belong to 5 patients, with 192, 97, 97, 193, and 97 cells per patient, respectively. Finally, we randomly split the cells into 5-folds per patient so that we could estimate the variability of our downstream analyses.
Second, we acquired gene expression data from the Library of Integrated Network-Based Cellular Signatures (LINCS) [36] using the Gene Expression Omnibus (GEO) [37] (ID: GSE70138). We split these LINCS data into smaller data sets based on the cell line ID under study. We a priori included the A375 (skin; malignant melanoma), HA1E (kidney; Fig. 3 A bird's-eye view of the data collection, integration, and analysis steps performed in this study. We use RWR to combine general knowledge with some specific knowledge about a sample. We separately use gene function and drug-response data as the source of general knowledge. We use co-expression networks as the source of specific knowledge. By combining the drug-response data with an individualized single-cell network, we can make predictions about gene behavior that are personalized to each patient. Concept nodes are colored by activity type: data processing (gray), validation of gene-annotation prediction (green), validation of gene-drug prediction (yellow), and exploratory application (orange) embryonic), HT29 (colon; adenocarcinoma), MCF7 (breast; adenocarcinoma), and PC3 (prostate; adenocarcinoma) cell lines because they were treated with the largest number of drugs.

Defining the gene co-expression network graphs
Although correlation is a popular choice for measuring gene co-expression, correlations can yield spurious results for next-generation sequencing data [38]. Instead, we calculate the proportionality between genes using the φ s metric from the propr package for the R programming language [39]. Although this does not offer a perfect solution [40], studying gene-gene proportionality has a strong theoretical justification [38] and empirically outperforms other metrics of association for scRNA-Seq [41].
The proportionality metric describes the dissimilarity between any two genes, and ranges from [ 0, ∞), where 0 indicates a perfect association. We converted this to a similarity measure φ i that ranges from [ 0, 1] by max- such that φ i = 1 when φ s = 0. A gene-gene matrix of φ i scores is analogous to a genegene matrix of correlation coefficients, and constitutes our gene co-expression network. We calculated the φ i co-expression network for the entire scRNA-Seq data set (1 network), for 5 folds of 5 patients (25 networks total), and for each of the 5 LINCS drug-free cell lines (5 networks total). All co-expression networks are available from https://zenodo. org/record/3522494.

Defining the bipartite graphs
Consider a graph G with V = 1...N vertices, with positive and negative edges. The graphs used for our analyses are composed for two parts: a (general knowledge) bipartite graph and a (specific knowledge) fully-connected gene co-expression graph. For a bipartite graph, the vertex set V can be separated into two distinct sets, V 1 and V 2 , such that no edges exist within either set. For a fully-connected (or complete) graph, there exists an edge between every pair of vertices within one set. For the graph G, the bipartite and fully-connected graphs are joined via the common vertex set V 1 that contains genes and V 2 contains annotations or drugs.
We constructed two types of bipartite graphs: the gene-annotation graph and the gene-drug graph. First, we made the gene-annotation graph from the Gene Ontology Biological Process database [9] via the AnnotationDbi and org.Hs.eg.db Bioconductor packages. An edge exists whenever a gene is associated with an annotation. Second, we made the gene-drug graphs using the LINCS data. For each cell line, we computed a genedrug graph by calculating the log-fold change between the median of the drug-treated cell's expression and the median of the drug-naive cell's expression. This results in a fullyconnected and weighted bipartite graph, where a large positive value means that the drug causes the gene to up-regulate (and vice versa). All bipartite graphs are available from https://zenodo.org/record/3522494.

Dual-channel random walk with restart (RWR)
Traditional RWR methods can only perform a random walk on graphs with positive edge weights [13]. Since the response of a gene to a drug is directional (up-regulated or down-regulated), we chose to use a modified RWR method, proposed by [42], that handles graphs with both positive and negative edge weights. Random walk requires tran-sition probability matrices to decide the next step in the walk. The Chen et al. transition probability matrices can be computed based on the following equations: These equations separate out the positive (and negative) transitions, and are used to calculate the total positive (and negative) information flow for each node. They are fixed for all steps.
Though the transition probabilities are computed separately, the information accumulated in a node depends on both the positive and negative information which flows through the node. For example, the positive information in a node depends on the negative information of any neighboring node connected by a negative edge weight (think: negative times negative is positive). Likewise, negative information in a node depends on the positive information in a neighboring node connected by a negative edge weight, and vice versa (think: negative times positive is negative). Figure 4 illustrates the information flow to a node x j from two neighbors.
The flow of information between the positive "plane" of the graph to the negative "plane" of the graph can be formulated with the equations: . Similarly, the negative flow of negative information can contribute to positive information. The "dual-channel" RWR algorithm incorporates these edge weights. Note that in practice each node will contain both positive and negative information where the probability P x + j k is updated at each step k = 2...10000. RWR always considers a probability α to return back to the original nearest neighboring nodes at each step in the random walk. This is used to weigh the importance of nodespecific information with respect to the whole graph, including for long walks: where the restart probability P rst x + j k is updated at each step k = 2...10000, and P x + j 2 is the probability after the first update. These equations find the positive and negative restart information with respect to the node x j . Each P rst x + j k is a vector of probabilities that together sum to 1. This probability has two parts: the global information and the local information. The local information is the initial probability with respect to the nearest neighbors of node x j , and is denoted by P ] (i.e., the probability after the first update). The restart probability α is chosen from the range [ 0, 1], where a higher value weighs the local information more than the global information. We chose α = 0.1 to place a larger emphasis on the global information. This is also the value used by [42]. When applied to our synthetic data (see Additional file 1: Appendix), it produced expected results.

Analysis of random walk with restart (RWR) scores
For each gene, the RWR algorithm returns a vector of probabilities that together sum to 1. According to the guilt-by-association assumption, we interpret these probabilities to indicate the strength of the connection between the reference gene and each target. Since we are only interested in gene-annotation and gene-drug relationships, we exclude all gene-gene probabilities. Viewing the probability distribution as a composition (c.f., [43]), we perform a centered log-ratio transformation of each probability vector subset. This transformation normalizes the probability distributions so that we can compare them between samples [44,45]. We define the RWR score r + ga (or r − ga ) for each gene-annotation connection as the transform of its RWR probability: for a bipartite graph describing g = 1...G genes and a = 1...A annotations (or A drugs), where p + g = P rst x + g = p + g1 , ..., p + gA (i.e., from the final step). These transformed RWR scores can be used for univariate statistical analyses, such as an analysis of variance (ANOVA) (c.f., analysis of compositional data [46,47]).

Validation of gene-annotation prediction
Our strategy to validate RWR for gene-annotation prediction involves "hiding" known functional associations and seeing whether the RWR algorithm can re-discover them. This is done by turning 1s into 0s in the bipartite graph, a process we call "sparsification". Our sparsification procedure works in 4 steps. First, we combine the original GO BP (or MF) bipartite graph with the master single-cell co-expression graph. Second, we subset the graph to include 25% of the gene annotations and 25% of the genes (this is done to reduce the computational overhead). Third, we randomly hide [ 10,25,50] percent of the gene-annotation connections from the bipartite subgraph. Since this random selection could cause a feature to lose all connections, we use a constrained sampling strategy: the subsampled graph must contain at least one non-zero entry for each feature. Fourth, we apply the RWR algorithm to the sparsified and non-sparsified graphs, separately. We repeat this process 25 times, using a different random graph each time. By comparing the RWR scores between the hidden and unknown connections, we can determine whether our method rediscovers hidden connections.

Validation of gene-drug prediction
We use a different strategy to validate RWR for drug-response prediction. Since we have the gene-drug and gene-gene interaction data for 5 cell lines (A375, HA1E, HT29, MCF7 and PC3), we can set aside the known gene-drug responses for 1 cell line (PC3) as a "ground truth" test set. Then, we can use a composite of the remaining 4 gene-drug graphs to predict the gene-drug responses for the withheld cell line. This is done in two steps. First, we use the averaged gene-drug data for 4 cell lines (a general drug graph) and the gene-gene data for PC3 (a specific gene graph) to impute the gene-drug response for PC3 (a specific drug graph). In the second step, we use the genedrug data for PC3 (a specific drug graph) and its corresponding gene-gene data (a specific gene graph) to calculate the "ground truth" RWR scores for PC (a specific drug graph). The "ground truth" is the RWR scores when all PC3 drug-response experiments have been performed. With these two outputs, we can calculate the agreement between the imputed and "ground truth" RWR scores (using Spearman's correlation, MSE, and accuracy).

Exploratory application of gene-drug prediction
Having demonstrated that RWR can perform well for single-cell co-expression networks, and can make meaningful drug-response predictions from composite LINCS data, we combine these heterogeneous data sources to make personalized drug-response predictions for individual single-cell networks. This requires some data munging. First, we transform the ENGS features used by the single-cell data into the HGNC features used by LINCS (only including genes with a 1-to-1 mapping, resulting in 181 genes). Second, we build an HGNC co-expression network with φ i (for 5 folds of 5 patients, yielding 25 networks total). Third, we combine the composite LINCS gene-drug bipartite graph with each of the 25 HGNC single-cell networks. Fourth, we use our RWR algorithm to predict how 181 genes would respond to 1732 drugs for each patient fold. As above, we perform an analysis of variance (ANOVA) to detect inter-patient differences.

Why use single cells?
In this study, we analyze a previously published single-cell data set that measured the gene expression for 5 glioblastoma patients. A principal components analysis of these data show that the major axes of variance tend to group the cells according to the patientof-origin. Indeed, an ANOVA of gene expression with respect to patient ID reveals that 2204 of the 3022 genes have significantly different expression in at least one patient (FDRadjusted p < .05). This suggests that single-cell gene expression is unique to each patient.
Although it is possible to obtain sample-specific gene expression using bulk RNA-Seq, our approach to data integration exploits the graphical structure of sample-specific gene co-expression networks. scRNA-Seq, generating multiple measurements per individual, makes the computation of sample-specific co-expression networks straightforward (though others have proposed ways to estimate these from bulk RNA-Seq [48,49]).

Validation of gene-annotation prediction
The Gene Ontology (GO) project has curated a database which relates genes to biological processes (BP) and molecular functions (MF) (called annotations). The GO database has widespread use in bioinformatics for assigning "functional" relevance to sets of gene biomarkers [50]. Although GO organizes the semantic relationships between annotations as a directed acyclic graph, we could more simply represent the relationships as a bipartite graph. By combining a (fully-connected) gene co-expression graph with a (sparselyconnected) gene-annotation bipartite graph, RWR can predict new gene-annotation connections.
To test whether the RWR predictions are meaningful, we "hid" a percentage of known gene-annotation links (by turning 1s into 0s in the bipartite graph), and compared the RWR scores for the hidden gene-annotation links against those for the unknown links (see Methods for a definition of the RWR score). Figure 5 shows that the RWR scores for hidden connections are appreciably larger than for the unknown connections, confirming that RWR can discover real gene-annotation relationships from a single-cell gene co-expression network.

Exploratory application of gene-annotation prediction
Since single-cell RNA-Seq assays measure RNA for multiple cells per patient, we can use these data to build a personalized graph that describes the gene-gene relationships for an individual patient. In order to estimate the variation in these personalized graphs, we divided the cells from each sample into 5 folds (giving us 5 networks per-patient). Above, we show that RWR can discover real gene-annotation relationships. By combining the personalized graph (a kind of specific knowledge) with a gene-annotation bipartite graph (a kind of general knowledge), the RWR algorithm will score the gene-annotation connections for a given patient. From this, we can identify genes that may have a different functional importance in one cancer versus the others.
Taking a subset of the 50 genes with the largest inter-patient differences, we use RWR to compute personalized RWR scores. This results in 25 matrices (for 5 folds of 5 patients), each with 50 rows (for genes) and 369 columns (for BP annotations). Performing an ANOVA on each gene-annotation connection results in a matrix of 50x369 p-values. Figure 6 shows a heatmap of the significant gene-annotation connections (dark red indi- Fig. 5 The RWR scores for the hidden and unknown gene-annotation connections (faceted by the amount of sparsity). When known connections are hidden, RWR tends to give higher scores than when connections are unknown, suggesting that RWR can discover real gene-annotation relationships. Note that GO is an incomplete database: the absence of a gene-annotation connection is not the evidence of absence. For this reason, we do know whether the high scoring "Unknown" connections are false positives or previously undiscovered connections. All t-tests comparing "Hidden" and "Unknown" connections have p < 10 −15 cates a gene-wise FDR-adjusted p < .05). Figure 7A plots the per-patient RWR scores for 4 annotations of the BCL-6 gene that significantly differ between patients. BCL-6 is an important biomarker whose increased expression is associated with worse outcomes in glioblastoma [51]. Our analysis suggests that BCL-6 could have a larger role in inflammation for patients 3 and 5, but a larger role in cartilage development and translational elongation in patient 1. Of course, this hypothesis requires experimental validation.

Validation of gene-drug prediction
The NIH LINCS program has generated a large amount of data on how the gene expression signatures of cell lines change in response to a drug. By conceptualizing the baseline (drug-free) gene co-expression network as a complete graph of specific knowledge, and by re-factoring the average gene-drug response as a (weighted) bipartite graph of general knowledge, we can apply the same RWR algorithm to predict a cell's gene expression response to any drug. Since the modified RWR algorithm contains two channels-a To test whether RWR can make accurate predictions about how a gene in a cell would respond to a drug, we ran the RWR algorithm on the baseline (drug-free) gene coexpression graph of the PC3 cell line using a composite gene-drug graph of 4 different cell lines. We then compared these RWR scores with a "ground truth" (i.e., the RWR scores for when all PC3 drug-response experiments have been performed). The agreement between the composite gene-drug RWR scores and the "ground truth" gene-drug RWR scores tells us how well the composite gene-drug map generalizes to new cell types. Table 1 shows that agreement is high, especially for the top up-regulation and down-regulation events. This confirms that our composite gene-drug graph is useful for drug-response prediction.

Exploratory application of gene-drug prediction
The RWR algorithm can combine specific knowledge and general knowledge from disparate sources to make personalized recommendations. This makes RWR a potentially valuable tool for precision medicine.
As an exploratory analysis, we combine the personalized gene co-expression networks with the composite gene-drug graph from LINCS. By running the RWR algorithm on these two data streams, the RWR scores will now suggest how the expression of any gene might change in response to any drug for each of the 5 glioblastoma patients. Using an ANOVA, we identify hundreds of gene-drug connections with RWR scores that differ significantly between patients (gene-wise FDR-adjusted p < .05). Figure 7B shows an example of drugs that have different (negative channel) RWR scores for EGFR. It suggests that the anti-inflammatory drug valdecoxib and the anti-neoplastic drug salirasib may cause a stronger down-regulation of EGFR (a pan-cancer oncogene [52]) in patients 1 and 4 versus the others. The Supplementary Information includes a complete table of the unadjusted ANOVA p-values for the gene-drug inter-patient differences available in https://zenodo.org/record/3743897.

Limitations
We deployed our framework on only 5 individual patients. As such, we lack a sufficient sample size to test whether any inter-patient differences could be explained by known demographic or clinical phenotypes. It is worth noting that cancer cells are very heterogeneous and, depending on the location of the sample collection, the composition of cell types (and thus gene expression profiles) can change dramatically. As such, factors other than the patient-specific tumour profile, such as batch effects, could account for differences in the sample-specific gene co-expression networks. Such differences may be difficult to account for without careful experimental design and standardization. Although the scope of this paper is to prove the concept, we wish to remind the readers Table 1 Overall agreement (Spearman's correlation and MSE) and the accuracy of the overlap (for the top 5%, 10%, 25%, and 50% predicted scores), as calculated separately for the positive and negative channels. Overall, agreement is high, especially for the top up-regulation and down-regulation events that much care should be taken when translating the methodology to real-world clinical problem-solving. In the absence of experimental validation, we support our analyses using 2 forms of in silico validation, which together demonstrate that RWR can integrate sparse heterogeneous data to discover real biological activity. Although we find the in silico validation encouraging, we acknowledge that RWR is merely a prediction tool that recommends hypotheses, and that these predictions may change when the source of general knowledge changes. Experimental validation is needed to determine whether these hypotheses prove true in practice. Further work is needed to validate the clinical relevance of the proposed framework.

Conclusions
This manuscript describes a framework for combining patient-specific single-cell networks with public drug-response data to make personalized predictions about drug response. Importantly, our approach makes use of a generic framework, and so can be applied to combine many kinds of data. We think the targeted analysis of personalized single-cell networks is promising, and could offer a new direction for precision medicine research.
We conclude with some perspectives on what the future of personalized network analysis may hold. Although RWR can handle sparse heterogeneous data, the positive and negative information obtained for each node can be infinitesimally small. One might address this by first transforming the RWR probabilities. Otherwise, we note that RWR is computationally expensive, making the analysis of high-dimensional data prohibitively slow. One might address this by pre-training a deep neural network to provide an approximate RWR solution. These improvements could help scale personalized predictions to larger graphs.