 Methodology
 Open Access
 Published:
Joint analysis of multiple highdimensional data types using sparse matrix approximations of rank1 with applications to ovarian and liver cancer
BioData Mining volume 9, Article number: 24 (2016)
Abstract
Background
Technological advances enable the costeffective acquisition of MultiModal Data Sets (MMDS) composed of measurements for multiple, highdimensional data types obtained from a common set of biosamples. The joint analysis of the data matrices associated with the different data types of a MMDS should provide a more focused view of the biology underlying complex diseases such as cancer that would not be apparent from the analysis of a single data type alone. As multimodal data rapidly accumulate in research laboratories and public databases such as The Cancer Genome Atlas (TCGA), the translation of such data into clinically actionable knowledge has been slowed by the lack of computational tools capable of analyzing MMDSs. Here, we describe the Joint Analysis of Many Matrices by ITeration (JAMMIT) algorithm that jointly analyzes the data matrices of a MMDS using sparse matrix approximations of rank1.
Methods
The JAMMIT algorithm jointly approximates an arbitrary number of data matrices by rank1 outerproducts composed of “sparse” leftsingular vectors (eigenarrays) that are unique to each matrix and a rightsingular vector (eigensignal) that is common to all the matrices. The nonzero coefficients of the eigenarrays identify small subsets of variables for each data type (i.e., signatures) that in aggregate, or individually, best explain a dominant eigensignal defined on the columns of the data matrices. The approximation is specified by a single “sparsity” parameter that is selected based on false discovery rate estimated by permutation testing. Multiple signals of interest in a given MDDS are sequentially detected and modeled by iterating JAMMIT on “residual” data matrices that result from a given sparse approximation.
Results
We show that JAMMIT outperforms other joint analysis algorithms in the detection of multiple signatures embedded in simulated MDDS. On real multimodal data for ovarian and liver cancer we show that JAMMIT identified multimodal signatures that were clinically informative and enriched for cancerrelated biology.
Conclusions
Sparse matrix approximations of rank1 provide a simple yet effective means of jointly reducing multiple, big data types to a small subset of variables that characterize important clinical and/or biological attributes of the biosamples from which the data were acquired.
Background
Advances in array technology, highthroughput sequencing, and clinical imaging platforms enable the measurement of ten’s of thousands of variables of a specific data type in a fixed set of tissue samples [1–4]. Such “big” data types include genomewide measurements of messenger RNA (mRNA) and microRNA expression, DNA methylation, single nucleotide polymorphisms (SNPs), nextgeneration sequence data, and quantitative features extracted from Positron Emission Tomography (PET) images.
The measurement of p > 1 variables of a given data type obtained from a collection of n > 1 samples can be organized into a p × n data matrix D with rows representing variables and columns representing measurements of the p variables in each of the n samples. For big data types we have p ≫ n, making such “tall and thin” matrices difficult to analyze using standard statistical techniques due to a severe multiple comparisons problem and low SignaltoNoise Ratio (SNR) [1, 5, 6]. The low SNR is due in large part to the relatively small number of variables (out of many thousands measured) that truly represent a Signal of Interest (SOI) in the data that is associated with an important biological and/or clinical attribute of the samples. In this context, we are interested in selecting s > 0 rows of D that best approximate a dominant SOI in the rowspace of D that may represent a clinically and/or biologically significant attribute of the samples. We call this subset of variables a signature in \( D \), and if \( D \) is big, then we assume that the signature is “sparse” in \( D \), i.e., s ≪ p.
MMDSs pose even greater analytical challenges since the goal is to jointly analyze two or more data matrices in an integrated manner, which exacerbates problems related to data dimensionality and SNR ‘[1, 2, 7]. As before, the goal is to detect sparse signatures for each data type that individually, or in combination, explain a SOI that characterizes an important biological and/or clinical attribute of the samples. Unfortunately, the lack of analytical tools for the joint analysis of multiple data types has slowed the discovery of novel predictive biomarkers and therapeutic targets that account for interactions between networks of diverse molecular species across space and time. Falling data acquisition costs have resulted in MMDS accumulating at an exponential rate in academic research laboratories, private industry, and public data repositories such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) [3, 8, 9]. This growing inventory of multimodal data presents a major analytical bottleneck in the translation of big, genomic data sets into clinically actionable knowledge.
Formally, the measurements for K > 1 different data types collected from a common set of n biospecimens, S_{ n } = {ς_{1}, ς_{2}, …, ς_{ n }}, can be represented by a collection of K data matrices, \( \mathfrak{D}={\left\{{D}_k\right\}}_{k=1}^K \), where: i) D_{ k } is the p_{ k } × n data matrix representing measurements for the kth data type; and ii) at least one of the D_{ k } is big, i.e., p_{ k } > > n. We assume that each D_{ k } has been appropriately preprocessed as function of its data type. For example, preprocessing of mRNA data would likely involve log2transformation, quantile normalization, and rowcentering, while a methylation data matrix would be transformed from Betavalues to Mvalues prior to normalization and rowcentering [10, 11]. Following Friedland and others [12–14], let \( D=D\left(\mathfrak{D}\right) \) be the p × n supermatrix that vertically “stacks” each of the preprocessed p_{ k } × n matrices \( {D}_k\in \mathfrak{D} \) along their columns where p = ∑ ^{K}_{k = 1} p_{ k }. We assume that D is appropriately scaled by its Frobenius norm to account for differences in the number of rows and dynamic range of the different D_{ k }’s. Then the joint analysis of \( \mathfrak{D} \) involves the identification of s > 0 rows of the supermatrix D that models a univariate SOI in the rowspace of D as a linear combination of the selected rows. The set of s variables associated with the selected rows define a MultiModal SIGnature (MMSIG) of D denoted by ζ where s = dim(ζ). If the SOI is highly correlated with an important biological or clinical attribute of the samples, then ζ explains and helps to interpret the sample attribute of interest in terms of the selected variables. Note that since D is big (i.e., p > > n), we want ζ to be sparse in D, (i.e., s ≪ p) to facilitate downstream interpretation and model validation. [15].
Matrix approximations of rank1 provide an efficient way of jointly analyzing the matrices of \( \mathfrak{D} \) [16–18]. For example, assume the supermatrix D has rank R > 0 and let D = ∑ ^{R}_{r = 1} u_{ r }σ_{ r }v ^{T}_{ r } be the Singular Value Decomposition (SVD) of D where: a) u_{ r } ∈ ℝ^{P} is the rth leftsingular vector (i.e., the rth eigenarray); b) v_{ r } ∈ ℝ^{n} is the rth rightsingular vector (i.e., the rth eigensignal); and c) σ_{ r } ∈ ℝ is the rth singular value for i = 1, 2, …, R. Then the outerproduct, u_{1}σ_{1}v ^{T}_{1} , is the best rank1 approximation of D in a least squares sense and v_{1} represents the dominant SOI on the columns of D that is linearly modeled in terms of the p rows of D weighted by the “loading” coefficients of u_{1} [16]. Let ζ_{ SVD } denote the signature that selects the rows of D with nonzero coefficients in u_{1}. If D is big, then p = dim(ζ_{ SVD }) is large since the SVD in general assigns a nonzero loading to each row of D, which poses problems for downstream validation and interpretation of v_{1} in terms of the p variables of ζ_{ SVD }.
Instead, we apply the BET ON SPARSITY (BEST) principle that states that if p > > n, then it is best to assume that v_{1} is sparsely supported by a small number of rows of D, and employ an ℓ_{1} penalty to identify these rows [19]. If the sparsity assumption is true, then v_{1} will be optimally modeled by the selected rows; otherwise no method will be able to recover the underlying model without many more samples (i.e., Bellman’s curse of dimensionality [20].) Taking the BEST approach, we developed the Joint Analysis of Many Matrices by ITeration (JAMMIT) algorithm that approximates D by the rank1 outerproduct, D ≈ uv^{T}, where u ∈ ℝ^{p} is a sparse eigenarray of “loading” coefficients and v ∈ ℝ^{n} is nonsparse, eigensignal of “scores” that potentially explains an important biological and/or clinical attribute of the samples [21, 22]. The algorithm uses an “asymmetric” version of the Least Absolute Shrinkage and Selection Operator (LASSO) that regularizes u but not v as a function of a ℓ_{1} penalty term selected based on false discovery rate (FDR). The small number of nonzero coefficients of u define a sparse MMSIG in D that supports a sdimensional, linear model of v such that s ≪ p. Since a given MMDS is likely to contain multiple SOIs of biological or clinical relevance, the JAMMIT algorithm is iteratively applied to the residuals of the current model to identify and select any additional SOI that may be present in the data (see Methods Section under The JAMMIT algorithm for more details). Figure 1 shows a specific instance of a JAMMIT analysis of three big data types for ovarian cancer downloaded from TCGA. Here, the information processing flows from left to right in five steps illustrating how three large data matrices are reduced to three relatively small typespecific signatures shown in step 4. Also shown is postJAMMIT processing illustrating the additional pathway and matrix analysis that is needed to further reduce signature dimensionality without the loss of information. We note that the entire processing chain results in mRNA signatures that associate immune checkpoint signaling in the tumor microenvironment with response to chemotherapy.
Other methods based on matrix factorizations have been proposed for the joint analysis of multiple data types such as the Generalized Singular Value Decomposition (GSVD), Joint and Individual Variation Explained (JIVE), DISCOSCA, Partial Least Squares (PLS), and Canonical Correlation Analysis (CCA) [12, 13, 18, 23–25]. These methods suffer from the same problem as the SVD in that they minimize the ℓ_{2} norm of the estimation error and assign nonzero weights to all p rows of D [26]. A number of techniques can be used to reduce the dimensionality of the selected model such as: i) rotation of principal components as implemented in factor analysis; ii) ignoring loadings smaller than some threshold; and iii) restricting the range of the loadings to a small discrete set of values [21, 27]. Unfortunately, these methods are prone to high false positive rates and poor sensitivity especially in situations where the SNR is low. Regularized versions of Principal Components Analysis (PCA), SVD, CCA, and PLS have been proposed for sparse signal detection and dimensionality reduction, but application of these methods to the supermatrix that “stacks” an arbitrary number of data matrices is not explicitly discussed [21, 26, 28–30]. Finally, many of the methods outlined above focus on maximal rankk approximations of D where k is significantly greater than one, which precludes the use of resampling methods in the selection of the best ℓ_{1} penalty due to the high computational cost [30].
In what follows, we describe in greater detail a workflow for the joint analysis of multiple data types based on the JAMMIT algorithm. A section on methods provides technical detail on the algorithm and the computational tools used to evaluate the statistical significance, biological coherence, and clinical relevance of JAMMITderived signatures. We then present and discuss results of: 1) a study that compared JAMMIT detection performance against that of other joint analysis algorithms on simulated data; ii) a JAMMIT analysis of global mRNA, microRNA and DNA methylation data for ovarian cancer downloaded from TCGA; and iii) a JAMMIT analysis of wholegenome mRNA data for liver cancer supervised by quantitative features derived from PET imaging data. A discussion and conclusions are presented in a final section.
Methods
Joint Analysis of Many Matrices by Iteration (JAMMIT)
Let D = {D_{ k }} ^{K}_{k = 1} denote a collection of p_{ k } × n data matrices D_{ k } that represents a MMDS acquired from a common set of n biospecimens, S_{ n } = {ς_{1}, ς_{2}, …, ς_{ n }}. Let D = stack(D) denote the p × n supermatrix of D where p = ∑ ^{m}_{k = 1} p_{ k }. We assume that at least one D_{ k } is big, so that the supermatrix D is also big. We assume each D_{ k } has been individually preprocessed as a function of its data type as discussed in the previous section and that D is scaled by its Frobenius norm such that if \( D=\left[{d}_{ij}\right] \) is a \( p\times n \) matrix, then \( D\leftarrow D\bullet / {\left\Vert D\right\Vert}_{Frob} \) where: 1) \( {\left\Vert D\right\Vert}_{Frob}={\left({\sum}_i\sum_j{\left{d}_{ij}\right}^2\right)}^{\raisebox{1ex}{$1$}\!\left/ \!\raisebox{1ex}{$2$}\right.} \) is the Frobenius norm of \( D \); and 2) \( D/{\left\Vert D\right\Vert}_{Frob}=\left[{d}_{ij} / {\left\Vert D\right\Vert}_{Frob}\right] \).
For λ > 0, the JAMMIT algorithm generates following rank1 approximation of D
by minimizing the error function
subject to the constraint
where: 1) uv^{T} ∈ ℝ^{p × n} is the outer product of u ∈ ℝ^{p} and v ∈ ℝ^{n}; 2) u is sparse relative to p, i.e., \( s\ll p \); 3) v represents a SOI on the columns of D; 4) λ > 0 is an ℓ_{1} penalty on u; and 5) \( {\left\Vert u\right\Vert}_{\ell_1}={\displaystyle {\sum}_{i=1}^p\left{u}_i\right} \) is the ℓ_{1} norm of u ∈ ℝ^{p}. Starting with an initial ℓ_{2} approximation (u^{(0)}, v^{(0)}) based on the SVD of \( D \) such that D ≈ u^{(0)}(v^{(0)})^{T}, JAMMIT first obtains a ℓ_{1} regularized solution vector u^{(1)} ∈ ℝ^{P} defined by
then substitutes this solution in (3) to obtain v^{(1)} ∈ ℝ^{n} and the solution (u^{(1)}, v^{(1)}) that satisfies D = u^{(1)}(v^{(1)})^{T}. Hence, the equality constraint in Eq. (3) ensures the outer product uv^{T} in Eq. (2) represents a rank1 approximation of D under the ℓ_{1} norm. This procedure is repeated by alternating between (2) and (3) until the sequence (u^{(i)}, v^{(i)}) converges to a solution (u, v) based on the error function given in (2) such that
Let ζ(λ) ∈ ℝ^{s} denote the MMSIG with nonzero entries that correspond to \( s=s\left(\zeta \right)>0 \) rows of D that support the sparse linear model in (5) as a function of λ. We note that: i) λ = 0 implies that (1) is the best rank1 approximation of D based on the SVD; ii) λ > 0 implies that (1) is a ℓ_{1} regularized, rank1 approximation of D such that s = dim(ζ) ≤ p; and iii) there exists λ^{sup} > 0 such that \( 0\le s\le p \) if \( \lambda \in \left(0,{\lambda}^{sup}\right) \). We show empirically that for simulated and real multimodal data, one can find λ* ∈ (0, λ^{sup}) based on an empirical estimate of FDR such that ζ(λ*) is sparse in D, i.e., s(λ*) = s* < < p.
Equation (5) suggests that parsing the vector u according to the order in which the D_{ k }’s were stacked in D results in individual rank1 approximations
where \( {u}_k\in {\mathrm{\mathbb{R}}}^{s_k} \) is unique to each D_{ k } and v represents the SOI in (1) that is shared by each D_{ k }. Eq. (6) implies that the MMSIG ζ* = ζ(λ*) = ζ*(D) can be similarly parsed into typespecific signatures ζ ^{*}_{ k } = ζ*(D_{ k }) according to the stacking order of the D_{ k }’s in D that explain v in terms of the kth data type only. Moreover, we have observed empirically that the sparsity of ζ* implies that the typespecific signatures ζ ^{*}_{ k } in D_{ k } are also sparse if D_{ k } is big. Moreover, analysis of simulated and real MMDSs show that the algorithm will still select significant rows of D_{ k } even if D_{ k } is not big. Table 1 outlines the key steps of a single iteration of the JAMMIT algorithm for computing joint rank1 approximations of each \( {D}_k \) of a given supermatrix D.
Note that JAMMIT detects and models the most dominant SOI in D and that weaker SOI of biological and/or clinical importance could be present in D that are masked by the dominant SOI. Hence, we “residualize” D by
and use JAMMIT to sparsely model the most significant SOI in D′ [18]. This procedure is iterated until no statistically significant MMSIG are detected and modeled. In any case we hypothesize that the number of iterations is bounded by \( {R}^{*}=\underset{k}{ \min}\left[ rank\left({D}_k\right)\right] \).
Selecting an ℓ_{1} penalty based on false discovery rate (FDR)
For actual experimental data, empirical FDR was used to select an ℓ_{1} penalty that results in a MMSIG of desired size and statistical significance. Briefly, FDR was estimated for a monotone increasing sequence of λ’s denoted by
such that λ_{1} = 0 results in the MMSIG provided by the SVD and λ_{ L } is the smallest λ that results in a MMSIG of length zero. The presence of statistically significant rowcorrelations between the matrices of D is indicated by a sequence of total FDR values,
that decreases rapidly as a function of increasing λ. In this case, a λ* ∈ Λ can be selected such that: a) Θ(λ*) ∈ Θ(Λ) is a local minimum that is smaller than some predetermined threshold; and b) the resulting signature, ζ* = ζ(λ*), is sparse in D. Conversely, a FDR sequence, Θ(Λ), that fails to decrease fast enough may preclude the selection of a λ* ∈ Λ that is less than a predetermined threshold and suggests a lack of support from one or more of the \( {D}_k^{\prime }s \) for the SOI. Note that a “joint” FDR sequence, Θ(Λ), can be decomposed into a collection of typespecific FDR sequences, Θ(Λ) = {Θ_{ k }(Λ)} ^{K}_{k = 1} based on the stacking order of the D_{ k }’s in D. Here, Θ_{ k }(Λ) represents the FDR sequence for the kth subsignature, ζ ^{*}_{ k } of ζ* (see Additional file 1). Again, the presence of a sparse subset of variables in D_{ k } that support the common SOI in a statistically significant way is signaled by a rapidly decreasing sequence of FDR values in Θ_{ k }(Λ), while the absence of any rowsupport is indicated by a slowly decreasing FDR sequence, Θ_{ k }(Λ), for k = 1, 2, …, K. It follows that if all \( {D}_k \) sparsely support the SOI, then all Θ_{ k }(Λ) will rapidly decrease in unison for increasing \( \lambda \). Additional file 1 provides more detail on how the FDR sequences Θ(Λ) and Θ_{ k }(Λ) were generated.
Simulated data
The detection performance of JAMMIT and other joint analysis algorithms were evaluated on 1000 simulated MMDS using Receiver Operating Characteristic (ROC) analysis (see subsection below entitled Area under the ROC curve as a function of the ℓ_{1} penalty parameter). Simulated MMDS, D^{(η)} = {D ^{(η)}_{ k } } ^{2}_{k = 1} = {(Σ ^{(η)}_{ k } + Ν ^{(η)}_{ k } )} ^{2}_{k = 1} , for η = 1, 2, …, 1000 were generated where p_{1} and p_{2} were randomly selected from P = {1000, 2000, …, 10000}. Here, Σ ^{(η)}_{ k } and Ν ^{(η)}_{ k } represent simulated signalonly and noiseonly data matrices, respectively, of dimensions p ^{(η)}_{ k } × 50 for k = 1, 2 and η = 1, 2, …, 1000. For each η, the supermatrix D^{(η)} = stack(D^{(η)}) = Σ^{(η)} + Ν^{(η)} was assembled where: 1) p^{(η)} = p ^{(η)}_{1} + p ^{(η)}_{2} ; 2) Σ^{(η)} = stack(Σ ^{(η)}_{1} , Σ ^{(η)}_{2} ); and 3) Ν^{(η)} = stack(Ν ^{(η)}_{1} , Ν ^{(η)}_{2} ).
The support of Σ ^{(η)}_{ k } in D ^{(η)}_{ k } , denoted by Supp(D ^{(η)}_{ k } ), corresponds to the nonzero components of I_{ η } = stack(I ^{(η)}_{ k } (step), I ^{(η)}_{ k } (rand)) that identify the rows of D ^{(η)}_{ k } that contain signals SS1 or SS2 defined on the 50 columns of each supermatrix D^{(η)}. Here, SS1 and SS2 represent step and random functions defined on the columns of the supermatrix D^{(η)}. The signaltonoise ratio (SNR) of D^{(η)} in decibels is given by \( SNR\left({D}^{\left(\eta \right)}\right)=10\times { \log}_{10}\left(\frac{\operatorname{var}\left({\overset{\frown }{\varSigma}}^{\left(\eta \right)}\right)}{\operatorname{var}\left({\overset{\frown }{N}}^{\left(\eta \right)}\right)}\right) \) where\( {\overset{\frown }{\varSigma}}^{\left(\eta \right)},\ {\overset{\frown }{N}}^{\left(\eta \right)}\in {\mathrm{\mathbb{R}}}^{50p} \) represent vectorized versions of Σ^{(η)} and Ν^{(η)}, respectively. The goal of each simulation is to detect Supp(D^{(η)}) such that the true positive rate is maximized for a given false positive rate over a wide range of SNR scenarios. Additional file 2 provides more detail on the generation of simulated signalonly and noiseonly data matrices, Σ ^{(η)}_{ k } and Ν ^{(η)}_{ k } , respectively, for η = 1, 2, …, 1000.
Area under the ROC curve as a function of the ℓ_{1} penalty parameter
JAMMIT analysis of a simulated stacked matrix requires the specification of an ℓ_{1} penalty parameter λ > 0 in eq. (2), which results in a signature ζ(λ) such that s = dim(ζ(λ)). We note that the regularized minimization of (2) is equivalent to the unregularized minimization of E(u, v) = ‖S − uv^{T} ‖ ^{2}_{2} constrained by ‖u(λ)‖_{1} ≤ 1/λ, where the ℓ_{1} parameter λ behaves like a threshold on the components of u(λ) ∈ ℝ^{p} such that larger values of λ result in lowerdimensional signatures [22, 31]. Hence, for a given simulated MMDS and \( \lambda >0 \), we can compute the sensitivity and specificity of JAMMIT to detect a signature in \( D \) that supports a simulated SOI in the rowspace of \( D \). Consider the monotonically increasing sequence of λ_{ k } ' s (denoted by \( \Lambda \)) defined in (8). We compute the sensitivity and specificity for each \( \lambda \in \Lambda \) and plot \( sensitivity \) (true positive rate) vs. \( 1  \mathrm{specificity} \) (i.e., false positive rate) parameterized by \( \lambda \) to generate a ROC curve. Area Under the ROC (AUROC) can then be used to quantify the ability of JAMMIT to detect the true support for a simulated signal embedded in a simulated supermatrix D. The detection performance of JAMMIT or any other detection algorithm can be compared by computing the difference between the AUROC values for JAMMIT and an alternative algorithm (ΔAUROC). A positive ΔAUROC value implies JAMMIT outperformed the alternative algorithm; otherwise the alternative algorithm outperformed JAMMIT.
Analysis of multimodal data for ovarian cancer downloaded from TCGA
Genomewide mRNA, microRNA and DNA methylation data obtained from 291 tumor samples from patients with clinical stage 3 serous ovarian cancer were downloaded from TCGA (http://cancergenome.nih.gov/). This data download resulted in three highdimensional data matrices of dimensions 16020 × 291 (mRNA), 799 × 291 (microRNA) and 15418 × 291 (DNA methylation), each of which were logtransformed, quantilenormalized, centered, and scaled by their respective Frobenius norms prior to formation of an ovarian MMDS denoted by D_{ OVCA }. Clinical metadata for each patient were also downloaded from TCGA and aligned with the columns of the supermatrix of D_{ OVCA }. These data included censored survival time, age, stage, and treatment information. Subsequent to formation of D_{ OVCA }, additional wholegenome mRNA data for tumors obtained from 99 patients with Stage 3 disease were downloaded from TCGA along with associated clinical metadata. These data were organized to form a mRNA data matrix that was used to assess the robustness of any associations with overall survival with mRNA expression patterns found in the discovery data set represented by D_{ OVCA }.
JAMMIT analysis of transcriptomic and PET imaging data for liver cancer
Twenty patients referred for surgical resection of liver tumors were prospectively recruited to participate in an institutional reviewboard approved clinical research study with written informed consent. Prior to surgery, these patients underwent liver imaging with a Philips Gemini TF64 PET/CT scanner (Philips Healthcare, Andover, Massachusetts) using 18Ffluorocholine under an investigational new drug protocol. In a previous singleinstitution clinical trial, 18Ffluorocholine, a tracer of choline phospholipid synthesis, affords PET/CT with relatively high diagnostic sensitivity for HCCs [32, 33]. Presently, less is known regarding the diagnostic utility of 18Ffluorocholine for ICCs and other subtypes of liver cancer. Regions of interest (ROI) analysis of the PET/CT images were used to generate time activity curves corresponding to: 1) the arterial pool in the descending aorta; and 2) areas of tissue within the liver that corresponded to the tumor and adjacent liver samples profiled by expression arrays. PET kinetic analysis was then applied based on a 2tissue compartment (2TC) model of 18 Ffluorocholine pharmacokinetics in liver tumor and liver tissue [34, 35]. Pharmacokinetic parameters \( {K}_1,\;{k}_2,\;{k}_3,\;{k}_4,\;{K}_1/{k}_2 \), and \( Flux \) for each 2TC model corresponding to each sample were estimated using PMOD 3.4 (PMOD Technologies, Zurich Switzerland) and assembled to form a \( 6\times 50 \) Pet kinetics data matrix for the 50 tissue samples included in the experiment.
Tumor and adjacent nontumor liver tissue specimens were obtained subsequently during surgery, and RNA was extracted from homogenized frozen tissue lysates in RLT Plus buffer with the AllPrep DNA/RNA Mini kit (Qiagen, Valencia, CA) following manufacturer’s protocol. The isolated RNA was stored at 80^{0}0 until used. The quality of the total RNAs was checked on a Bioanalyzer using RNA 6000 Nano chips (Agilent, Santa Clara, CA). The RNA samples were processed following the WGDASL assay protocol (Illumina Inc., Sunnyvale, California) and the resulting PCR products were hybridized onto the Illumina HumanHT12 v4 Expression BeadChips included over 24,000 transcripts with genomewide coverage of wellcharacterized genes, gene candidates, and splice variants. Arrays were scanned using the iScan^{TM} instrument and expression levels were quantified using GenomeStudio software (Illumina Inc., Sunnyvale, CA).
Genelevel expression values were assembled to form a 20792 × 50 data matrix where the rows represented 20792 genes and columns represented 50 adjacentnormal and tumor samples obtained from 20 patients. Here, columns 1–20 of the data matrix represented adjacentnormal samples while columns 21–50 represented 30 liver tumors of which 22 were hepatocellular carcinomas (HCCs), 6 were intrahepatic cholangiocarcinomas (ICC) and 2 were sarcomas. The data matrix was preprocessed by generalized log2 transformation with background subtraction, quantile normalization, and row centering [36].
Eigensurvival analysis
Let \( D \) be a \( p\times n \) data matrix where \( p\gg n \) and let \( \zeta (D) \) denote the s × n submatrix of \( D \) composed of rows from \( D \) that correspond to the variables (i.e., matrix rows) of a JAMMITderived signature ζ. Alternatively, the columns of ζ(D) can be viewed as “realizations” of the signature ζ in each of the n patients used to formulate \( D \). Let Ω(D) be a 2 × n survival data matrix for D where the 1st row contains observed timetodeath for the n patients of D and the 2^{nd} row is a binary indicator of censorship for each patient (0=uncensored, 1=censored). We extracted an EigenSurvival Model (ESM) based on the SVD of ζ(D) to reduce the negative impact of random noise and systematic errors on the prediction of overall survival [37, 38]. The ESM was then used to compute prognostic scores for each patient, and patients with scores in the top and bottom quartiles of scores were identified. The signature ζ(D) was predictive of survival if and only if differences in survival between patients with scores in the top and bottom quartiles were significant in both the KM and Cox regression models with pvalue of 0.05 or less. Additional file 3 provides more detail on the workflow used to extract an ESM for a given signature.
Ingenuity Pathway Analysis (IPA)
Ingenuity Pathway Analysis (IPA, QIAGEN Redwood City, California) was used to rapidly profile a given mRNA signature for enrichment in genes, canonical pathways, biological processes and upstream regulators related to cancer. In particular, IPA’s Upstream Regulator Analysis (IPA/URA) feature was used to decompose a given JAMMITderived signature into lowerdimensional subsignatures composed of genes that are targeted by a single upstream regulatory molecule. In this analysis, an upstream regulator can be a chemokine, cytokine, transcription factor, drug, etc. and IPA computes an activation score and intersection pvalue for the targeted subset of genes. The activation score measures the consistency between the observed effect of the predicted regulator on the targeted variables in our data and the predicted effect based on current knowledge as encoded in IPA. The intersection pvalue measures the probability of a chance association between the predicted upstream regulator and its downstream targets that reside in a given signature. Note that a predicted upstream regulator does not have to be a member of the signature. Activation scores greater than 2.0 and pvalues less than 1.0E03 are considered significant. Signatures that are “anchored upstream” in this way inherit the function of this regulator and are thus easier to interpret biologically. IPA also generates hypotheses regarding the genes and pathways that may explain the downstream effects of a given signature on biological and disease processes.
Results and discussion
JAMMIT performance on simulated data
The effectiveness of JAMMIT to detect multiple signals in simulated data sets was evaluated and compared to other algorithms such the JIVE and PLS. JIVE is a generalization of Principal Components Analysis (PCA) to multiple data matrices. Like JAMMIT, PLS enables the supervised analysis of one matrix by another matrix and is also used for the analysis highdimensional data sets [24]. All three algorithms were applied to the same collection of 1000 simulated MDS’s (see Methods section, Simulated Data) and tasked to detect two sparsely supported signals, SSig1 and SSig2, that were embedded in the data matrices of each simulation over a widerange of SNR scenarios. SSig1 represents a noisy signal for differential expression that distinguishes the first 25 samples of the simulation from the last 25 samples. SSig2 on the other hand represents a random signal that is sparsely supported by rows in both data matrices that represents an unmeasured and/or unknown biological attribute of the samples.
The goal of each simulation is to detect the sparse support of SSig1 and SSig2 in each simulated data matrix. Figure 2 shows distributions of ΔAUROC values that compares the ability of JAMMIT to detect the support of SSig1 and SSig2 versus that of JIVE and PLS in 1000 data simulations. For example, the first row of plots shows that the distributions of ΔAUROC values for SSig1 and SSig2 are concentrated on the positive real axis. This means that the AUROC values for JAMMIT exceeded that of JIVE more frequently than not for SSig1 and SSig2, with pvalues of 4.33.E15 and 1.99E73, respectively. Similarly, the second row of plots shows that the area under the ΔAUROC distributions for both SSig1 and SSig2 is concentrated on the positive real numbers indicating that JAMMIT outperformed PLS significantly more often than not over 1000 data simulations with pvalues of 1.68E10 and 6.39E61, respectively. Hence, relative to JIVE and PLS, we see that JAMMIT compares favorably in terms of ability to detect the sparse support of a step and random signal in multiple, highdimensional data sets.
JAMMIT analysis of ovarian cancer data from TCGA
A MMDS composed of global mRNA, microRNA and DNA methylation data obtained from 291 ovarian tumors resected from patients with stage 3 disease were downloaded from TCGA and jointly analyzed using JAMMIT. The goal was to determine if MMSIG exist that distinguished subtypes of ovarian cancer that lead to different clinical outcomes. LeaveOneOut CrossValidation (LOOCV) based on JAMMIT was applied to D to identify a MMSIG for ovarian cancer that was robust to minusone perturbations of the 291sample discovery data set. First, a sequence of FDR values for a monotonically increasing sequence of ℓ_{1} penalty values was computed based on the JAMMIT analysis of 100 permuted versions of the supermatrix, D (see Methods section). An ℓ_{1} penalty parameter of λ_{291} = 0.002875.was selected based on an FDR of 0.0034619 that was a local minimum, which resulted in an mRNA signature ζ ^{(0)}_{ mRNA } composed of 643 genes, a miRNA signature ζ ^{(0)}_{ miRNA } composed of 368 microRNAs (FDR= 0.19912), a methylation signature ζ ^{(0)}_{ Meth } composed of 450 methylation loci (FDR = 0.03038), and a MMSIG ζ^{(0)} composed of 1461 mRNA, miRNA and methylation variables that were “stacked” in the order of the D_{ k }’s in D ( FDR = 0.067647) (see Additional file 4).
For the LOOCV analysis, the jth column of each D_{ k } of D was removed to obtain minusone MMDSs, D^{(j)} = {D ^{(j)}_{ k } } ^{3}_{k = 1} , and minusone stacks, D^{(j)} = stack(D^{(j)}) for j = 1, 2, …, 291. JAMMIT was then applied to each D^{(j)} with λ_{291} = 0.002875, which resulted in s_{ j } dimensional, minusone MMSIGs, ζ^{(j)}, for j = 1, 2, …, 291. On average, each ζ^{(j)} recapitulated 98 % of the s_{0} variables of ζ^{(0)} over all 291 minusone analyses implying that JAMMITderived signatures based on λ = λ_{291} are robust to minusone perturbations of the discovery data set. A single MMSIG defined by \( \zeta =\underset{j}{\cap}\kern0.2em {\zeta}^{(j)} \) was generated, which defined subsignatures composed of 534 mRNAs (ζ_{1}), 337 microRNAs (ζ_{2}) and 357 methylation loci (ζ_{3}) common to all 291 minusone MMSIGs.
Each typespecific signature obtained by JAMMIT was analyzed individually and in various combinations using hierarchical cluster analysis to identify “metagenes”, i.e., subsets of variables that exhibited coordinated, lowfrequency variation of expression over the 291 samples of the discovery data set. Such coherent variation offers the best opportunity to identify novel, lowdimensional signatures that capture important biological and/or clinical attributes of the tumor samples. Figure 3 shows hierarchically clustered heatmaps of the three typespecific signatures, ζ_{1}, ζ_{2}, and ζ_{3}, for mRNA, microRNA and methylation, respectively, and a MMSIG, ζ_{13}, that “stacked’ the mRNA and methylation signature. Here, the subscript “13” denotes the concatenation of the mRNA (1) and methylation (3) signatures derived by JAMMIT. This particular combination was chosen because the FDR values for ζ ^{(0)}_{1} and ζ ^{(0)}_{3} were highly significant compared to ζ ^{(0)}_{2} , which implied the typespecific signatures ζ_{1} and ζ_{3} best explained the common SOI shared by all three different data types. Visual examination of Fig. 3ac shows that the clustered heatmaps for each typespecific signature contained metavariables composed of matrix rows that exhibited coordinated patterns of variation, some of which are highlighted in yellow or green. In particular, the clustered heatmap for ζ_{13} in Fig. 3d contained the metagene, γ, (highlighted in green) that defined a MMSIG composed of 249 variables of which 209 were mRNAs (γ_{1}), and 40 were methylation loci (γ_{3}). Figure 4 shows that the MMSIG, γ, and the typespecific subsignatures, γ_{1}, and γ_{3} were all significantly associated with overall survival on the 291 discovery samples contained in S_{ n }. Interestingly, the signature that combined the mRNA and methylation variables had a more significant association with survival than signatures that contained only mRNA or only methylation variables based on logrank and Cox regression pvalues, median survival time, and 5year survival rate.
To further reduce signature dimensionality and to better understand the biology that underlay the association of γ with overall survival, we focused subsequent downstream analysis and interpretation on the 209gene mRNA signature, γ_{1}, using IPA. In particular, the Upstream Regulator Analysis (URA) feature in IPA was used to identify subsignatures of γ_{1} that were “anchored” upstream by a single regulating molecule. Table 2 shows that Interleukin 4 (IL4) was the top upstream regulator of γ_{1} that directly targeted 40 genes (out of 209) in the signature (Score=2.115 p=2.11E20). Note that activation scores greater than 2.0 and pvalues less than 1.0E03 are considered significant. The 40 genes in γ_{1} directly targeted by IL4 were used to define a mRNA signature φ ^{(40)}_{IL4} contained in γ_{1} that was “anchored” upstream by IL4. Figure 5 shows the results of an eigensurvival analysis based on the realization of φ ^{(40)}_{IL4} in the expression data for the 291 patients in the discovery data set. Figure 5a shows the clustered heatmatp of φ ^{(40)}_{IL4} realized in the training data set and Fig. 5b shows KM plots based on prognostic scores for each patient derived from the ESM extracted from the expression patterns in Fig. 5a. In Fig. 5b, we see that 144 patients with prognostic scores in the top and bottom quartiles have significantly different KM plots with logrank pvalue of 3.89E06 (logrankP). Moreover, a Cox regression model of overall survival based on prognostic scores for all 291 patients with age as a covariate had a pvalue of 1.68E07 (CoxP), which provides further validation of the eigensurvival model derived from expression patterns visualized in Fig. 5a. Figure 5c shows the clustered heatmap of the φ ^{(40)}_{IL4} signature realized in wholegenome mRNA data for 99 independent test tumor samples. The prognostic scores for the 99 test patients were computed by processing the expression patterns in Fig. 5c using the ESM derived from the expression patterns in Fig. 5a. Figure 5d shows that test patients with prognostic scores in the top and bottom quartiles have significantly different survival statistics (logrankP=2.08E03, CoxP=1.26E03). Hence, the ESM based on φ ^{(40)}_{IL4} captured information related to overall survival that was also applicable to the 99samples of the independent test data set that were unseen during discovery.
We further reduced the dimensionality of φ ^{(40)}_{IL4} based on the ESM extracted from the 291 discovery samples. Figure 6 shows a plot of the 40 loading coefficients associated with the ESM derived from expression patterns in Fig. 5a with 12 high magnitude coefficients highlighted in red. The 12 genes corresponding to these coefficients were assembled to form the mRNA signature, φ ^{(12)}_{IL4} , that was tested for association with overall survival on the 291sample discovery data set and the 99sample independent test data set. Figure 7a shows that ESM based on φ ^{(12)}_{IL4} in the 291 samples of the discovery data set was significantly associated with overall survival (logrankP=1.54E05, CoxP=1.06E04). Moreover, Fig. 7b shows that the ESM based on φ ^{(12)}_{IL4} realized in the discovery data generalizes to the 99 samples of the independent test data set (logrankP=9.70E03, CoxP=4.64E04). Interestingly, the set of 28 genes in φ ^{(40)}_{IL4} complementary to the genes in φ ^{(12)}_{IL4} failed to generalize on the 99 independent test samples. These results validate the BEST principle as implemented by JAMMIT for the joint analysis of multiple data sets in ovarian cancer.
Note that IL4 directly targets every gene in φ ^{(12)}_{IL4} per IPA. IL4 induces the transformation of Tumor Associated Macrophages (TAMs) that infiltrate the tumor microenvironment into the M2 phenotype, which confers a survival advantage to cancer cells and promotes tumor growth [39, 40]. An alternative pathway involving Interferon Gamma (IFNG) and Tumor Necrosis Factor Alpha (TNFA) transform TAMs into the M1 phenotype that exerts a cytotoxic effect on genetically mutated cancer cells. It has been reported that a high M1/M2 ratio is associated with extended survival in ovarian cancer patients [39]. This suggests that immune cell polarization in the tumor microenvironment impacts the overall survival of patients with ovarian cancer undergoing standard platinumbased chemotherapy combined with paclitaxel. Indeed, the φ ^{(12)}_{IL4} signature contains the Chemokine (CC motif) Ligand 2 (CCL2) gene, which is a chemokine that recruits monocytes from the bloodstream to the tumor microenvironment [41]. It has been reported that CCL2 is upregulated in ovarian cancer and the blockade of CCL2 protein expression enhances immunotherapeutic and chemotherapeutic response [41].
Imaginggenomics of liver cancer
Wholegenome expression data were collected for 20792 genes in 20 adjacentnormal, 22 hepatocellular carcinoma (HCC), 6 intrahepatic cholangiocarcinoma (ICC) and 2 sarcoma samples using DASL microarrays. The expression data were assembled to form a 20792 × 50 expression data matrix where columns 1–20 represented the normal samples and columns 21–50 represented the tumor samples. The data matrix of raw expression was preprocessed by generalized log2 transformation, quantile normalization, and rowcentering to obtain the preprocessed expression data matrix H_{ mRNA }. The values of six kinetic parameters, \( {K}_1,\;{k}_2,\;{k}_3,\;{k}_4,\;{K}_1/{k}_2,\; Flux \) obtained from 2TC models for each tissue sample formed the columns of a 6 × 50 data matrix that was rowcentered to obtain the PET data matrix, H_{ PET }. A final preprocessing step involved the scaling of the stacked matrix H_{ PETX } = stack(H_{ mRNA }, H_{ PET }) by its Frobenius norm. The goal of this analysis is to identify mRNA signatures that are highly correlated with the rows of the PET kinetic data matrix [42, 43].
Six different analyses of H_{ mRNA } based on JAMMIT were conducted where each analysis was supervised by a single PET kinetic parameter. That is, JAMMIT was applied to H ^{(l)}_{ PETX } = {H_{ mRNA }, H ^{(l)}_{ PET } } where H ^{(l)}_{ PETX } is a 1dimensional vector equal to the \( l \) th row of H_{ PET } for l = 1, 2, …, 6. Of the six possible analyses, only supervision by the \( {H}_{PETX}^{(5)}={K}_1/{k}_2 \) kinetic parameter resulted in a FDR profile that implied significant joint correlations between H_{ mRNA } and H_{ PET } (see Additional file 5). A locally minimal FDR* = 0.000549 was selected from the FDR profile for genes that corresponded to an \( {\ell}_1 \) penalty parameter value of λ* = 0.0089429. A JAMMIT analysis based on this value of λ resulted in a mRNA signature \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) containing 652 genes that was significantly correlated with the \( {K}_1/{k}_2 \) kinetic parameter. Persistently low FDR values for \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) as a function of λ implied a significant and robust correlation between \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) and the \( {K}_1/{k}_2 \) PET parameter over a widerange of sparse, linear models. Moreover, the dominant eigensignal of the 652 × 50 signature matrix, \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)}\left({H}_{mRNA}\right) \) was significantly correlated with the \( {K}_1/{k}_2 \) PET parameter (\( r=0.413,\;p=0.00293\Big) \). In sharp contrast, the FDR profiles for JAMMIT analyses of H_{ mRNA } supervised by the other PET kinetic parameters failed to produce an ℓ_{1} penalty that correlated the two data types (see Additional file 6). Note these results show that JAMMIT is able to identify significant variables of data types defined by a small number of variables. Indeed, the data matrix H_{ mRNA } described above has 20792 rows, while the PET kinetic data matrix, \( {H}_{PETX}^{(5)} \), has a single row composed of \( {K}_1/{k}_2 \) kinetic parameter values in 50 samples. Here, the FDR table for the joint analysis of H_{ mRNA } and \( {H}_{PETX}^{(5)} \) admits the single row of \( {H}_{PETX}^{(5)} \) into the sparse, rank1 approximation of D ^{(l)}_{ PETX } = stack{H_{ mRNA }, H ^{(l)}_{ PET } } for almost all ℓ_{1} parameter values (see Additional files 5 and 6).
Figure 8 visualizes the realization of \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) in H_{ mRNA } as a rowclustered heatmap where we see that aggregate gene expression is highly variable on the tumor samples (columns 21–50) compared to the normal samples (columns 1–20). Figure 9a shows a 2way clustered heatmap of \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) and here we see a group of genes in \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) that are preferentially downregulated on a set of 15 tumors relative to a complementary subset of fifteen (15) HCCs and twenty (20) normal samples. Let Γ^{(−)} denote the set of column indices of H_{ mRNA } that correspond to the samples where \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) is downregulated and Γ^{(+)} column indices for samples where \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) is upregulated. In Fig. 9b we see that the dominant eigensignal of the 2way, clustered heatmap in Fig. 9a clearly discriminates between the samples in Γ^{(−)} and Γ^{(+)} based on a threshold set at zero. The ability of \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) to discriminate between the samples in Γ^{(−)} and Γ^{(+)} suggests two distinct expression phenotypes for HCC represented by the seven (7) HCC in Γ^{(−)} and fifteen (15) HCC in Γ^{(+)}. Moreover, the coclustering of 7 HCC samples in Γ^{(−)} along with 6 ICC suggests that these HCC samples represent a cholangiolike HCC subtype (CLHCC), which may share clinical and biological attributes of this more aggressive subtype of liver cancer [44, 45].
Table 3 lists the top canonical pathways and upstream regulators of \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) according to IPA. The top upstream regulators included the nuclear receptors HNF4A, HNF1A, and FXR (NR1H4) where HNF4A and HNF1A were predicted to be inactivated with high statistical significance. Moreover, FXR/LXR and LXR/RXR Activation were the top canonical pathways and most of the genes in both pathways were downregulated suggesting inactivation of these pathways upstream of \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \). The dominate downstream effects of \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) per IPA included biological functions related to the dysregulation of lipid and bile acid metabolism as well as disease functions related to the initiation and progression of HCC and ICC. For example, the inactivation of HNF4A as a significant upstream regulator of \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) is consistent with published reports that HNF4A downregulation suppresses hepatocyte differentiation and commitment to the biliary lineage in ICC and CLHCC [44–47]. Moreover, loss of HNF1A function in hepatocytes leads to the activation of pathways involved in tumorigenesis [48]. Finally, HNF4A and FXR exhibit reduced expression in human HCC and ICC, and that mice lacking FXR expression spontaneously developed HCC [49–51].
We note the \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) signature included 46 membrane transport genes from the ATPBinding Cassette (ABC) and Solute Carrier (SLC) superfamilies, almost all of which were significantly downregulated in the tumor samples of Γ^{(−)} relative to the samples in Γ^{(+)}. Recall the dominant eigensignal of \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)}\left({D}_1\right) \) was found to be significantly correlated with the K_{1}/k_{2} PET parameter (\( r=0.413,\;p=0.00293\Big) \). The K_{1}/k_{2} parameter influorocholine PET images reflects the bloodtissue equilibrium of choline, a nutrient important for phospholipid and bile homeostasis, as well as lipid transform. Therefore, it is not surprising that the \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) signature contained a significant number of ABC and SLC membrane transport genes, since these genes regulate the influx and efflux of bile and lipids across the membranes of hepatocytes and cholangiocytes and are tightly regulated by nuclear receptors HNF4A, HNF1A and FXR [52]. The above suggests the inactivation of HNF4A, HNF1A and FXR upstream of \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) suppresses the uptake and efflux of bile and lipids downstream of \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) by downregulating the expression of specific ABC and SLC genes of \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \). In addition to the widespread disruption of bile acid and lipid homeostasis, the downregulation of membrane transporters in \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) directly impacts liver carcinogenesis and tumor progression. For example: i) SLC22A1 is associated with progression and survival in human ICC [53]; ii) knockout mice lacking ABCB4 suffer from the loss of biliary phospholipid secretion and spontaneously develop HCC [50]; iii) transporter genes ABCB1, ABCC6, ABCC9, ABCG2 are downregulated in prostate cancer [54]; iv) ABCB11/BSEP (Bile Salt Export Pump) and FXR expression is reduced in HCC [55]; and v) SLC22A1 is epigenetically silenced in human HCC [56].
Figure 10 shows the expression profiles of the ABCB11 gene (i.e., Bile Salt Export Pump or BSEP), in two different groupings of the samples: i) ICCvsHCC compares 6 ICC (columns 1–6) and 22 HCC (columns 7–28); and ii) NRMvsTUMOR compares 20 Normals (columns 1–20) and 30 Tumors (columns 21–50). The top panel of Fig. 10 shows that the ABCB11 gene is downregulated in the ICC samples (red squares) and CLHCC samples (green triangles) relative to the HCC samples (blue circles) in the ICCvsHCC data set based on a horizontal threshold set at zero. The bottom panel of Fig. 10 shows that ABCB11 is uniformly upregulated on the 20 normals and highly variable on the tumors with preferential downregulated on the ICC (red circles), CLHCC (green triangles) and sarcoma samples in the NRMvsTUMOR data set. The ABCB11 gene codes for a protein that facilitates the efflux of bile acids out of the liver and defects in the ABCB11 gene result in progressive familial intrahepatic cholestasis, which is a progressive liver disease that often starts early in life and rapidly progresses to endstage liver disease with an increased risk for HCC. The above suggests that ICC and CLHCC subtypes can be characterized in part by the suppression of bile acid efflux that is mediated by the downregulation of the ABCB11 transporter gene.
Figure 11 shows the expression profiles of nuclear receptors FXR and HNF4A and the SLC transporter genes SLC2A1/GLUT1 and SLC6A14 in the ICCvsHCC and NRMvsTUMOR experiments. Panels A and B of Fig. 11 confirm that both FXR and HNF4A are preferentially downregulated in ICCs relative to the HCC, uniformly upregulated on the normals relative to liver tumors, and highly variable on the tumors with preferential downregulation on the tumors in Γ^{(−)}. Panel C of Fig. 11 shows that unlike the nuclear receptors FXR and HNF4A, the SLC2A1/GLUT1 transporter is upregulated in ICC relative to HCC, uniformly downregulated on normals relative to tumors, and highly variable on tumors but with preferential upregulation on the tumors in Γ^{(−)}. In Fig. 11d, SLC6A14 shows strikingly high and specific upregulation on all 6 ICC and 5 of 7 CLHCC samples relative to the remaining 15 HCC samples in the ICCvsHCC experimental. Moreover, we see that SLC6A14 is uniformly downregulated on the normals compared to the tumors in NRMvsTUMOR with significant upregulation concentrated on the ICC and CLHCC samples. SLC6A14 is reported to be highly activated in cancers of the colon, cervix, breast, and pancreas, and the blockade of SLC6A14 has been suggested as a treatment for many solid tumors [57, 58]. The expression profiles in Fig. 11d supports the possibility that SLC6A14 may be a therapeutic target ICC and CLHCC.
The correlation between \( {\omega}_{mRNA}^{\left({K}_1/{k}_2\right)} \) and the K_{1}/k_{2} PET parameter suggests the expression phenotypes represented by Γ^{(−)} and Γ^{(+)} can be distinguished by the K_{1}/k_{2} parameter [42, 59, 60]. To test this hypothesis, we encoded the information content of the K_{1}/k_{2} parameter vector in a Generalized Regression Neural Network (GRNN) implemented in MATLAB (The MathWorks Inc., Natick, MA) after denoising by the Daubechies mother wavelet of order 3 over 5 scales [61–63]. The GRNN model was trained using a ‘spread” parameter set at 0.23235 that defines the level of smoothing of the GRNN output. Training of the GRNN was supervised by a binary target vector, T ∈ {0, 1}^{50}, where the samples in Γ^{(+)} and Γ^{(−)} were labeled with a “0” and “1”, respectively. Figure 12a visualizes the output of a GRNN trained on the K_{1}/k_{2} parameter for the 50 samples included in this study. Samples of the expression phenotype Γ^{(−)} are highlighted by red squares (ICC), green triangles (CLHCC) and black asterisks (sarcoma) while the samples in Γ^{(+)} (adjacentnormal and HCC) are highlighted as blue circles. The horizontal threshold (magenta line) was used to classify each of the 50 samples by assigning a sample to the Γ^{(−)} phenotype if its GRNN value was greater than the threshold and to Γ^{(+)} otherwise. Here, we see the GRNN trained on the denoised K_{1}/k_{2} vector correctly classified all the samples in Γ^{(−)} and 71 % of the samples in Γ^{(+)} for an average correct classification rate of 86 %, which is significantly greater than chance. We note that the GRNN output vector was significantly correlated with the target values in T (r = 0.61267 p = 1.987E − 06). To assess the robustness of this result, the K_{1}/k_{2} parameter vector was randomly permuted 1000 times and a GRNN was trained on each permutation using the target vector T and spread parameter equal to 0.23235. Figure 12b shows that it is difficult to separate Γ^{(−)} and Γ^{(+)} using any threshold on the output of a GRNN trained on a random permutation of the K_{1}/k_{2} parameter vector, which is reflected in the low correlation of the GRNN output with the target vector T \( \left(r=0.27615,\;p=0.05223\right) \). Out of 1000 permutations only one had correlation greater than r = 0.61, which resulted in an empirical pvalue of \( {p}_{K_1/{k}_2}=1/1000=0.001 \). Hence, the observed separation of Γ^{(−)} and Γ^{(+)} shown in Fig. 12a was probably not a random event.
These preliminary results suggest that the noninvasive monitoring of specific biological processes over time in liver tumors using PET imaging is possible. Note the K_{1}/k_{2} kinetic parameter is just one of many quantitative features that can be extracted from PET images for the supervised analysis of genomic data sets. Relating predictive signatures extracted from molecular images to global patterns of genomic, transcriptomic, epigenomic and metabolomic variation using algorithms such as JAMMIT can be referred to as “imaging genomics” [42, 64]. The central hypothesis of imaging genomics is that image features that capture variation over space and time reflect underlying genetic programs of biological and clinical relevance.
Discussion and conclusions
We have demonstrated that if the support of a dominant SOI of a big MMDS is supported by a small percentage of all measured variables, then ℓ_{1} regularization provides an efficient and powerful way to identify this sparse signature. We encoded this approach in the Joint Analysis of Many Matrices by ITeration (JAMMIT) algorithm that estimates a sparse signal model using an implementation of the LASSO that regularizes the best rank1 matrix approximation of the supermatrix that vertically “stacks” the individual data matrices of a MMDS based on the ℓ_{1} norm. By unstacking the supersignature derived by JAMMIT we obtain typespecific signatures that characterize clinically important attributes of the samples in terms of variables of one or more data types. JAMMIT compared favorably with other joint analysis algorithms in the detection of multiple SOI embedded in simulated MMDS over a wide range of SNR scenarios. Application of JAMMIT to ovarian cancer from TCGA resulted in novel, lowdimensional signatures that linked overall survival to host immune response and macrophage polarization in the tumor microenvironment. We also demonstrated that multimodal signatures composed of mRNA and methylation variables can result in predictive models of overall survival that outperform models based on unimodal signatures composed of only mRNA or DNA methylation variables alone. Finally, JAMMIT analysis of wholegenome mRNA and PET imaging data for liver cancer revealed a novel subtype of HCC with an expression signature similar to that of ICC, a tumor subtype with a much poorer clinical outcome. Pathway analysis indicated that this expression signature was associated with a pervasive downregulation of genes and pathways that regulated membrane transport of lipids, suggesting that any difference in clinical outcome between these two tumor subtypes may be due in part to membrane transport dysregulation. This particular application of JAMMIT to liver cancer also demonstrates how the analysis of a single big data matrix can be supervised by an arbitrary univariate function using ℓ_{1} regularization.
In developing the JAMMIT algorithm we encountered a number of technical issues related to the joint analysis of multiple data types that will require further study. For example, we have shown that ℓ_{1} regularization of the supermatrix that vertically stacks multiple, big data matrices of a MMDS for ovarian cancer resulted in lowdimensional, multimodal signatures that were biologically coherent and predictive of clinical outcomes. For this analysis, each data matrix was appropriately preprocessed as a function of data type, and the resulting supermatrix was scaled by its Frobenius norm. The sensitivity of JAMMITderived signatures to this frontend preprocessing procedure is an open question that will be answered more definitively in future studies. Another issue pertains to systematic variation in the data that we assume is unique to a given data type. Since JAMMIT models a dominant source of common variation that is shared across multiple data types, we expect the FDR profiles of each data type to rapidly decrease in unison as a function of increasing ℓ_{1} penalty for such a signal.. In this case, it is unlikely that the resulting signal model represents systematic variation that is by definition unique to a single data type. Alternatively, if only a single data type shows a rapidly decreasing FDR profile, then it is likely that JAMMIT is modeling a source of systematic variation that is unique to that data type. Subsequent downstream processing of the resulting typespecific signatures using pathway and ontological analysis should be able to resolve some of the ambiguity regarding the biological and/or clinical relevance of such signatures. This feature of JAMMIT to discriminate between systematic and biologically relevant sources of variation based on FDR decay will be characterized more fully in future investigations. Finally, the use of FDR to select an appropriate ℓ_{1} penalty that balances statistical significance and signature size provides researchers with considerable flexibility in model selection, but it comes with a high computational cost associated with permutation testing. Future studies should consider alternative methods of selecting an “optimal” ℓ_{1} penalty that takes into account user preferences for model parsimony, sensitivity, and specificity without the need for resampling.
This study illustrates the importance of taking a sequential approach to data reduction that incorporates biological knowledge in a computational model at the appropriate time to enable robust predictions in larger populations. For example, the use of prior biological knowledge encoded in IPA to “decompose” a given JAMMIT signature into smaller subsignatures based on significant upstream regulators was shown to result in lowdimensional signatures of clinical significance that facilitated downstream biological interpretation and validation. In general, the reduction of big, multimodal data sets to lowdimensional signatures that accurately model the clinical trajectory of cancer and other complex diseases can be realized by incorporating biological knowledge at appropriate points in the modeling process where algorithms such as JAMMIT represent just the first step of the data reduction process.
Abbreviations
 2TC model:

2Tissue Compartmental model
 AUROC:

Area Under the ROC
 BEST:

Bet on Sparsity Principle
 CCA:

Canonical Correlation Analysis
 ESM:

EigenSurvival Model
 FDR:

False Discovery Rate
 GVSD:

Generalized Singular Value Decomposition
 HCC:

HepatoCellular Carcinoma
 ICC:

Intrahepatic CholangioCarcinoma
 IPA:

Ingenuity Pathway Analysis
 JAMMIT:

Joint Analysis of Many Matrices by ITeration
 JIVE:

Joint and Individual Variation Explained
 LASSO:

Least Absolute Shrinkage and Selection Operator
 LOOCV:

LeaveOneOut CrossValidation)
 MMDS:

MutiModal Data Set
 MMSIG:

MultiModal Signature
 mRNA:

messenger RNA
 PET/CT:

Positron Emission Tomography/Computed Tomography)
 PLS:

Partial Least Squares
 ROC:

Receiver Operater Characteristic
 SNR:

SignaltoNoise Ratio
 SOI:

Signal of Interest
 TCGA:

The Cancer Genome Atlas
References
Donoho DL. HighDimensional Data Analysis: The Curses and Blessings of Dimensionality. Lecture Delivered at the “Mathematical Challenges of the 21st Century” Conference of the American Math. Los Angeles: Society; 2000. http://wwwstat.stanford.edu/donoho/Lectures/AMS2000/AMS2000.html.
Kristensen V, Lingjcerde O, Russnes H, Vollan H, Frigessi A, BorresenDale AL. Principles and methods of integrative genomic analyses in cancer. Nat Rev Cancer. 2014;14:299–313.
Network TCGA. Integrated genomic analyses of ovarian carcinoma. Nature. 2011;474(7353):609–15.
Tomczak K, Czerwinska P, Wiznerowicz M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol. 2015;19(1A):A68–77.
Storey J, Tibshirani R. Statistical significance for genomewide studies. PNAS. 2003;100(16):9440–5.
Efron B, Hastie T, Johnstone I, Tibhshirani R. Least angle regression. Ann Stat. 2004;32:407–99.
Hamid JS, Hu P, Roslin NM, Ling V, Greenwood CMT, Beyene J. Data Integration in Genetics and Genomics: Methods and Challenges. Human Genomics and Proteomics : HGP. 2009;2009:869093. doi:10.4061/2009/869093.
ICGC. International network of cancer genome projects. Nature. 2010;464:993–8.
Zhu Y, Qiu P, Ji Y. TCGAAssembler: opensource software for retrieving and processing TCGA data. Nature. 2014;11(6):599–600.
Du P, Zhang X, Huang C, Jafari N, Kibbe W, Hou L, Lin S. Comparision of Betavalue and Mvalue methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010;11:587.
Quackenbush J. Microarray data normalization and transformation. Nat Genet Supplement. 2002;32:496–501.
Friedland S. A new approach to generalized singular value decomposition. SIAM J Matrix Anal Appl. 2005;27(2):434–44.
Lock E, Hoadley K, Marron J, Nobel A. Joint and Individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat. 2013;7(1):523–42.
Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Brown P. . “Gene shaving” as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology. 2000;1(2):research0003.1–research0003.21.
West M. Bayesian factor regression models in the “large p, small n” paradigm. Bayesian Stat. 2003;7:722–32.
Kalman D. A singularly valuable decomposition: The SVD of a matrix. Coll Math J. 1996;27(1):2–23.
Strang G. Linear Algebra and Its Applications, 4th edn: Thomson Higher Education; 2006.
Zhang T, Golub G. Rankone approximation to high order tensors. SIAM J Matrix Anal Appl. 2001;23(2):534–50.
Tibhshirani R. In praise of sparsity and convexity. 50th Anniversary volume for COPSS. 2013.
Bishop C. Pattern Recognition and Machine Learning. New York: Springer; 2007.
Jolliffe I, Trendafilov N, Uddin M. A modified principal component technique based on the LASSO. J Comput Graph Stat. 2003;12(3):531–47.
Tibshirani R. Regression shrinkage and selection via the LASSO: A retrospective. J R Stat Soc Ser B. 2011;39:1335–71.
Van Deun K, Van Mechelen I, Thorrez L, Schouteden M, De Moor B, van der Werf MJ, De Lathauwer L, Smilde AK, Kiers HA. DISCOSCA and properly applied GSVD as swinging methods to find common and distinctive processes. PloS one. 2012;7(5):e37840.
Boulesteix A, Strimmer K. Partial least squares: a versatile tool for the analysis of highdimensional genomic data. Brief Bioinform. 2006;8(1):32–44.
Alter O, Brown P, Botstein D. Generalized singular value decomposition for comparative analysis of genomescale expression data sets from two different organisms. PNAS. 2003;100:3351–6.
Shen H, Huang J. Sparse principal component analysis via regularized low rank matrix approximation. J Multivar Anal. 2008;99:1015–34.
Sabatti C, Karsten S, Geschwind D. Thresholding rules for recovering a sparse signal from microarray experiments. Math Biosci. 2002;176:17–34.
Chun H, Keles S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J R Stat Soc Ser B. 2010;72(1):3–25.
Witten D, Tibshirani R, Hastie T. A penalized matrix decomposition with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10(3):515–34.
Zhang L, Liu C, Zhou X. Identifying multilayer gene regulatory modules from multidimensional genomic data. Bioinformatics. 2012;28(19):2458–66.
Hastie T, Tibhshirani R, Friedman J. The Elements of Statistical Learning. 2001.
Bieze M, Klumpen H, Verheij J, Beuers U, Phoa S, van Gulik T, Bennink R. Diagnostic accuracy of (18)Fmethylcholine positron emission tomogrpahy/computed tomography for intra and extrahepatic hepatocellular carcinoma. Hepatology. 2014;59(3):996–1006.
Talbot J, Fartoux L, Balogova S, Nataf V, Kerrou K, Gutman F, Huchet V, Ancel D, Grange J, Rosmorduc O. Detection of hepatocellular carcinoma with PET/CT: a prospective comparison of 18 Ffluorocholine and 18 FFDG in patients with cirrhosis or chronic liver disease. J Nucl Med. 2010;51(11):1699–706.
Bentourkia M, Zaidr H. Tracer kinetic modeling in PET. PET Clin. 2007;2(2):267–77.
Watabe H, Ikoma Y, Kimura Y, Nakagawa M, Shidahara M. PET kinetic analysis  compartmental model. Ann Nucl Med. 2006;20(9):583–8.
Lin SM, Du P, Huber W, Kibbe WA. Modelbased variancestabilizing transformation for Illumina microarray data. Nucleic Acids Res. 2008;36(2):e11.
Bair E, Tibshirani R. Semisupervised methods to predict patient survival from gene expression data. PLoS Biol. 2004;2(4):E108.
Shen Y, Huang S. Improve survival prediction using principal components of gene expression data. Genomics Proteomics Bioinformatics. 2006;4(2):110–9.
Zhang M, He Y, Sun X, Li Q, Wang W, Zhao A, Di W. A high M1/M2 ratio of tumorassociated macrophages is associated with extended survival in ovarian cancer patients. J Ovarian Res. 2014;7:19.
Solinas G, Germano G, Mantovani A, Allavena P. Tumorassociated macrophages (TAM) as major players of the cancerrelated inflammation. J Leukoc Biol. 2009;86(5):1065–73.
Moisan F, Francisco E, Brozovic A, Duran G, Wang Y, Chaturvedi S, Seetharam S, Snyder L, Doshi P, Sikic B. Enhancement of paclitaxel and carboplatin therapies by CCL2 blockade in ovarian cancers. Mol Oncol. 2014;8:1231–9.
Gillies R, Anderson A, Gatenby R, Morse D. The biology underlying molecular imaging in oncology: From genome to anatome and back again. Clin Radiol. 2010;65(7):517–21.
Segal E, Sirlin C, Ooi C, Adler A, Gollub J, Chen X, Chan B, Matcuk G, Barry C, Chang H, et al. Decoding gobal gene expression programs in liver cancer by noninvasive imaging. Nat Biotechnol. 2007;25(6):675–80.
Coulouarn C, Cavard C, RubblaBrandt L, Audenbourg A, Dumont F, Jacques S, Just PA, Clement B, Gilgenkrantz H, Perret C, et al. Combined hepatocellularcholangiocarcinomas exhibit progenitor features and activation of wnt and TGFB signaling pathways. Carcinogenesis. 2012;33(9):1791–6.
Woo H, Lee J, Kim C, Lee H, Jang J, Yi N, Suh K, Lee K, Park E, Thorgeirsson S, et al. Identification of a cholangiocarcinomalike gene expression trait in hepatocellular carcinoma. Cancer Res. 2010;70(8):3034–41.
Walesky C, Apte U. Role of hepatocyte nuclear factor 4 alpha (HNF4A) in cell proliferation and cancer. Gene Expr. 2015;16(3):101–8.
Walesky C, Edwards G, Borude P, Gunewardena S, O'Neil M, Yoo B, Apte U. Hepatocyte nuclear factor 4 alpha deletion promotes diethylnitrosamineinduced hepatocellular carcinoma in mice. Hepatology. 2013;57(6):2480–90.
Pelletier L, Rebouissou S, Paris A, RathahaoParis E, Perdu E, BioulacSage P, Imbeaud S, ZucmanRossi J. Loss of hepatocyte nuclear factor 1alpha function in human hepatocellular adenomas leads to aberrant activation of signaling pathways involved in tumorigenesis. Hepatology. 2010;51(2):557–66.
Yang F, Huang X, Yi T, Yen Y, Moore D, Huang W. Spontaneous development of liver tumors in the absence of the bile acid receptor Farnesoid X Receptor. Cancer Res. 2007;67:863–7.
Wolf A, Thomas A, Edwards G, Jaseja R, Guo GL, Apte U. Increased activation of the Wnt/betacatenin pathway in spontaneous hepatocellular carcinoma observed in farnesoid X receptor knockout mice. J Pharmacol Exp Ther. 2011;338:12–21.
Keitel V, Reinehr R, Reich M, Sommerfeld A, Cupisti K, Knoefel W. The membranebound bile acid receptor TGR5 (GPBAR1) is highly expressed in intrahepatic cholangiocarcinoma. Hepatology. 2011;54:869.
Halilbasic E, Claudel T, Trauner M. Bile acid transporters and regulatory nuclear receptors in the liver and beyond. J Hepatol. 2013;58:155–68.
Lautem A, Heise M, Grasel A, HoppeLotichius M, Weiler N, Foltys D, Knapstien J, Schattenberg J, Schad A, Zimmermann A, et al. Downregulation of organic cation transporter 1 (SLC22A1) is associated with tumor progression. Int J Oncol. 2013;42:1297–304.
Demidenko R, Razanauskas D, Daniunaite K, Lazutka J, Jankevicius F, Jarmalaite S. Frequent downregulation of ABC transporter genes in prostate cancer. BMC Cancer. 2015;15:683.
Chen Y, Song X, Valanejad L, Vasilenko A, More V, Qiu X, Chen W, Lai Y, Slitt A, Stoner M, et al. Bile salt export pump is dysregulated with altered farnesoid X receptor isoform expression in patients with hepatocellular carcinoma. Hepatology. 2013;57(4):1530–41.
Schaeffeler E, Hellerbrand C, Nies A, Winter S, Kruck S, Hofmann U, van der Kuip H, Zanger U, Koepsell H, Schwab M. DNA methylation is associated with downregulation of the organic cation transporter OCT1 (SLC22A1) in human hepatocellular carcinoma. Genome Med. 2011;3:82.
Gupta N, Miyauchi S, Martindale R, Herdman A, Podolsky R, Miyake K, Mager K, Mager S, Prasad P, Ganapathy M, et al. Upregulation of the amino acid transporter ATB),+(SLC6A14) in colorectal cancer and metastasis in humans. Biochim Biophys Acta. 2005;1741(1–2):215–23.
Bhutia Y, Babu E, Prasad P, Ganapahty V. The amino acid transporter SLC6A14 in cancer and its potential use in chemotherapy. Asian J Pharm Sci. 2014;9:293–303.
Lambin P, RiosVelazquez E, Leijenaar R, Carvalho S, van Stiphout R, Granton P, Zegers C, Gilles R, Boellard R, Dekker A, et al. Radiomics: extracting more information from medical images using advanced feature analysis. Eur J Cancer. 2012;48(4):441–6.
Kumar V, Gu Y, Basu S, Berglund A, Eschrich S, Schabath M, Forster K, Aerts H, Dekker A, Fenstermacher D, et al. Radiomics: the process and the challenges. Magn Reson Imaging. 2012;30(9):1234–48.
Wasserman P. Advanced Methods in Neural Computing. New Yourk: Van Nostrand Reinhold; 1993.
Donoho D. Denoising by softthresholding. IEEE Trans Inf Theory. 1995;41(3):613–27.
Donoho D, Johnstone I. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81:425–55.
Aerts H, Velazquez E, Leijenaar R, Parmar C, Grossmann P, Cavslho S, Bussink J, Monshouwer R, HaibeKains B, Rietveld D, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun. 2014;5:4006.
Okimoto GS. Data and code in support of the JAMMIT paper in BioData Mining. Retrieved from osf.io/2s3zd. 2016.
Acknowledgments
We thank the staff of the Pathology Shared Resource (PSR) and Genomic Shared Resource (GSR) of the University of Hawaii Cancer Center for their support of tissue and genomic data collection for this study. The PSR and GSR are supported in part by NCI P30 CA071789. We also thank the staff of The Hamamatsu/Queen’s PET Imaging Center (PIC) for supporting the image acquisition and analysis for this study. The PIC is supported NCI R01 CA16120904.
Funding
Study design, tissue and data acquisition, data analysis, and interpretation of results were supported by the following grants: ARRA grant NCI P30 CA07178912S7; NCI R01 CA16120904; and NCI P30 CA071789.
Availability of data and materials
All MATLAB code (version R2013b) and data used to generate the results of this study are publicly available at Open Science Framework (DOI 10.17605/OSF.IO/JUAB9) [65]. Please address any questions and/or comments regarding the data and/or code to the corresponding author.
Authors’ contributions
GO, AZ, TW, and SK conceived and designed experiments and simulations. GO, AZ, TW, JBN, TF, MT, SK performed the experiments and simulations. GO, MT, ML, BH,.OC, LW, SK were involved in data acquisition, storage, and management. GO, AZ, TW, JBN, TF, SK collaborated on data analysis and interpretation of results. GO, AZ, TW, JBN, SK helped to write the paper. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
The results of the ovarian cancer example were based on genomics and survival data downloaded from TCGA Research Network, which precluded the need for Institutional Review Board (IRB) approval. Written informed consent was obtained from all patients included in the imaginggenomics study in accordance with a clinical research protocol approved by the Queen’s Medical Center IRB that adhered to the ethical guidelines of the 1975 Declaration of Helsinki and subsequent amendments.
Author information
Affiliations
Corresponding author
Additional files
Additional file 1:
Estimating FDR profiles on a grid of ℓ_{1} penalties. (DOCX 59 kb)
Additional file 2:
Generation of simulated MMDS. (DOCX 86 kb)
Additional file 3:
Eigensurvival modeling of JAMMIT signatures. (DOCX 42 kb)
Additional file 4:
FDR profile of a JAMMIT analysis of multimodal data for ovarian cancer from TCGA. This table summarizes the relationship between ℓ_{1} penalties and FDR that is estimated based on 100 permutations of the supermatrix of a MMDS for ovarian cancer that integrates wholegenome mRNA, miRNA and DNA methylation data obtained from 291 patients with stage3 disease. Note the FDR profiles for each data type (columns 4, 6, and 8) are decreasing towards smaller values indicating that all 3 data types contribute to some degree to a “sparse” linear model of the SOI, with mRNA contributing the most in terms of FDR. In particular, row 19 (in red) is highlighted since it corresponds to a FDR for mRNA of 0.0034619 that is a local minimum of column 4. This FDR value is associated with an ℓ_{1} penalty of 0.002875 that results in a mRNA signature composed of 643 genes (FDR=0.0034619), a miRNA signature of 368 miRNAs (FDR=0.19912), a methylation signature of 450 methylation loci (FDR=0.03038), and a multimodal signature composed of a 1461 variables (FDR=0.067647). (DOCX 20 kb)
Additional file 5:
FDR profile for analysis of wholegenome gene expression data supervised by the K_{1}/k_{2} PET parameter. Note the K_{1}/k_{2} PET parameter (column 5) is selected for inclusion in the sparse linear model of the SOI for most ℓ_{1} penalties with FDR values of zero. Moreover, the FDR profile for genes (column 4) is rapidly decreasing indicating a strong signature for gene expression. These results taken together suggest that the K_{1}/k_{2} parameter is associated with gene expression via the sparse linear model for the SOI. In particular, row 12 (highlighted in red) corresponds to a FDR for mRNA of 0.00054949 that is a local minimum of column 4. This FDR value is associated with a ℓ_{1} penalty of 0.0089429 that results in a mRNA signature composed of 652 genes. (DOCX 19 kb)
Additional file 6:
FDR profile for analysis of wholegenome expression data supervised by the K_{1} PET parameter. This FDR profile indicates a lack of correlation between global gene expression and the K_{1} PET kinetic parameter. Note that the K_{1} PET parameter (column 5) is NOT selected for inclusion in the model of the SOI for all but the first ℓ_{1} penalty value (see row 1) with FDR values of 1.0. This result is in sharp contrast to the FDR profile for gene expression (column 4) where the FDR values rapidly decrease to small values. This result suggests that although there is a strong signal in the mRNA data matrix that contributes to the common SOI, this signal is not correlated with the K_{1} PET parameter. (DOCX 19 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Okimoto, G., Zeinalzadeh, A., Wenska, T. et al. Joint analysis of multiple highdimensional data types using sparse matrix approximations of rank1 with applications to ovarian and liver cancer. BioData Mining 9, 24 (2016). https://doi.org/10.1186/s1304001601037
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1304001601037
Keywords
 Generalized singular value decomposition
 Joint data analysis
 Ovarian cancer
 Hepatocellular carcinoma
 The Cancer Genome Atlas
 LASSO
 Sparse signal detection