Datasets
In this study, multi-omics ovarian cancer data from TCGA were used for training, and three datasets from GEO served as independent tests. The details of these four datasets are as follows:
TCGA dataset
We downloaded the multi-omics ovarian cancer data from the TCGA public data portal (https://portal.gdc.cancer.gov). The R package TCGA-Assembler 2 [13] was used for data collection, and we obtained 298 samples comprising three types of omics data: mRNA-seq data (UNC Illumina HiSeq_RNASeq V2), miRNA-seq data (BCGSC Illumina HiSeq) and copy number variation (CNV) data (BROAD-MIT Genome wide SNP_6). All of these data were obtained from TCGA level 3. The CNV feature of a gene was computed by averaging the copy numbers of all CNV variations on that gene. Features and samples with more than 20% missing values were excluded. For the remaining data, missing values were imputed with median values using the R package “imputeMissings” [14].
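The filtering and imputation steps above can be sketched as follows; the paper uses the R package “imputeMissings”, so this pandas version is only an illustrative analogue on synthetic data:

```python
import numpy as np
import pandas as pd

# Illustrative preprocessing sketch (synthetic stand-in for the omics matrix):
# drop features/samples with >20% missing values, then median-impute the rest.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 20)))
df.iloc[:, 0] = np.nan        # a feature missing in every sample
df.iloc[0, 1:4] = np.nan      # a few scattered missing values

df = df.loc[:, df.isna().mean() <= 0.2]      # drop features with >20% missing
df = df.loc[df.isna().mean(axis=1) <= 0.2]   # drop samples with >20% missing
df = df.fillna(df.median())                  # median imputation per feature

print(df.shape)               # the all-missing feature has been removed
print(df.isna().sum().sum())  # → 0, no missing values remain
```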
Test datasets
From GSE26712 we downloaded the RNA-seq data and clinical information of 185 ovarian cancer patients shared by the Surgical Oncology Research Lab of Boston, and from GSE32062 we obtained 260 ovarian cancer samples provided by the Department of Obstetrics and Gynecology, Niigata University. GSE53963 contains mRNA information for 174 samples from the UCLA School of Medicine. All of these test datasets can be downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov).
The architecture of proposed deep learning framework
Figure 1 shows the architecture of the proposed deep learning framework. First, the multi-omics ovarian cancer features x (mRNA, miRNA and CNV) are fed into the DAE to generate the low-dimensional representation z. The representation z is then used to cluster the patients with k-means. Based on the subtypes clustered by k-means, we further built a lightweight logistic regression classification model on mRNA expression data to reduce the number of features required for patient classification. The code of this deep learning framework is available at https://github.com/Hua0113/DAE_km.
Denoising autoencoder for dimensionality reduction
The autoencoder (AE) is a deep neural network trained to copy its input to its output, such that the bottleneck layer z can be regarded as a representation of the input features. An AE consists of two parts, the encoder z = fe(x) and the decoder x′ = fd(z), and its loss function can be expressed as:
$$ {l}_{AE}={\left\Vert x-{x}^{\prime}\right\Vert}_2^2={\left\Vert x-{f}_d\left({f}_e(x)\right)\right\Vert}_2^2 $$
(1)
Different from the traditional AE, the DAE constructs partially corrupted data by adding noise to the input and then restores it to the original input through encoding and decoding, which gives the network the ability to identify useful features. The corrupted input \( \overset{\sim }{x} \) can be expressed as:
$$ \overset{\sim }{x}\sim {q}_{\mathcal{D}}\left(\overset{\sim }{x}|x\right) $$
(2)
where \( {q}_{\mathcal{D}} \) denotes a stochastic corruption mapping. The corrupted input \( \overset{\sim }{x} \) is then encoded and decoded by the same process as in the standard autoencoder. The loss of the DAE is written as:
$$ {l}_{DAE}={\left\Vert x-{x}^{\prime}\right\Vert}_2^2={\left\Vert x-{f}_d\left({f}_e\left(\overset{\sim }{x}\right)\right)\right\Vert}_2^2 $$
(3)
In our study, the DAE is a seven-layer deep neural network (input, output, and five hidden layers); the five hidden layers contain 200, 50, 2, 50 and 200 nodes, respectively. Each layer uses tanh as the nonlinear activation function, and the DAE was trained by back-propagation with the Adam optimizer. The learning rate was set to 0.001, the batch size to 256 and the number of epochs to 100. These parameters were selected to maximize the silhouette score on the OV data.
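The DAE described above (layer sizes 200-50-2-50-200, tanh activations, Adam with learning rate 0.001) can be sketched as follows. This is a minimal PyTorch illustration, not the authors' released code; the input dimension, training data and Gaussian corruption are placeholder assumptions, and only a few epochs are run here:

```python
import torch
import torch.nn as nn

class DAE(nn.Module):
    """Denoising autoencoder with the hidden sizes reported in the paper."""
    def __init__(self, n_features):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 200), nn.Tanh(),
            nn.Linear(200, 50), nn.Tanh(),
            nn.Linear(50, 2), nn.Tanh(),      # 2-node bottleneck z
        )
        self.decoder = nn.Sequential(
            nn.Linear(2, 50), nn.Tanh(),
            nn.Linear(50, 200), nn.Tanh(),
            nn.Linear(200, n_features),
        )

    def forward(self, x_tilde):
        z = self.encoder(x_tilde)
        return self.decoder(z), z

torch.manual_seed(0)
x = torch.rand(256, 1000)                 # placeholder multi-omics matrix
model = DAE(n_features=1000)
opt = torch.optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):                    # paper: 100 epochs, batch size 256
    x_tilde = x + 0.1 * torch.randn_like(x)   # assumed Gaussian corruption q_D
    x_hat, z = model(x_tilde)
    loss = ((x - x_hat) ** 2).mean()      # Eq. (3): reconstruct the clean x
    opt.zero_grad()
    loss.backward()
    opt.step()

print(z.shape)  # low-dimensional representation passed on to k-means
```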
K-means clustering using reconstructed features
The DAE constructs low-dimensional features of the multi-omics data in its bottleneck layer. After obtaining these features, the k-means method was used to cluster the ovarian cancer subtypes. We determined the optimal number of clusters with the silhouette score [15], testing k from 2 to 8 and setting k = 2, which achieved the highest silhouette score.
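The cluster-number selection above can be sketched with scikit-learn: run k-means for k = 2..8 on the bottleneck features and keep the k with the best silhouette score. The 2-D matrix `z` below is synthetic stand-in data with two well-separated groups, not the real DAE output:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic 2-D "bottleneck" features: two well-separated groups.
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(0.0, 0.3, size=(150, 2)),
               rng.normal(3.0, 0.3, size=(150, 2))])

# Silhouette score for each candidate number of clusters k in [2, 8].
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(z)
    scores[k] = silhouette_score(z, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # → 2 for this synthetic two-cluster example
```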
Logistic regression method for subtypes classification
After obtaining the labels clustered by k-means, we built a lightweight mRNA model with the logistic regression algorithm to reduce the number of genes needed to identify cancer subtypes. Here the mRNA omics data serve as the features X, and the subtypes clustered by our DAE-kmeans framework serve as the labels Y. Let \( {x}_i^{mrna}\in X \) denote the mRNA features of patient i, yi ∈ Y the subtype of patient i (low = 0, high = 1), and β the coefficient vector; the logistic regression can then be expressed as
$$ p\left({y}_i=1\right)=\frac{\exp \left({\beta}_0+\beta {x}_i^{mrna}\right)}{1+\exp \left({\beta}_0+\beta {x}_i^{mrna}\right)} $$
(4)
where p(yi = 1) denotes the probability that patient i belongs to the high-risk group. The log-likelihood function of the logistic model is written as:
$$ l\left(\beta \right)=\sum \limits_{i=1}^n\left\{{y}_i\ln \left({p}_i\right)+\left(1-{y}_i\right)\ln \left(1-{p}_i\right)\right\} $$
(5)
Many different regularizations are used to improve the generalization ability of a model [16, 17]. To reduce the number of features in the classification model, L1 regularization is combined with the logistic regression method:
$$ \beta =\arg \max \left[\sum \limits_{i=1}^n\left\{{y}_i\ln \left({p}_i\right)+\left(1-{y}_i\right)\ln \left(1-{p}_i\right)\right\}-\lambda g\left(\beta \right)\right] $$
(6)
All TCGA samples were used as training data, and the classification model obtained with the logistic regression method was evaluated on three ovarian cancer mRNA datasets from GEO as independent tests.
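The L1-penalized logistic regression of Eq. (6) can be sketched with scikit-learn; the L1 penalty drives most coefficients to zero, which is what reduces the number of genes needed. The data below are synthetic stand-ins for the mRNA matrix and subtype labels, and the penalty strength C is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 298 "patients" x 50 "genes", with only the first
# two genes actually driving the subtype label.
rng = np.random.default_rng(0)
X = rng.normal(size=(298, 50))
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=298) > 0).astype(int)

# L1-regularized logistic regression (Eq. 6 with g(beta) = ||beta||_1).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

n_selected = int(np.sum(clf.coef_ != 0))  # genes kept by the L1 penalty
print(n_selected, "of", X.shape[1], "features retained")
```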
Evaluations of ovarian cancer subtypes identification
We compared the performance of different clustering methods: k-means, hierarchical clustering, k-means on features reconstructed by PCA (PCA-kmeans), SparseK, iCluster, k-means on features reconstructed by KPCA (KPCA-kmeans), k-means on features reconstructed by AE (AE-kmeans) and DAE-kmeans. The silhouette score was used to measure clustering performance, and the log-rank p-value to measure the differences between subtypes: a higher silhouette score indicates better clustering, and a lower log-rank p-value indicates greater differences between cancer subtypes.
Functional analysis
Based on the identified subtypes of ovarian cancer, differentially expressed gene (DEG) analysis was performed with the R package DESeq2 [18]; genes with corrected p-values < 0.05 and |log2 fold change| ≥ 1 were regarded as DEGs. We also applied WGCNA with the R package WGCNA [19] to identify functional modules and genes related to the ovarian cancer subtypes. In the WGCNA analysis, we selected the unsigned network, set the minimum number of genes per module to 30, and set the height cut-off used to merge similar modules to 0.25. Genes in each module with a relevance score above 0.5 were defined as hub genes (HGs). Finally, genes belonging to both the DEGs and the HGs were regarded as candidate genes highly related to ovarian cancer, and their enriched pathways were obtained with the KOBAS online tool [20].