### Datasets

In this study, we collected eight independent colorectal cancer datasets [16]: GSE13067, GSE13294, GSE14333, GSE17536, GSE20916, GSE2109, GSE37892, and GSE39582 (Supplementary Table S1). All these datasets can be downloaded from the official repository of an international CRC subtyping consortium on Synapse (https://www.synapse.org/#!Synapse:syn2623706/wiki/) (downloaded on 1 Oct 2020).

### Gene expression analysis

Since the tumor microenvironment makes an important contribution to gene expression [1], we identify all genes whose expression values discriminate among the cancer subtypes (namely, subtype-specific genes).

In this study, we focus on the well-established consensus molecular subtypes (CMS), i.e., CMS1-CMS4. Each subtype is characterized by its own distinct feature: CMS1 (microsatellite instability immune), CMS2 (canonical), CMS3 (metabolic), and CMS4 (mesenchymal); the definitions can be found in [16]. Such high-dimensional data often strains computational resources (e.g., computer memory). Therefore, the primary task is to select meaningful features from the high-dimensional gene expression data. Since there are four consensus molecular subtypes in this study, we formed six pairwise comparison groups: CMS1 vs CMS2, CMS1 vs CMS3, CMS1 vs CMS4, CMS2 vs CMS3, CMS2 vs CMS4, and CMS3 vs CMS4. The subtype-specific genes are identified based on the fold-change and the adjusted *P* value (also called the *Q* value). Mathematically, the fold-change is defined as follows:

$$ \text{fold-change}=\frac{\bar{A}}{\bar{B}} $$

(1)

where *A* and *B* denote the expression values of the same gene in two different subtypes, \(\bar{A}\) is the average value of *A*, and \(\bar{B}\) is the average value of *B*. In practice, the \(\log_{2}\) fold-change is widely used to normalize the range of the fold-change. The \(\log_{2}\) fold-change is defined as follows:

$$ \log_{2}\text{fold-change}=\log_{2}\bar{A}-\log_{2}\bar{B} $$

(2)

However, the fold-change has a drawback: it ignores the within-group variance of expression, so genes with large absolute differences may be misclassified as differentially expressed, and changes at high expression levels may be poorly identified.

To address this, the *P* value is used as a complementary criterion: it gives the minimum significance level at which the null hypothesis would be rejected. The *P* value we used is adjusted by the Benjamini-Hochberg (BH) procedure [17], yielding the *Q* value. In this study, we use the *t*-test to compute every gene's *P* value. More formally, the *t*-statistic is defined as follows:

$$ T=\frac{\bar{A}-\bar{B}}{\sqrt{S^{2}_{A}/n+S^{2}_{B}/n}} $$

(3)

where \(S^{2}_{A}\) and \(S^{2}_{B}\) are the sample variances of *A* and *B*, and *n* is the number of samples per subtype. Then the significance *P* value is calculated according to the *T* distribution to measure the significance of this difference. Next, we apply the BH procedure to obtain a *Q* value. More formally, for *m* tested genes with ordered *P* values \(P_{(1)}\leq \dots \leq P_{(m)}\), the *Q* value in this study is defined as follows:

$$ Q_{(i)}=\min_{j\geq i}\left\{\min\left(\frac{m}{j}P_{(j)},\,1\right)\right\} $$

(4)

After that, we calculated the fold-change and *Q* value for each CMS pair. Genes with \(|\log_{2}\text{fold-change}|>1\) and *Q* value < 0.05 are retained as subtype-specific genes. Finally, those subtype-specific genes are fed into the deep learning model to construct our DeepCSD model.
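The selection pipeline above (fold-change, *t*-test, BH adjustment, thresholding) can be sketched as follows for one CMS pair. This is a minimal sketch, not the paper's code: the function names are illustrative, and the expression matrices are assumed to be already \(\log_{2}\)-transformed, so the mean difference is the \(\log_{2}\) fold-change of Eq. (2).

```python
import numpy as np
from scipy import stats

def bh_adjust(pvals):
    """Benjamini-Hochberg adjustment: turn raw P values into Q values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity from the largest P value downwards
    q = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(q, 0.0, 1.0)
    return out

def subtype_specific_genes(expr_a, expr_b, lfc_cut=1.0, q_cut=0.05):
    """expr_a, expr_b: (samples, genes) log2 expression for one CMS pair."""
    log2_fc = expr_a.mean(axis=0) - expr_b.mean(axis=0)   # Eq. (2) on log2 data
    _, pvals = stats.ttest_ind(expr_a, expr_b, axis=0)    # Eq. (3)
    qvals = bh_adjust(pvals)
    return np.where((np.abs(log2_fc) > lfc_cut) & (qvals < q_cut))[0]
```

Running this for all six CMS pairs and taking the union of the returned gene indices would yield the subtype-specific gene set.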

### Deep learning for cancer subtype diagnosis (DeepCSD)

Our deep neural network (DNN) is a feed-forward network assembled as a sequential, layer-by-layer structure that realizes a series of functional transformations. Each layer takes the previous layer's outputs as its inputs and executes its own transformation. Specifically, every layer consists of multiple “neurons”, and the input data passes through every layer via weighted “connections” between neurons. The basic layers in DeepCSD are the representation layers and the dropout layers (as shown in Fig. 1). DeepCSD alternates representation layers with nonlinear activation functions (i.e., ReLU) and dropout layers, followed by a representation layer with a softmax activation function that computes the probability of each CMS (more detail can be seen in Fig. 1). Mathematically, it is recursively defined as follows:

$$ f^{l}(X)=\alpha_{l}\left(W_{l}R_{l-1}f^{l-1}(X)-\theta_{l}\right),\quad l=1,2,3 $$

(5)

where *X* is the model's input, *f*^{l−1}(*X*) is the (*l*−1)-th layer's output, *θ*_{l} is the *l*-th layer's threshold, *R*_{l−1} is the dropout mask, and *α*_{l} is the *l*-th layer's activation function. It is worth noticing that *R*_{0} does not exist; in other words, no dropout layer precedes DeepCSD's first layer.
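Under this architecture (ReLU on the hidden representation layers, softmax on the output, dropout masks *R*_{1} and *R*_{2} but no *R*_{0}), the recursion of Eq. (5) can be sketched in NumPy as follows; the layer widths in the usage below are illustrative, not the paper's actual dimensions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def deepcsd_forward(x, params, masks=None):
    """Eq. (5): f^l(X) = alpha_l(W_l R_{l-1} f^{l-1}(X) - theta_l), l = 1, 2, 3.

    params: list of (W_l, theta_l) pairs; masks: dropout masks (R_1, R_2) or
    None at inference time. R_0 does not exist, so no mask precedes layer 1."""
    h = x
    for l, (W, theta) in enumerate(params, start=1):
        if l > 1 and masks is not None:
            h = masks[l - 2] * h                    # apply R_{l-1}
        z = h @ W.T - theta
        h = softmax(z) if l == len(params) else relu(z)
    return h
```

The final layer's output is a probability vector over the four CMS classes, each row summing to 1.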

The natural approach is to process the original inputs layer by layer to obtain “deep features”: every layer in the deep learning framework extracts deeper features from the previous layer's outputs, and the more layers the DNN has, the deeper the features the model can extract. In theory, adding layers and parameters increases the model's expressiveness for addressing complex learning tasks; however, the risk of overfitting also increases [10]. To reduce the influence of overfitting, *L*_{1}/*L*_{2} regularization and dropout layers are employed in our model. The dropout layer randomly discards a fraction of neurons to mitigate overfitting. Briefly, a binary mask *R* is first sampled by Bernoulli trials with probability *p*:

$$ R=\mathrm{Bernoulli}(p) $$

(6)

and then applied to the layer input:

$$ f^{'}(x)=R*f(x) $$

(7)

where *R* is a binary mask vector generated by Bernoulli trials with probability *p*, *f*(*x*) is the input of the dropout layer, and \(f^{'}(x)\) is its output [18].

After that, DeepCSD applies *L*_{1} and *L*_{2} regularization to representation layers 1 and 2 by adding a penalty term to the empirical loss. It is worth noting that the output layer in DeepCSD does not employ such regularization.

### Training of the DeepCSD model

For this multi-class classification task, we minimize the objective function of DeepCSD, defined as follows:

$$ \min\;\;loss=-\sum_{i=0}^{n}\sum_{t=0}^{3}y_{i,t}\ln(y_{i,t}^{'})+\frac{\lambda }{2}||W||_{1}+\frac{\lambda }{2}||W||_{2} $$

(8)

where *n* is the number of samples, \(y_{i,t}^{'}\) is the *t*-th output neuron for the *i*-th sample, *y*_{i,t} is the corresponding one-hot label entry, *λ* is the regularization coefficient, \(\frac {\lambda }{2}||W||_{1}\) is the *L*_{1} term, and \(\frac {\lambda }{2}||W||_{2}\) is the *L*_{2} term. With one-hot label vector encoding, the objective function can be simplified as follows:

$$ \begin{array}{c} \min\;\; loss=-\sum\limits_{i=0}^{n}\ln\left(y_{i,t}^{*}\right)+\frac{\lambda }{2}\left\Vert W\right\Vert_{2}\\ \text{subject to}\;\; y_{i,t}=1,\ \sum\limits_{j=0}^{3}y_{i,j}=1 \end{array} $$

(9)

where \(y_{i,t}^{*}\) denotes the output at the index *t* of the element 1 in the one-hot label vector of the *i*-th sample, i.e., the predicted probability of the true class.
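The simplified objective can be sketched in NumPy as follows. This is a minimal sketch, not the framework implementation: the function name, the *λ* value, and the choice of which weight matrices are passed in (layers 1 and 2, with the output layer excluded per the text above) are illustrative, and the cross-entropy carries the conventional minus sign.

```python
import numpy as np

def deepcsd_loss(probs, labels, weights, lam=1e-3):
    """Eq. (9)-style loss: cross-entropy on one-hot labels plus penalties.

    probs:   (n, 4) predicted class probabilities y'
    labels:  (n,)   true class indices t (equivalent to one-hot vectors)
    weights: weight matrices to regularize (output layer excluded)"""
    n = probs.shape[0]
    ce = -np.log(probs[np.arange(n), labels]).sum()      # -sum ln(y*_{i,t})
    l1 = sum(np.abs(W).sum() for W in weights)           # L1 term
    l2 = sum((W ** 2).sum() for W in weights)            # L2 term
    return ce + lam / 2 * l1 + lam / 2 * l2
```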

The Adam algorithm [19] was chosen as our model's optimizer. The update rule is defined as follows:

$$ \theta_{t}=\theta_{t-1}-\alpha\, \frac{m_{t}/(1-\beta_{1}^{t})}{\sqrt{v_{t}/(1-\beta_{2}^{t})}+\varepsilon} $$

(10)

where

$$ m_{t}=\beta_{1}m_{t-1}+(1-\beta_{1})g_{t} $$

(11)

$$ v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})g_{t}^{2} $$

(12)

$$ g_{t}=\nabla_{\theta} f_{t}(\theta_{t-1}) $$

(13)
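Eqs. (10)–(13) can be written out directly as a single update step. This is a one-parameter sketch, not the framework implementation; the defaults β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸ are the standard settings of [19].

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter theta given gradient grad (Eqs. (10)-(13))."""
    m = beta1 * m + (1 - beta1) * grad                      # Eq. (11)
    v = beta2 * v + (1 - beta2) * grad ** 2                 # Eq. (12)
    m_hat = m / (1 - beta1 ** t)                            # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (10)
    return theta, m, v
```

For example, iterating this step with the gradient *g*_{t} = 2*θ*_{t−1} drives the quadratic *f*(*θ*) = *θ*² toward its minimum at zero.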

We can see that once the gradient *g*_{t} is computed, the Adam algorithm automatically updates the corresponding parameter. Next, we take the update of *W*_{3} as an example to explain our model's update process in detail. Mathematically, the output is defined as follows:

$$ y_{i,t}^{*}=\mathrm{softmax}\left(W_{3}R_{2}f^{2}(X_{i})-\theta_{3}\right)_{t} $$

(14)

We focus on the parameter update of *W*_{3}: the Adam algorithm adjusts *W*_{3} once its *g*_{t} is computed. The *g*_{t} for *W*_{3} is calculated as follows:

$$ g_{t}=\frac{\partial loss}{\partial W_{3}} $$

(15)

where

$$ \frac{\partial loss}{\partial W_{3}}= -\frac{\partial \ln y_{i,t}^{*}}{\partial y_{i,t}^{*}}\,\frac{\partial y_{i,t}^{*}}{\partial (W_{3}R_{2}f^{2}(x)-\theta_{3})}\,\frac{\partial (W_{3}R_{2}f^{2}(x)-\theta_{3})}{\partial W_{3}} $$

(16)

$$ \frac{\partial loss}{\partial W_{3}}= -y_{i,t}^{*-1}\,\frac{\partial y_{i,t}^{*}}{\partial (W_{3}R_{2}f^{2}(x)-\theta_{3})}\,R_{2}f^{2}(x) $$

(17)

Next, we focus on the partial derivative of the activation function with respect to (*W*_{3}*R*_{2}*f*^{2}(*x*)−*θ*_{3}). First, the partial derivative of the softmax activation with respect to the *k*-th neuron is calculated as follows:

$$ \frac{\partial y_{i,t}^{*}}{\partial k}= \frac{\frac{\partial e^{i,t}}{\partial k}\sum_{j=0}^{3}e^{i,j}-e^{i,t}\sum_{j=0}^{3}\frac{\partial e^{i,j}}{\partial k}}{\left(\sum_{j=0}^{3}e^{i,j}\right)^{2}} $$

(18)

This yields two different results depending on the value of *k*, i.e.

$$ \frac{\partial y_{i,t}^{*}}{\partial k}= \left\{ \begin{array}{ll} \frac{e^{i,t}\sum_{j=0}^{3}e^{i,j}-(e^{i,t})^{2}}{(\sum_{j=0}^{3}e^{i,j})^{2}}=y_{i,t}^{*}(1-y_{i,t}^{*}) & if(t=k)\\ -\frac{e^{i,t}e^{i,k}}{(\sum_{j=0}^{3}e^{i,j})^{2}}=-y_{i,t}^{*}y_{i,k}^{*} & if(t\neq k) \end{array}\right. $$

(19)

Therefore, we write *W*_{3} as (*w*_{1},*w*_{2},*w*_{3},*w*_{4})^{T} (expanding the matrix *W*_{3} by rows). Then the partial derivative of the activation function with respect to (*W*_{3}*R*_{2}*f*^{2}(*x*)−*θ*_{3}) splits into two cases:

$$ \frac{\partial y_{i,t}^{*}}{\partial (w_{k}R_{2}f^{2}(x)-\theta_{3,k})}= \left\{ \begin{array}{ll} y_{i,t}^{*}(1-y_{i,t}^{*}) & if(k=t)\\ -y_{i,t}^{*}y_{i,k}^{*} & if(k\neq t) \end{array} \right. $$

(20)

The final *g*_{t} is given by the following formula:

$$ \frac{\partial loss}{\partial w_{k}}= \left\{ \begin{array}{ll} -(1-y_{i,t}^{*})R_{2}f^{2}(x)& if(k=t)\\ y_{i,k}^{*}R_{2}f^{2}(x)& if(k\neq t) \end{array} \right. $$

(21)

Stacking these column vectors in order yields *g*_{t}. Here \(y_{i,t}^{*}\) is the output of the neuron whose position matches the true label, i.e., the predicted probability of the true class. We can conclude that *w*_{k} will be strengthened if *k*=*t* and punished if *k*≠*t*.
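This closed form can be checked numerically. Taking the loss as the cross-entropy −ln *y*\*_{t}, the analytic gradient with respect to *W*_{3} is the outer product of (*y*\* − one-hot) with the layer input; the sketch below compares it against finite differences, with illustrative dimensions and the dropout mask and threshold omitted for brevity.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_fn(W3, h, t):
    """loss = -ln y*_t with y* = softmax(W3 h)."""
    return -np.log(softmax(W3 @ h)[t])

rng = np.random.default_rng(0)
W3, h, t = rng.normal(size=(4, 6)), rng.normal(size=6), 2
y = softmax(W3 @ h)
onehot = np.eye(4)[t]
# analytic gradient: row k gets (y*_k - 1)h if k = t, else y*_k h
analytic = np.outer(y - onehot, h)

# central finite-difference gradient for comparison
num = np.zeros_like(W3)
eps = 1e-6
for i in range(4):
    for j in range(6):
        Wp, Wm = W3.copy(), W3.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss_fn(Wp, h, t) - loss_fn(Wm, h, t)) / (2 * eps)
```

The two gradients agree to within finite-difference error, confirming that the row for the true class points opposite to the input (strengthening *w*_{t} under gradient descent) while the other rows are pushed down.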

### Model parameters and running time

In this study, the eight molecular datasets are employed for experimental comparisons. For DeepCSD, the Adam parameters follow the default setting [20], and the minimum learning rate was set to 10^{−5}. For SVM and RF, the parameter settings follow the default values of Python scikit-learn 0.21.2. For gcForest, we configured the model for a multi-class task with the default settings from the literature [22]. For XGBoost, *gamma*=0.1, *max_depth*=12, *subsample*=0.7, *colsample_bytree*=0.7, *early_stopping_rounds*=15. For DeepCC, all parameter settings follow the reference [2]. Moreover, we also report the running time of our model. DeepCSD is written in Python, and our deep learning framework is built on Keras. Experiments ran on a server with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and a GTX 1080 Ti GPU. The software versions are Python 3.7, Anaconda 3.7, Keras 2.1.0, and TensorFlow 1.14.0. Averaged over the eight molecular datasets, the running time is 457 seconds.