Collection and processing of splice site data
The true and false splice site (ss) sequences of Homo sapiens were collected from HS3D [15] (http://www.sci.unisannio.it/docenti/rampone/). The downloaded dataset contains a total of 2796 True donor Splice Sites (TSS) (http://www.sci.unisannio.it/docenti/rampone/EI_true.zip) and 90924 False donor Splice Sites (FSS) (http://www.sci.unisannio.it/docenti/rampone/EI_false_1.zip). Each sequence is 140 bp long, with the conserved GT di-nucleotide at positions 71 and 72.
Both introns and exons play important roles in the process of pre-mRNA splicing. More specifically, conservation at both the 5′ and 3′ ends of the intron, as well as exonic splicing enhancers [16, 17], is vital from the splicing point of view. Besides, the length of an exon is also an important property for proper splicing [18]. It has been shown in vivo that internal deletion of consecutively recognized internal exons to below ~50 bp often leads to exon skipping [19]. As far as intron length is concerned, Zhu et al. [20] carried out a functional analysis of minimal introns ranging from 50 to 100 bp and found that minimal introns are conserved in terms of both length and sequence. Hence, a window length of 102 bp [50 bp at the exon end + (GT + 50 bp) at the intron start] is considered here (Fig. 3).
Although identical sequences are less likely to occur with a longer window length, a redundancy check was still performed to remove identical TSS sequences from the dataset. To train the model efficiently, the same number of unique FSS (equal to the number of unique TSS) was drawn at random from the 90924 FSS. A sequence similarity search was then performed to analyze the sequence distribution, where each TSS sequence was compared with the remaining TSS sequences as well as with all FSS sequences, and vice versa. The percentage of similarity between any two sequences was computed by assigning a score of 1 for every nucleotide match and 0 for every mismatch, as explained below for two sample sequences.
Sequence 1: ATTCGTCATG
Sequence 2: TCTAGTTACG
Score : 0010110101
Similarity (%)=(5/10)*100=50
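As an illustration, a minimal R sketch of this pairwise similarity computation (the function name and inputs are ours, not part of the original pipeline):

```r
# Percent similarity between two equal-length sequences:
# a match at a position scores 1, a mismatch scores 0.
seq_similarity <- function(s1, s2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  stopifnot(length(a) == length(b))
  100 * sum(a == b) / length(a)
}

seq_similarity("ATTCGTCATG", "TCTAGTTACG")  # 50
```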
Further, we prepared a highly imbalanced dataset consisting of 5 % TSS and 95 % FSS to assess the performance of RF as well as to compare its performance with that of SVM and ANN.
Computation of Position Weight Matrix (PWM)
The sequences of both TSS and FSS were position-wise aligned separately, using the di-nucleotide GT as the anchor. This position-wise aligned sequence data was then used to compute the frequencies and probabilities of nucleotides at each position. From a given set S of N aligned sequences \( {s}_1,\dots, {s}_N \), each of length l, where \( {s}_k={s}_{k1},\dots, {s}_{kl} \) (\( {s}_{kj}\in \left\{A,\;C,\;G,\;T\right\} \), j = 1, 2, …, l), the PWM was computed as
$$ {p}_{ij}=\frac{1}{N}{\displaystyle \sum_{k=1}^N{I}_i\left({s}_{kj}\right)},\kern1em i=A,\;C,\;G,\;T;\kern0.75em j=1,2,\dots,\;l,\kern1em \mathrm{where}\kern0.5em {I}_i(q)=\left\{\begin{array}{ll}1\hfill & if\;i=q\hfill \\ {}0\hfill & otherwise\hfill \end{array}\right. $$
(1)
The PWM, with four rows (one each for A, C, G and T) and 102 columns (i.e., equal to the sequence length; Fig. 4), was then used for computing the di-nucleotide association scores.
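A minimal R sketch of the PWM computation in Eq. (1), assuming the aligned sequences are supplied as a character vector (function and object names are illustrative):

```r
# Build a 4 x l matrix of nucleotide probabilities (PWM) from N
# position-wise aligned sequences of equal length l.
compute_pwm <- function(seqs, alphabet = c("A", "C", "G", "T")) {
  mat <- do.call(rbind, strsplit(seqs, ""))              # N x l character matrix
  apply(mat, 2, function(col)
    sapply(alphabet, function(nt) mean(col == nt)))      # 4 x l matrix of p_ij
}

seqs <- c("ATACGTCATG", "TGTAGTTTCG", "ATGCGTACAC", "GACTGTTGCT", "CCTGGTGAGA")
pwm  <- compute_pwm(seqs)
pwm["A", 1]   # probability of A at the first position (2/5 = 0.4)
```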
Di-nucleotide association score
The adjacent di-nucleotide association scores were computed under three proposed encoding procedures, as follows:
1. In the first procedure (P-1), the association between any two nucleotides occurring at two adjacent positions was computed as the ratio of the observed frequency to the frequency expected under random occurrence of the di-nucleotide. For N position-wise aligned sequences, the numerator is the observed count of the di-nucleotide occurring at those two positions, whereas the denominator is N times 0.0625 (=1/16, the probability of any of the 16 di-nucleotides occurring at random).
2. In the second procedure (P-2), the association was computed as the ratio of the observed frequency to the expected frequency, where the expected frequency was computed from the PWM under the assumption of independence between positions.
3. In the third procedure (P-3), the di-nucleotide association was computed as the absolute value of the relative deviation of the observed frequency from the expected frequency, where the expected frequency was computed as in P-2.
In all three procedures, the scores were transformed to the logarithmic scale (base 2) to make them comparable. The computation of the di-nucleotide association scores is explained as follows:
Let \( {p}_j^i \) be the probability of occurrence of the ith nucleotide at the jth position, \( {p}_{j^{\prime}}^{i^{\prime }} \) be the probability of occurrence of the i′th nucleotide at the j′th position, and \( {n}_{j,{j}^{\prime}}^{i,{i}^{\prime }} \) be the frequency of occurrence of the ith and i′th nucleotides together at the jth and j′th positions respectively. Then the di-nucleotide association scores between the ith and i′th nucleotides occurring at the jth and j′th positions under P-1, P-2 and P-3 were computed using the following formulae
$$ \begin{array}{l}\left(\mathrm{P}-1\right)\to {s}_{\left(j,{j}^{\prime}\right)}^{\left(i,{i}^{\prime}\right)}={ \log}_2\left(\frac{n_{j,{j}^{\prime}}^{i,{i}^{\prime }}}{N*\;0.0625}\right)\\ {}\left(\mathrm{P}-2\right)\to {s}_{\left(j,{j}^{\prime}\right)}^{\left(i,{i}^{\prime}\right)}={ \log}_2\left(\frac{n_{j,{j}^{\prime}}^{i,{i}^{\prime }}}{N*{p}_j^i*{p}_{j^{\prime}}^{i^{\prime }}}\right)\ \mathrm{and}\\ {}\left(\mathrm{P}-3\right)\to {s}_{\left(j,{j}^{\prime}\right)}^{\left(i,{i}^{\prime}\right)}={ \log}_2\left|\frac{n_{j,{j}^{\prime}}^{i,{i}^{\prime }}-N*{p}_j^i*{p}_{j^{\prime}}^{i^{\prime }}}{N*{p}_j^i*{p}_{j^{\prime}}^{i^{\prime }}}\right|\;\end{array} $$
(2)
respectively, where \( {s}_{\left(j,{j}^{\prime}\right)}^{\left(i,{i}^{\prime}\right)} \) is the association score, N is the total number of sequence motifs in the dataset, i, i′ ∈ {A, T, G, C}, j = 1, 2, …, (window length − 1) and j′ = j + 1. A pseudo-count of 0.001 was added to the frequencies to avoid taking the logarithm of zero. For a clear understanding, the computation of the di-nucleotide association scores is illustrated below with an example of five sequences.
Positions : 0123456789
Sequence 1: ATACGTCATG
Sequence 2: TGTAGTTTCG
Sequence 3: ATGCGTACAC
Sequence 4: GACTGTTGCT
Sequence 5: CCTGGTGAGA
Using these sequences, the random, observed and expected (under independence) frequencies for the di-nucleotide AT occurring at positions 0 and 1 are computed as follows:
Observed frequency = number of times A and T occur together at positions 0 and 1 = 2
Random frequency = number of sequences × probability of any of the 16 di-nucleotide combinations occurring at random (1/16) = 5 × 0.0625 = 0.3125
Expected frequency under independence = number of sequences × probability of A occurring at position 0 × probability of T occurring at position 1 = 5 × (2/5) × (2/5) = 0.8
In a similar way, the frequencies can be calculated for all other possible di-nucleotide combinations (AA, AG, AC, TA, …, CC) occurring at all possible adjacent positions. Now, the association scores for the three procedures P-1, P-2 and P-3 can be calculated using equation (2) as
$$ \mathrm{P}\hbox{-} 1\to {s}_{\left(0,1\right)}^{\left(A,T\right)}={ \log}_2\left(\frac{Observed}{Random}\right)={ \log}_2\left(\frac{2}{0.3125}\right), $$
$$ \mathrm{P}\hbox{-} 2\to {s}_{\left(0,1\right)}^{\left(A,T\right)}={ \log}_2\left(\frac{Observed}{Expected}\right)={ \log}_2\left(\frac{2}{0.8}\right) $$
and
$$ \mathrm{P}\hbox{-} 3\to {s}_{\left(0,1\right)}^{\left(A,T\right)}={ \log}_2\left|\frac{Observed- Expected}{Expected}\right|={ \log}_2\left|\frac{2-0.8}{0.8}\right| $$
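A minimal R sketch reproducing these three scores (the function name is ours, and the exact placement of the 0.001 pseudo-count is an assumption; positions are 1-based in R, so j = 1 corresponds to position 0 above):

```r
# Association scores under P-1, P-2 and P-3 for nucleotides (i, i2)
# at adjacent positions (j, j + 1), following Eq. (2).
dinuc_scores <- function(seqs, i, i2, j, pseudo = 0.001) {
  mat <- do.call(rbind, strsplit(seqs, ""))                   # N x l matrix
  N   <- nrow(mat)
  obs <- sum(mat[, j] == i & mat[, j + 1] == i2) + pseudo     # observed count
  ex  <- N * mean(mat[, j] == i) * mean(mat[, j + 1] == i2)   # expected count
  c(P1 = log2(obs / (N * 0.0625)),
    P2 = log2(obs / ex),
    P3 = log2(abs((obs - ex) / ex)))
}

seqs <- c("ATACGTCATG", "TGTAGTTTCG", "ATGCGTACAC", "GACTGTTGCT", "CCTGGTGAGA")
dinuc_scores(seqs, "A", "T", 1)   # AT at positions 0 and 1 of the example
```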
Construction of scoring matrices
For a sequence l bp long, l − 1 pairs of adjacent positions are possible. For each pair, 16 di-nucleotide combinations (AA, AT, …, CG, CC) are possible. Thus, scoring matrices, each of order 16 × (l − 1), were constructed from the di-nucleotide association scores under each of the three procedures. Figure 5 shows a sample scoring matrix for the 102 bp window length.
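A sketch of building such a scoring matrix for one procedure (P-2 shown; the pseudo-count added to the expected count is an assumption to keep the logarithm defined):

```r
# 16 x (l - 1) matrix of P-2 association scores; rows are the 16
# di-nucleotides, columns are the l - 1 adjacent position pairs.
build_scoring_matrix <- function(seqs, pseudo = 0.001) {
  mat   <- do.call(rbind, strsplit(seqs, ""))
  N     <- nrow(mat); l <- ncol(mat)
  nts   <- c("A", "C", "G", "T")
  pairs <- as.vector(outer(nts, nts, paste0))          # AA, CA, ..., TT
  scores <- matrix(NA_real_, nrow = 16, ncol = l - 1,
                   dimnames = list(pairs, paste0("pos", 1:(l - 1))))
  for (j in 1:(l - 1)) {
    for (d in pairs) {
      i  <- substr(d, 1, 1); i2 <- substr(d, 2, 2)
      obs <- sum(mat[, j] == i & mat[, j + 1] == i2) + pseudo
      ex  <- N * mean(mat[, j] == i) * mean(mat[, j + 1] == i2) + pseudo
      scores[d, j] <- log2(obs / ex)
    }
  }
  scores
}
```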
Ten-fold cross-validation and encoding of splice site sequence
The TSS and FSS sequence datasets were separately divided into 10 random non-overlapping sets for the purpose of 10-fold cross validation. In each fold, one set of TSS and one set of FSS together formed the test dataset, and the remaining 9 sets of TSS and 9 sets of FSS together formed the training dataset. This was done because the 10-fold cross validation procedure is a standard experimental technique for determining how well a classifier performs on a test dataset [21]. For each training set, scoring matrices for TSS and FSS were constructed independently, and the difference matrix was derived by subtracting the TSS scoring matrix from the FSS scoring matrix. The training and test datasets were then encoded by passing the corresponding sequences through the difference matrix (Fig. 6), where each sequence was transformed into a vector of scores of length l − 1. A detailed explanation of the sequence encoding is provided in Additional file 1.
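A minimal sketch of the encoding step, assuming TSS and FSS scoring matrices built as sketched above (object names are illustrative):

```r
# Encode one sequence into l - 1 scores by looking up, for each adjacent
# position pair, the difference-matrix entry of the observed di-nucleotide.
encode_sequence <- function(seq, diff_matrix) {
  nts <- strsplit(seq, "")[[1]]
  sapply(seq_len(length(nts) - 1), function(j)
    diff_matrix[paste0(nts[j], nts[j + 1]), j])
}

# diff_matrix <- build_scoring_matrix(fss_train) - build_scoring_matrix(tss_train)
# x_train     <- t(sapply(c(tss_train, fss_train), encode_sequence,
#                         diff_matrix = diff_matrix))
```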
Classification using Random Forest
Let L(y, x) be the learning dataset, where x is a matrix of n rows (observations) and p columns (variables) and y is the response variable that takes values from K classes. The RF then consists of an ensemble of B tree classifiers, where each classifier is constructed on a bootstrap sample of the learning dataset. Each classifier of the RF votes each test instance into one of the pre-defined K classes, and each test instance is finally assigned the label of the winning class. As the individual trees are constructed on bootstrap samples, on average 36.8 % \( \left[{\left(1-\frac{1}{n}\right)}^n\approx \frac{1}{e},\left(e\approx 2.718\right)\right] \) of the instances play no role in the construction of a given tree; these are called Out-Of-Bag (OOB) instances. The OOB instances are the source of data used in RF for estimating the prediction error (Fig. 7). RF is computationally very efficient and offers high prediction accuracy with low sensitivity to noisy data. For the classification of TSS and FSS, RF was chosen over other classifiers because it is non-parametric (i.e., it makes no assumption about the probability distribution of the dataset) and is able to handle large datasets. For more details about RF, one may refer to [10].
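A brief sketch of fitting an RF on the encoded data with the “randomForest” package and reading off the OOB error estimate (the toy x and y below merely stand in for the encoded score matrix and the TSS/FSS labels):

```r
library(randomForest)

# Toy stand-ins for the encoded data: x is an n x (l - 1) score matrix and
# y a factor with levels "TSS" and "FSS" (real data come from the encoding step).
set.seed(1)
x <- matrix(rnorm(200 * 101), nrow = 200, ncol = 101)
y <- factor(rep(c("TSS", "FSS"), each = 100))

rf <- randomForest(x = x, y = y, ntree = 500)

# OOB error rate, estimated from the ~36.8 % of instances left out of each
# bootstrap sample; err.rate has one row per tree added to the forest.
tail(rf$err.rate, 1)
```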
Tuning of parameters
There are two important parameters in RF, viz., the number of variables to choose at each node for splitting (mtry) and the number of trees to grow in the forest (ntree). Tuning of these parameters is required to achieve maximum prediction accuracy.
mtry
A small value of mtry produces less correlated trees, which consequently results in lower variance of the predictions. Although Breiman [10] recommended the integer part of log2(p + 1) predictors per node, this may not always provide the best possible result. Thus, the RF model was executed with different mtry values, i.e., 1, √p, 20 %*p, 30 %*p, 50 %*p and p, to find the optimal one. The parameterization that generated the lowest and most stable OOB Error Rate (OOB-ER) was chosen as the optimal mtry.
ntree
The number of trees that must be grown in the forest to obtain a stable OOB-ER is often not known in advance. Moreover, the OOB-ER depends entirely on the type of data, with stronger predictors leading to quicker convergence. Therefore, the RF was grown with different numbers of trees, and the number of trees after which the error rate stabilized was taken as the optimal ntree.
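A sketch of this tuning, reusing the toy x and y from the previous sketch: each candidate mtry is scored by its final OOB-ER, and the err.rate trace is inspected to see where it stabilizes.

```r
library(randomForest)

p         <- ncol(x)
mtry_grid <- unique(round(c(1, sqrt(p), 0.2 * p, 0.3 * p, 0.5 * p, p)))

# OOB error rate for each candidate mtry (500 trees each).
oob_er <- sapply(mtry_grid, function(m) {
  rf <- randomForest(x = x, y = y, mtry = m, ntree = 500)
  rf$err.rate[nrow(rf$err.rate), "OOB"]
})
names(oob_er) <- mtry_grid
oob_er                       # choose the mtry with the lowest, stable OOB-ER

# For ntree: grow a larger forest and keep the number of trees after which
# the OOB error curve flattens out.
rf_best <- randomForest(x = x, y = y,
                        mtry = mtry_grid[which.min(oob_er)], ntree = 1000)
plot(rf_best$err.rate[, "OOB"], type = "l",
     xlab = "ntree", ylab = "OOB error rate")
```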
Margin function
The margin function is one of the important features of RF; it measures the extent to which the average vote for the right class exceeds the average vote for any other class. Let (x, y) be a labeled observation from the training set of n observations, where each attribute vector x is labeled with a class yj (j = 1, 2 for the binary case), i.e., the correct class y is either y1 or y2. Further, let prob(yj) be the probability of class yj. The margin function of the labeled observation (x, y) is then given by
$$ m\left(\mathbf{x},\;y\right)= prob\left[h\left(\mathbf{x}\right)=y\right]-\underset{j:\;{y}_j\ne y}{ \max }\; prob\left[h\left(\mathbf{x}\right)={y}_j\right] $$
If m(x, y) > 0, then h(x) correctly classifies y, where h(x) denotes a classifier that predicts the label y for an observation x. The value of the margin function always lies between −1 and 1.
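The “randomForest” package computes this quantity directly; a brief sketch using the rf object fitted above:

```r
library(randomForest)

# Margin of each training observation: fraction of (OOB) votes for the true
# class minus the maximum fraction for any other class; positive values
# correspond to correct classification and the range is -1 to 1.
m <- margin(rf)
summary(m)
plot(m)   # sorted margins, coloured by class
```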
Implementation
The RF code was originally written in Fortran by Breiman and Cutler and is also available as the R package “randomForest” [22]; this package was used to run the RF model on a Windows server (82/GHz and 32 GB memory). Run time depended on the data size and mtry, ranging from about 1 second to over 10 seconds per tree.
Performance metrics
The performance metrics, viz., Sensitivity or True Positive Rate (TPR), Specificity or True Negative Rate (TNR), F-measure, Weighted Accuracy (WA), G-mean and Matthews Correlation Coefficient (MCC), all of which are functions of the confusion matrix, were used to evaluate the performance of RF. The confusion matrix contains information about the actual and predicted classes. Figure 8 shows the confusion matrix for a binary classifier, where TP is the number of TSS predicted as TSS, TN is the number of FSS predicted as FSS, FN is the number of TSS incorrectly predicted as FSS and FP is the number of FSS incorrectly predicted as TSS. The different performance metrics are defined as follows:
$$ \mathrm{T}\mathrm{P}\mathrm{R}\ \mathrm{or}\ \mathrm{Sensitivity} = \frac{TP}{TP+FN}\kern0.36em \left( Same\;as\; recall\;for\; binary\; classification\right) $$
$$ \mathrm{T}\mathrm{N}\mathrm{R}\ \mathrm{or}\ \mathrm{Specificity} = \frac{TN}{TN+FP} $$
$$ \mathrm{F}-{\mathrm{measure}}^{\left(\alpha \right)}=\frac{\left(1+\alpha \right)\times recall\times precision}{\left(\alpha \times recall\right)+ precision}\kern0.48em \left(\alpha\;takes\; discrete\; values\right),\ \mathrm{Precision}=\frac{TP}{TP+FP} $$
$$ \mathrm{F}-{\mathrm{measure}}^{\left(\beta \right)}=\frac{\left(1+{\beta}^2\right)\times recall\times precision}{\left({\beta}^2\times recall\right)+ precision}\kern0.24em \left(\beta\;takes\; discrete\; values\right) $$
$$ \mathrm{W}\mathrm{A}=\frac{1}{2}\left(\frac{TP}{TP+FN}+\frac{TN}{TN+FP}\right) $$
$$ \mathrm{G}-\mathrm{Mean} = \sqrt{\left(\frac{TP}{TP+FN}\right)\left(\frac{TN}{TN+FP}\right)} $$
$$ \mathrm{M}\mathrm{C}\mathrm{C}=\frac{\left(TP\times TN\right)-\left(FP\times FN\right)}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}} $$
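A minimal R sketch computing these metrics from the four confusion-matrix counts (the β form of the F-measure is used; the function name and the example counts are illustrative, not results from this study):

```r
# Performance metrics from confusion-matrix counts.
perf_metrics <- function(TP, FN, FP, TN, beta = 1) {
  tpr  <- TP / (TP + FN)                      # sensitivity / recall
  tnr  <- TN / (TN + FP)                      # specificity
  prec <- TP / (TP + FP)                      # precision
  f    <- (1 + beta^2) * tpr * prec / (beta^2 * tpr + prec)
  wa   <- (tpr + tnr) / 2                     # weighted accuracy
  gm   <- sqrt(tpr * tnr)                     # G-mean
  mcc  <- (TP * TN - FP * FN) /
          sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  c(TPR = tpr, TNR = tnr, Precision = prec, F_measure = f,
    WA = wa, G_mean = gm, MCC = mcc)
}

perf_metrics(TP = 240, FN = 10, FP = 20, TN = 230)
```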
Comparison of RF with SVM and ANN
The performance of RF was compared with that of SVM [23] and ANN [24] using the same dataset that was used to analyze the performance of RF. The “e1071” [25] and “RSNNS” [26] packages of R were used to implement SVM and ANN respectively. The SVM and ANN classifiers were chosen for comparison because these two techniques have been the most commonly used for prediction purposes in the field of bioinformatics. In classification, SVM separates the different classes of data by a hyper-plane; the optimal hyper-plane is the one that separates the classes with the maximum margin (a clear gap as wide as possible). The sample observations on the margins are called support vectors, and they carry all the relevant information for classification [23]. ANNs are non-linear mapping structures based on the function of neural networks in the human brain. They are powerful modeling tools, especially when the underlying relationship is unknown, as they can identify and learn correlated patterns between input datasets and the corresponding target values. After training, an ANN can be used to predict the outcome of new independent input data [24]. The SVM model was trained with the radial basis function (gamma = 0.01) as the kernel. In the ANN model, a multilayer perceptron was used with “Randomize_Weights” as the initialization function, “Std_Backpropagation” as the learning function and “Act_Logistic” as the hidden activation function. The 10-fold cross validation was performed for SVM and ANN in the same way as for RF. All three techniques were then compared in terms of the performance metrics, and the MCC values of RF, SVM and ANN were plotted to analyze their consistency over the 10 folds of cross validation. A similar comparison between RF, SVM and ANN was also made using the imbalanced dataset. To handle the imbalanced data, one additional parameter, i.e., cutoff, was used in RF: based on the degree of imbalance in the dataset, a 90 % cutoff was assigned to the major class (the class with the larger number of observations), i.e., FSS, and 10 % to the minor class (the class with the smaller number of observations), i.e., TSS. Similarly, one additional parameter, i.e., class.weights, was used in the SVM model, with weights of 19 and 1 for TSS and FSS respectively (keeping in view the proportion of TSS and FSS in the dataset). However, no parameter for handling class imbalance was found in the “RSNNS” package; therefore, the same ANN model was trained on the imbalanced data.
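A sketch of the SVM and ANN setups described above, using the “e1071” and “RSNNS” packages with the toy x and y from the RF sketch; the kernel, gamma, class.weights and the three RSNNS function names are those stated in the text, while the hidden-layer size and remaining arguments are illustrative or package defaults:

```r
library(e1071)
library(RSNNS)

# SVM with radial basis kernel; class.weights as in the imbalanced-data
# experiments (omit it for the balanced data).
svm_fit <- svm(x = x, y = y, kernel = "radial", gamma = 0.01,
               class.weights = c(TSS = 19, FSS = 1))

# Multilayer perceptron with the stated initialization, learning and hidden
# activation functions (hidden-layer size of 5 is illustrative).
y_num   <- as.numeric(y == "TSS")             # RSNNS expects numeric targets
ann_fit <- mlp(x = x, y = y_num, size = 5,
               initFunc = "Randomize_Weights",
               learnFunc = "Std_Backpropagation",
               hiddenActFunc = "Act_Logistic")

svm_pred <- predict(svm_fit, x)
ann_pred <- predict(ann_fit, x)               # predicted probabilities in [0, 1]
```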
In the case of the imbalanced test dataset, the performance metrics were computed by assigning a weight w1 to TP and FN and a weight w2 to FP and TN, where \( {w}_1={n}^{FSS}/\left({n}^{TSS}+{n}^{FSS}\right) \) and \( {w}_2={n}^{TSS}/\left({n}^{TSS}+{n}^{FSS}\right) \), with nTSS and nFSS being the numbers of TSS and FSS in the test dataset respectively. Further, the Mann-Whitney U test at the 5 % level of significance was performed to evaluate the difference among the prediction accuracies of RF, SVM and ANN, using the stats package of R.
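A sketch of the weighting and of the Mann-Whitney U test, which the stats package carries out via wilcox.test; the counts and per-fold accuracy vectors below are placeholders, not results from this study:

```r
# Class-size weights for the imbalanced test set (counts are illustrative).
n_tss <- 50; n_fss <- 950
w1 <- n_fss / (n_tss + n_fss)     # weight applied to TP and FN
w2 <- n_tss / (n_tss + n_fss)     # weight applied to FP and TN
# e.g. perf_metrics(TP = w1 * TP, FN = w1 * FN, FP = w2 * FP, TN = w2 * TN)

# Mann-Whitney U test on the 10 per-fold accuracies of two classifiers.
acc_rf  <- c(0.95, 0.94, 0.96, 0.95, 0.93, 0.96, 0.95, 0.94, 0.95, 0.96)
acc_svm <- c(0.93, 0.92, 0.94, 0.93, 0.92, 0.94, 0.93, 0.92, 0.93, 0.94)
wilcox.test(acc_rf, acc_svm, exact = FALSE)   # p < 0.05 => significant difference
```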
Comparison with other prediction tools
The performance of the proposed approach was also compared with that of other splice site prediction tools such as MaxEntScan (http://genes.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html), SpliceView (http://bioinfo4.itb.cnr.it/~webgene/wwwspliceview_ex.html) and NNSplice (http://www.fruitfly.org/seq_tools/splice.html) using an independent test set. Besides, three more methods provided with MaxEntScan, viz., Maximal Dependency Decomposition (MDD), the first-order Markov Model (MM1) and the Weighted Matrix Method (WMM), were also used for comparison. The independent test set was prepared using two genes (AF102137.1 and M63962.1) chosen at random and downloaded from GenBank (http://www.ncbi.nlm.nih.gov/genbank/). The approaches were compared using the values of the performance metrics.
Web server
A web server for the prediction of donor splice sites was developed using HTML and PHP. The developed R code is executed in the background by a PHP script upon submission of sequences in FASTA format. The web page was designed to allow the user to input a sequence, select the species (human) and choose the encoding procedure. In the server, the model has been trained with the human splice site data, so the user only has to supply the test sequence(s) of interest to predict donor splice sites.