Predicting linear B-cell epitopes using amino acid anchoring pair composition

Accurate identification of linear B-cell epitopes plays an important role in peptide vaccine design, immunodiagnosis, and antibody production. Although several prediction methods have been reported, their unsatisfactory accuracy has limited the broad use of linear B-cell epitope prediction. Therefore, a reliable model with significantly improved prediction accuracy is highly desirable. In this study, we developed a novel model for the prediction of linear B-cell epitopes, APCpred, which combines amino acid anchoring pair composition (APC) with the support vector machine (SVM) method. Systematic comparisons with existing prediction models demonstrated that APCpred improved prediction accuracy both in fivefold cross-validation on training datasets and on independent blind datasets. In the fivefold cross-validation test with the Chen872 dataset at a window size of 20, APCpred achieved an AUC of 0.809 and an accuracy (ACC) of 72.94%, which was higher than that of existing models, e.g., Bayesb, Chen's AAP method, and the enhanced combination of AAP with five AP scales. In the fivefold cross-validation test with the ABC16 dataset, APCpred achieved an AUC of 0.794 and an ACC of 73.00% at a window size of 16, and it attained an AUC of 0.748 and an ACC of 67.96% on the Blind387 dataset after being trained with the ABC16 dataset. Trained with the Lbtope_Confirm dataset, APCpred achieved an ACC of 55.09% on the FBC934 dataset. Across sequence window sizes from 12 to 20, the final APCpred model on the homology-reduced dataset achieved an optimal AUC of 0.748 and an ACC of 68.43% in fivefold cross-validation at a window size of 20. These results indicate that the APC features improve the prediction of linear B-cell epitopes. Based on this study, a web server has been developed for online prediction of linear B-cell epitopes, which is freely accessible at http://ccb.bmi.ac.cn/APCpred/.

protein structure [2]. Although it is believed that the majority of B-cell epitopes are discontinuous, the detection of continuous epitopes still plays an important role in experimental design, immunodiagnostic tests, and vaccine production [3,4]. However, developing a reliable computational method for predicting linear B-cell epitopes has been a daunting task with little success.
Previously, several studies focused on the correlations between the physicochemical properties of amino acids and linear B-cell epitopes within protein sequences. As a result, several epitope prediction methods were constructed using physicochemical properties of amino acids, such as hydrophilicity [5], flexibility [6], turns [7], and solvent accessibility propensity scales [8]. These prediction models are simply based on the average of the physicochemical values of the amino acids within a window. However, they performed only marginally better than random selection [9]. Thus, new approaches are needed to improve the prediction of linear B-cell epitopes.
Recently, some studies have attempted to improve prediction accuracy using machine learning approaches. For example, ABCpred [10] was developed using artificial neural networks. This model was constructed and evaluated using fivefold cross-validation on a non-redundant training dataset composed of 700 B-cell epitopes and 700 non-epitope peptides. Its input sequences ranged from 10 to 20 amino acids in length, and the best performance, a prediction accuracy of 65.93%, was achieved when the ABCpred model was trained using a recurrent neural network on peptides of 16 amino acids in length (ABC16). The model was further validated on a blind testing dataset (Blind387), where it achieved a prediction accuracy of 66.41%.
Furthermore, Chen et al. [11] found that certain amino acid pairs (AAPs) tend to occur more frequently in B-cell epitopes. An AAP propensity scale was therefore used in combination with a support vector machine (SVM) to construct a prediction model, which reached an optimal accuracy of 71.09% in fivefold cross-validation at a window size of 20 on the Chen872 dataset, containing 872 B-cell epitopes and 872 non-B-cell epitopes. Moreover, they combined the AAP scale with five amino acid propensity (AP) scales using the SVM classifier, and this combination method achieved a better prediction accuracy of 72.54%. EL-Manzalawy et al. [12] reported their own implementation of the AAP method (AAP BCPred) and developed a model (BCpred) superior to those previous methods by utilizing SVM string kernels, achieving the highest AUC (area under the receiver operating characteristic curve) of 0.758. In their results, both the BCpred and AAP BCPred models achieved improved prediction accuracies in fivefold cross-validation on the ABC16 dataset, but attained lower prediction accuracy than the ABCpred model on the blind test dataset [12]. Wee et al. [13] developed an SVM prediction model utilizing Bayes Feature Extraction (Bayesb). The Bayesb model achieved an accuracy of 68.50% and an AUC of 0.74 when tested on Chen's dataset. Moreover, Singh et al. [14] recently reported an improved method called LBtope for linear B-cell epitope prediction using large datasets derived from the Immune Epitope Database. However, the testing performance of LBtope on some benchmark datasets remained unsatisfactory.
In this study, we present a novel method, APCpred, for linear B-cell epitope prediction, which combines amino acid anchoring pair composition (APC) with support vector machines (SVMs) using peptides of diverse lengths (12- to 20-mers). The performance of this model was evaluated using different public datasets.

Datasets
In order to develop prediction models, we collected six datasets (Table 1). The first dataset, BCI727, was derived from the Bcipep database, which contains 2479 linear B-cell epitopes [15]. Each sample was a 20-mer peptide. If an epitope was shorter than 20 amino acids, it was extended at both termini with an equal number of residues taken from its original antigenic sequence [10]. If an epitope was longer than 20 amino acids, the extra amino acids were removed from both termini. In addition, we removed duplicated and highly homologous peptides by filtering the dataset at 80% sequence identity using the CD-HIT program [16]. This yielded 727 peptides, which served as positive samples (B-cell epitopes). A total of 727 non-epitope peptides were generated by randomly extracting 20-mer peptide sequences from the Swiss-Prot database, such that none of these negative instances occurred among the positive instances. This dataset was used as the training dataset to develop our prediction model.
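The length normalisation described above can be sketched in Python as follows. This is only an illustration: the function name `to_fixed_window`, the 0-based end-exclusive coordinates, and the way the window is shifted back inside the antigen when it would run past a terminus are assumptions made for this example, since the text does not say how such edge cases were handled.

```python
def to_fixed_window(antigen: str, start: int, end: int, window: int = 20) -> str:
    """Return a `window`-length peptide built around the epitope antigen[start:end].

    Shorter epitopes are extended with flanking antigen residues on both sides;
    longer epitopes are trimmed at both ends. Coordinates are 0-based, end-exclusive.
    """
    if len(antigen) < window:
        raise ValueError("antigen is shorter than the requested window")
    length = end - start
    # Positive diff: residues to add; negative diff: residues to remove.
    diff = window - length
    left = diff // 2  # floor division also handles negative diff
    new_start = start - left
    # Assumption: if the window overruns a terminus, slide it back inside.
    new_start = max(0, min(new_start, len(antigen) - window))
    return antigen[new_start:new_start + window]
```

For a 6-residue epitope in the middle of a 30-residue antigen, the function adds seven flanking residues on each side; for a 26-residue epitope it removes three from each end.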
The second dataset, Chen872, was released by Chen et al. [11]. It contains 872 epitopes and 872 non-epitopes, each of which is a 20-mer peptide. This dataset was used to evaluate our APCpred method in comparison with Bayesb, Chen's AAP method, and the combination method of AAP and AP in terms of fivefold cross-validation.
The third dataset, ABC16, was obtained from the ABCpred model. It contains 700 epitopes and 700 non-epitopes, each of which is a 16-mer peptide [10]. This dataset was used to evaluate APCpred in comparison with BCpred, AAP BCpred, and ABCpred in terms of fivefold cross-validation [12]. In addition, ABC16 was used as the training dataset for the blind test on the next dataset, Blind387 [10].
The fourth dataset, Blind387, is composed of 187 epitopes and 200 16-mer non-epitope peptides [10]. This dataset was used as a blind dataset to compare the performance of our model with that of BCpred, AAP BCpred, and ABCpred.
The fifth dataset, Lbtope_Confirm, was derived from IEDB by Singh [14]. This dataset contains 1042 unique B-cell epitopes and 1795 non-epitopes of variable lengths.
The sixth dataset, FBC934, was constructed by EL-Manzalawy [17]. FBC934 contains 934 B-cell epitopes and 934 non-epitopes of variable lengths. Among the datasets above, BCI727, Chen872, and ABC16 were used to construct prediction models, which were evaluated by fivefold cross-validation. In addition, the dataset Blind387 was used as an independent dataset to test the performance of the models built from the ABC16 dataset. Finally, both APCpred and LBtope models were developed using the dataset Lbtope_Confirm, and their performances were compared using the dataset FBC934.

Feature extraction and machine learning method
To construct the prediction model for linear B-cell epitopes, amino acid anchoring pairs of short sequences were employed to represent the epitopes and non-epitopes. Feature selection was used to filter out noise in the sequence profile data. The prediction model was built and evaluated by fivefold cross-validation. During validation, feature selection was performed as part of the cross-validation: four fifths of the data were used for feature selection and training, while the remaining fifth was held out as independent evaluation data that played no part in the feature selection. In addition, the final model was built by applying feature selection and the machine learning method to the full training dataset, and was tested with an independent dataset. The evaluation design is shown in Figure 1. The fivefold cross-validation performances were then compared among different methods.

Amino acid anchoring pair composition
For each sequence in a dataset, we extracted the amino acid anchoring pair composition (APC) by decomposing the protein or peptide sequence into 2-mer or n-mer subsequences. We propose that the two terminal amino acids of a subsequence form an anchoring pair, whose members may anchor each other to form a relatively stable structure, and that the pair composition can be used as features of the peptide sequence. For example, the sequence 'QAGTSLS' can be represented by the features shown in Figure 2. Each feature was weighted by its frequency divided by its maximum possible count, as follows: if a feature 'A.{i}A' (where each A denotes one of the 20 amino acids and the integer i denotes the number of intervening amino acids) occurs k times in a short sequence of window size l, then the quantity of 'A.{i}A' in that sequence is calculated as k/(l − i − 1). To scan all possible pairs, i ranges from 0 to I in steps of 1, where the integer I denotes the maximum value of i. In total, 400 × (I + 1) features describe each epitope or non-epitope sequence. The setting of I is an important factor for prediction. To find the best I, we tested I = 2, I = 3, and I = 4 on the BCI727 dataset at a window size of 20. The best setting was then used in APCpred.
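A minimal Python sketch of this feature extraction is given below. It is an illustration rather than the authors' implementation: the function name `apc_features`, the parameter name `max_gap` (standing in for I), and the dictionary output keyed by strings of the form 'X.{i}Y' are assumptions made for the example.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def apc_features(peptide: str, max_gap: int = 3) -> dict:
    """Map each anchoring pair 'X.{i}Y' to k / (l - i - 1).

    k is the number of times residue X is followed by residue Y with exactly
    i residues in between; l is the peptide length.
    """
    l = len(peptide)
    counts = Counter()
    for gap in range(max_gap + 1):
        for pos in range(l - gap - 1):
            counts[(peptide[pos], gap, peptide[pos + gap + 1])] += 1
    features = {}
    for gap in range(max_gap + 1):
        positions = l - gap - 1  # number of places a pair with this gap can occur
        for a, b in product(AMINO_ACIDS, repeat=2):
            key = f"{a}.{{{gap}}}{b}"
            features[key] = counts[(a, gap, b)] / positions if positions > 0 else 0.0
    return features
```

With `max_gap=3` the dictionary has 400 × 4 = 1600 entries. For 'QAGTSLS' (l = 7), the pair 'S.{1}S' occurs once among the five possible positions, so its value is 1/5.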

Feature selection
Since many APC features are useless for discriminating epitopes from non-epitopes, we employed Student's t-test to remove these noise features without affecting the classification of epitopes and non-epitopes. The p-value cutoff is an important factor in selecting features for model building. A traditional level such as 0.05 would normally be a good cutoff. However, when we first tried p < 0.05 to eliminate the non-discriminating anchoring pair compositions, APCpred yielded poor AUCs. This might be due to background noise in the dataset we used. Bursac et al. [18] and Budtz-Jørgensen et al. [19] used p-value cutoffs of 0.25 and 0.2, respectively, in their studies and obtained satisfactory selections. Therefore, to find the best p-value cutoff, we tried p < 0.2, p < 0.4, p < 0.5, p < 0.6, and p ≤ 1 on the BCI727 dataset at a window size of 20. The best setting was then used in APCpred.
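The filtering step can be sketched in Python as below, using only the standard library. Several choices here are assumptions made for illustration: the pooled-variance (equal-variance) form of Student's t-test, the Simpson-rule integration of the t density to obtain the two-sided p-value, the handling of zero-variance features, and the function names. In practice a statistics library routine such as `scipy.stats.ttest_ind` would normally be used instead of the hand-written integration.

```python
import math
from statistics import mean, variance

def t_two_sided_p(t: float, df: float, steps: int = 2000) -> float:
    """Two-sided p-value for Student's t, by Simpson integration of the pdf."""
    t = abs(t)
    if t == 0.0:
        return 1.0
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += pdf(i * h) * (4 if i % 2 else 2)
    area = total * h / 3  # integral of the pdf over [0, |t|]
    return max(0.0, 1.0 - 2.0 * area)

def student_t_pvalue(xs, ys) -> float:
    """Pooled-variance two-sample t-test p-value (needs >= 2 values per group)."""
    n1, n2 = len(xs), len(ys)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(xs) + (n2 - 1) * variance(ys)) / df
    if pooled == 0.0:
        # Assumption: constant features are either useless or perfectly separating.
        return 1.0 if mean(xs) == mean(ys) else 0.0
    t = (mean(xs) - mean(ys)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t_two_sided_p(t, df)

def select_features(pos_rows, neg_rows, p_cutoff: float = 0.5):
    """Return the column indices whose p-value is below the cutoff."""
    keep = []
    for j in range(len(pos_rows[0])):
        p = student_t_pvalue([r[j] for r in pos_rows], [r[j] for r in neg_rows])
        if p < p_cutoff:
            keep.append(j)
    return keep
```

Here `pos_rows` and `neg_rows` are lists of equal-length feature vectors for epitopes and non-epitopes. Inside cross-validation, `select_features` would be called on the training folds only, and the returned indices applied unchanged to the held-out fold.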

Support vector machines and kernel methods
We applied support vector machines (SVMs) to construct the prediction models. SVMs are a class of supervised machine learning methods used for classification and regression, and have been widely used in algorithm and modeling studies [20]. Given a set of labeled training data (x_i, y_i), where x_i ∈ R^d and y_i ∈ {+1, −1}, training an SVM classifier involves finding a hyperplane that maximizes the geometric margin between the positive and negative training samples. In this study, each component of the input vector x was the weighted frequency of an anchoring pair occurring in the peptide.
When performing classification with the SVM classifier, the RBF (radial basis function) kernel was used because x is a combination of different features of a peptide. The RBF kernel is by far the most popular choice of kernel in SVMs owing to its localization property. It is defined as [21]:
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$
where γ is a control parameter reflecting the kernel width.
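As a small illustration of the localization property, the sketch below evaluates an RBF kernel in Python. It assumes the form exp(−γ‖x − y‖²) used by LIBSVM; the function name `rbf_kernel` is chosen for this example only.

```python
import math

def rbf_kernel(x, y, gamma: float) -> float:
    """RBF kernel value exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

The value is 1 for identical vectors and decays toward 0 as the vectors move apart, so each training sample influences only its neighbourhood; a larger γ makes that neighbourhood narrower.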
For the RBF kernel, we found that tuning the SVM cost parameter C and the RBF kernel parameter γ was necessary to obtain satisfactory SVM performance. We tuned these parameters using a two-dimensional grid search over the ranges C = 2^-12, 2^-10, …, 2^2 and γ = 2^-5, 2^-3, …, 2^9. It should be noted that the parameter optimization was performed using only the training data in the inner loop. Chih-Jen Lin's LIBSVM [22] was employed for both training and evaluating the epitope prediction models.
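The grid described above can be sketched in Python as follows. The helper names `power_grid` and `grid_search` and the `score_fn` callback are hypothetical; `score_fn(C, gamma)` stands for whatever inner-loop evaluation on the training data is used (for example, a cross-validated AUC of an SVM trained with those parameters) and is not part of LIBSVM.

```python
from itertools import product

def power_grid(lo_exp: int, hi_exp: int, step: int = 2):
    """Powers of two from 2**lo_exp to 2**hi_exp, with exponents `step` apart."""
    return [2.0 ** e for e in range(lo_exp, hi_exp + 1, step)]

C_GRID = power_grid(-12, 2)      # 2^-12, 2^-10, ..., 2^2
GAMMA_GRID = power_grid(-5, 9)   # 2^-5, 2^-3, ..., 2^9

def grid_search(score_fn, c_grid=C_GRID, gamma_grid=GAMMA_GRID):
    """Return (best_score, best_C, best_gamma) over the full grid.

    score_fn must be computed from training data only, so that the
    held-out fold never influences the choice of parameters.
    """
    best = None
    for c, gamma in product(c_grid, gamma_grid):
        score = score_fn(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best
```

Each grid has eight values, so one search evaluates 64 parameter pairs.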

Fivefold cross-validation
To estimate parameters in an unbiased manner during feature extraction, stratified fivefold cross-validation was applied (Figure 1). Specifically, the sample dataset was randomly divided into five subsets, each containing an equal number of peptides with a 1:1 ratio of epitopes to non-epitopes. One fifth of the dataset was used as a testing dataset that took no part in feature selection, while feature selection and training of the learner were performed on the other four fifths. This procedure was repeated five times, each time choosing a different subset for testing. The performance pooled over the five testing sets was taken as the final estimated performance on the training dataset.
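The procedure can be sketched in Python as below. The names `stratified_folds` and `cross_validate`, the fixed random seed, and the `fit`/`predict` callbacks are assumptions made for illustration; `fit` is meant to stand for the whole training pipeline (feature selection followed by SVM training), so that the held-out fold is never seen during feature selection.

```python
import random

def stratified_folds(labels, k: int = 5, seed: int = 0):
    """Assign sample indices to k folds, keeping the class ratio in each fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class out round-robin
    return folds

def cross_validate(samples, labels, fit, predict, k: int = 5, seed: int = 0):
    """Return one out-of-fold prediction per sample.

    fit(train_samples, train_labels) -> model   (feature selection + training)
    predict(model, sample) -> score
    """
    folds = stratified_folds(labels, k, seed)
    scores = [None] * len(samples)
    for held_out in range(k):
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        model = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in folds[held_out]:
            scores[i] = predict(model, samples[i])
    return scores
```

The returned out-of-fold scores from all five test folds can then be pooled to compute a single performance estimate.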

Performance evaluation
Both threshold-dependent and threshold-independent measures were used to evaluate the performance of fivefold cross-validation on the training datasets and of testing on the independent datasets. For threshold-dependent measures, we used four commonly used parameters to evaluate the performance of the prediction algorithms: the prediction accuracy (A_CC), sensitivity (S_en), specificity (S_pe), and Matthews correlation coefficient (M_CC). M_CC takes values in the range from −1 to +1, and the closer the value is to +1, the better the predictor. A_CC, S_en, S_pe, and M_CC are defined as follows:
$$A_{CC} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$S_{en} = \frac{TP}{TP + FN}$$
$$S_{pe} = \frac{TN}{TN + FP}$$
$$M_{CC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$
where TP, FP, TN, and FN denote the numbers of true positive, false positive, true negative, and false negative samples, respectively.
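These four measures follow their standard definitions and can be computed from the confusion-matrix counts as in the Python sketch below. The function name `classification_metrics` and the convention of returning 0 when a denominator is zero are assumptions made for this example.

```python
import math

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn) if tp + fn else 0.0
    spe = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report MCC as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sen": sen, "Spe": spe, "MCC": mcc}
```

For example, with TP = 40, FP = 10, TN = 40, and FN = 10, accuracy, sensitivity, and specificity are all 0.8, and MCC = (1600 − 100) / 2500 = 0.6.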
With threshold-dependent measures, increasing the number of true positives of a classifier typically comes at the expense of more false positives. Such measures are often employed to assess the performance of machine learning methods, but they make it difficult to assess the overall performance of linear B-cell epitope prediction. Receiver operating characteristic (ROC) curves characterize the performance of a classifier over all possible thresholds and are therefore threshold-independent. The area under the curve (AUC) measures the ability to discriminate linear B-cell epitopes from non-epitopes. Any classifier performing better than random will have an AUC value between 0.5 and 1.
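One way to compute the AUC without plotting the ROC curve is through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The Python sketch below uses this pairwise form; the function name `roc_auc` and the +1 label convention for epitopes are assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))
```

A classifier that assigns the same score to every sample gets 0.5 under this definition, which matches the random baseline mentioned above.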

Identification of optimal parameters
The AUC value was used to find the optimal combination of parameters. For each combination of I (2, 3, 4) and p-value cutoff (0.2, 0.4, 0.5, 0.6, 1) on the BCI727 dataset at a window size of 20, the AUC was calculated by fivefold cross-validation. The epitope and non-epitope sequence features were generated from the peptide sequences using APC at a window size of 20. Then, we removed the noise APC features by feature selection using the t-test. The dimension-reduced APC features were used for SVM training and model evaluation. The resulting AUC values are shown in Table 2. The I = 3 setting yielded greater AUC values than the I = 2 and I = 4 settings, and the differences were statistically significant (Wilcoxon test, p-values of 0.02895 and 0.03125, respectively). For I = 3, the optimal p-value cutoff was p < 0.5. Therefore, the optimal parameters for APCpred model development were I = 3 and p < 0.5.

Construction of prediction model for B-cell epitopes
We used the dataset BCI727 to evaluate the performance of APCpred. First, the epitope and non-epitope sequence features were generated from the peptide sequences using APC (I = 3). Then, the noise APC features were removed by feature selection using the t-test (p < 0.5) on the training dataset. The dimension-reduced APC features were used for SVM training and model evaluation. The performance of APCpred at different window lengths (12, 14, 16, 18, 20) on the BCI727 dataset is shown in Table 3, which indicates that the best performance was obtained at a window size of 20, with AUC = 0.748 and accuracy (Acc) = 68.43%. The ROC plot for the different window sizes is shown in Figure 3.

Assessing APCpred model building method using different datasets
In Chen's report [11], an AAP propensity scale was used in combination with a support vector machine (SVM) to construct a model that achieved an optimal accuracy of 71.09% on Chen872 using fivefold cross-validation at a window size of 20. Further, they combined the AAP scale with five amino acid propensity (AP) scales using the SVM classifier to improve the prediction accuracy, achieving an Acc of 72.54%. In Wee's report [13], the Bayesb method achieved an accuracy of only 68.50% and an AUC of 0.74 on Chen's dataset. In this study, we used fivefold cross-validation on the Chen872 dataset to compare APCpred (I = 3 and p < 0.5) with Bayesb, Chen's AAP method, and the combination method of AAP and AP (Table 4). The results showed that APCpred achieved better performance, with AUC = 0.809 and Acc = 72.94%, compared with Bayesb (Acc = 68.50%) and Chen's two methods (Acc = 71.09% and 72.54%).
We further evaluated the performance of APCpred (I = 3 and p < 0.5) with the ABC16 and Blind387 datasets. In a previous study, BCpred and AAP BCpred were compared with ABCpred [12]. BCpred and AAP BCpred were shown to outperform ABCpred in fivefold cross-validation on ABC16, but neither method improved prediction on the independent dataset Blind387. Using these publicly available benchmark datasets, we were able to compare APCpred with ABCpred, BCpred, and AAP BCpred. First, we performed fivefold cross-validation on the ABC16 dataset to compare APCpred with ABCpred, BCpred, and AAP BCpred; the results are summarized in Table 5. In terms of overall accuracy, APCpred was more accurate than ABCpred but less accurate than BCpred and AAP BCpred in fivefold cross-validation on the ABC16 dataset. In terms of AUC, however, APCpred was second only to BCpred (the AUC of ABCpred was not available). These results show that APCpred also improved the performance of fivefold cross-validation on ABC16 compared with ABCpred.
The classifier built from the ABC16 dataset was used to predict linear B-cell epitopes in an independent dataset for validation. The prediction accuracy was then used to compare APCpred (I = 3 and p < 0.5) with other current prediction methods. The four classifiers were trained with the ABC16 dataset and then tested with the independent dataset Blind387. The results are summarized in Table 6. In this case, […] [14] to test variable-length linear B-cell epitopes on the FBC934 dataset; the result is summarized in Table 7. The accuracy of APCpred was 55.09%, which is better than the 52.66% achieved by LBtope. Therefore, APCpred also improved prediction on this dataset.

Discussion
Linear B-cell epitopes are short sequences on antigenic proteins that have structural characteristics exposing them to antibodies, and they bind readily to antibodies even when detached from their source proteins. In order to have antigenic functions, epitope sequences must differ from the random sequences generated from the Swiss-Prot database. We believe that linear B-cell epitope sequences must fold into a stable structure in order to present the sequence information that antibodies bind to. We propose that amino acid anchoring pairs play important roles in stabilizing the folding of epitope structures by producing the force for folding in three-dimensional space. Thus, in this paper, we studied the roles of amino acid pairs in the prediction of linear B-cell epitopes. Since 86.7% of the epitopes in the Bcipep database have been reported to be at most 20 amino acids long [12], we dealt with the large variability in epitope length by fixing the epitope length at values ranging from 12 to 20 amino acids, following the approach of El-Manzalawy [12] and Saha [10], instead of using windows of five or seven amino acids at the center of a linear epitope as Parker [5] and Karplus [6] did. Existing methods for finding linear B-cell epitopes are far from optimal or may find only part of an epitope sequence, which may indicate that the prediction methods based on composition of
The four classifiers were trained using the ABC16 dataset and evaluated using the Blind387 dataset. "*" denotes information obtained from the online ABCpred prediction on that dataset through an automated script. Bold denotes the largest A_CC value of the prediction.