iSuc-ChiDT: a computational method for identifying succinylation sites using statistical difference table encoding and the chi-square decision table classifier

Background Lysine succinylation is a type of protein post-translational modification that is widely involved in cell differentiation, cell metabolism, and other important physiological activities. To study the molecular mechanism of succinylation in depth, succinylation sites need to be accurately identified, and because experimental approaches are costly and time-consuming, there is great demand for reliable computational methods. Feature extraction is a key step in building succinylation site prediction models, and the development of effective new features improves predictive accuracy. Because the number of false succinylation sites far exceeds that of true sites, traditional classifiers perform poorly, and designing a classifier that effectively handles highly imbalanced datasets has long been a challenge. Results A new computational method, iSuc-ChiDT, is proposed to identify succinylation sites in proteins. In iSuc-ChiDT, chi-square statistical difference table encoding is developed to extract positional features; it achieves higher predictive accuracy with fewer features than common position-based encoding schemes such as binary encoding and physicochemical property encoding. Single amino acid and undirected pair-coupled amino acid composition features are added to improve fault tolerance for residue insertions and deletions. After feature selection by the Chi-MIC-share algorithm, the chi-square decision table (ChiDT) classifier is constructed for imbalanced classification. With a training set of 4748:50,551 (true:false sites), ChiDT clearly outperforms traditional classifiers in predictive accuracy and runs quickly.
Using an independent testing set of experimentally identified succinylation sites, iSuc-ChiDT achieves a sensitivity of 70.47%, a specificity of 66.27%, a Matthews correlation coefficient of 0.205, and a global accuracy index Q9 of 0.683, showing a significant improvement in sensitivity and overall accuracy compared to PSuccE, Success, SuccinSite, and other existing succinylation site predictors. Conclusions iSuc-ChiDT shows great promise in predicting succinylation sites and is expected to facilitate further experimental investigation of protein succinylation. Supplementary Information The online version contains supplementary material available at 10.1186/s13040-022-00290-1.


Background
Protein post-translational modifications (PTMs) regulate cellular physiology and significantly increase protein diversity and complexity. Lysine succinylation is an evolutionarily conserved PTM present in both prokaryotic and eukaryotic cells where a succinyl group is covalently bonded to specific lysine residues by enzymatic or non-enzymatic processes [1,2]. Succinylation can promote remarkable changes in protein structure and function, and plays a role in many diseases, such as tuberculosis [3], allergic dermatitis [4], and inflammation [5]. Therefore, elucidating the molecular mechanism of succinylation will provide valuable information for both biomedical research and drug development.
Accurate identification of succinylation sites is critical to succinylation research. Because experimental methods are costly and time-consuming, and have been unable to keep up with the exponential growth in the number of sequenced proteins, efficient in silico methods are in great demand. To date, many predictors for identifying succinylation sites have been developed, such as SucPred [6], SuccinSite [7], pSuc-Lys [8], and PSuccE [9], but with their limited overall accuracy and poor sensitivity, numerous true succinylation sites remain undetected. In practice, the true succinylation sites are of greater interest, so it is necessary to further improve predictive accuracy, especially sensitivity. Two key components, feature extraction and classifier construction, largely determine the accuracy of a computational method.
Commonly used features include positional features [7, 9–11], sequence composition [7–11], evolutionary information [12–14], and protein secondary structure [13–15]. Positional information of amino acids is basic but important to a protein sequence. While binary encoding [7, 9] is the most intuitive method to extract positional features, the resulting feature matrix is very sparse. Moreover, the binary encodings are identical for the same residue at different positions, so binary encoding cannot reflect positional differences. Physicochemical property encoding [7, 9, 11] is another frequently used position-based amino acid encoding scheme. The AAindex [16] database records 531 physicochemical properties of the 20 standard amino acids. Since it is not known in advance which physicochemical properties are related to classification, each position must be represented by all 531 physicochemical properties, resulting in many irrelevant and redundant features.
Traditional classifiers, including the support vector machine (SVM) [6, 9–11, 13], random forest (RF) [7, 8], and decision tree [12, 15], have been applied to succinylation site prediction. The number of false succinylation sites (non-succinylated lysine residues) far exceeds that of true sites; for example, the dataset from Hasan et al. [7] contains 5004/53,524 true/false succinylation sites (a positive-to-negative ratio of about 1:10). Training any traditional classifier with such highly imbalanced datasets could strongly bias classification results [17], and the large number of training samples makes the training time of some classifiers (e.g. SVM) unbearable. To address this, some methods (e.g. SucPred, SuccinSite) balanced the class distribution by under-sampling the negative samples, but this may lose potential classification information because negative samples are discarded en masse. Other methods (e.g. pSuc-Lys, PSuccE) designed classifier ensemble algorithms; however, these still integrate the results of several individual classifiers trained on balanced subsets in which the positive samples are used repeatedly.
Based on a highly imbalanced dataset, we developed an efficient approach called iSuc-ChiDT for predicting succinylation sites. Firstly, the 2 × 20 contingency table of each position was compressed based on local chi-square tests, and then the 9 key positions and a window size of 16 residues were determined. Next, chi-square statistical difference table encoding was used to characterize the 9 key positions, and amino acid composition (AAC) and undirected pair-coupled amino acid composition (undirected-PCAAC) features were incorporated. After applying the Chi-MIC-share [18] algorithm for feature selection, the ChiDT classifier was finally designed to achieve imbalanced classification. The flow chart of our method is shown in Fig. 1.

Datasets
From the UniProtKB/Swiss-Prot [19] database and the NCBI protein sequence database [20], Ning et al. [9] obtained 2322 succinylated proteins with 5009 experimentally verified lysine succinylation sites by applying a 30% homology-reducing screening procedure with CD-HIT [21]. Then 124 succinylated proteins were randomly selected to build an independent testing set, and the remaining 2198 succinylated proteins were used as a training set. In this study, we used the same training and independent testing datasets as Ning et al., which are freely available via the web link [22]. Our training set, Tr_data, contains 4748/50,551 true/false succinylation sites; our testing set, Te_data, contains 254/2977 true/false succinylation sites.
Each true/false succinylation site was represented by a sequence fragment with an initial length of 51 amino acid residues, where the candidate site (lysine residue) was at the central position 0, the upstream positions were successively labeled − 1, − 2, …, − 25, and the downstream positions 1, 2, …, 25. If the number of up- or downstream residues of the candidate site was less than 25, amino acids were created through mirror extension to make up the difference [8]. For example, the original sequence of the succinylated protein "SP-P0ABS8" is "MLKNLAKLDQTEMDKVNVDLAAAGVAFKE …". The first lysine (K) is a candidate site, and its sequence fragment is therefore generated by mirror extension.

Compression of 2 × 20 contingency tables based on local chi-square tests
The maximal information coefficient (MIC) [23] is a measure proposed to capture dependences between paired variables. The MIC score ranges from 0 to 1 and approaches 0 only if the two variables are statistically independent. To calculate the MIC score of paired variables x and y, the ApproxMaxMI [23] algorithm imposes the restriction n_x × n_y < B(n), where B(n) = n^0.6 is the maximal grid size, n is the sample size, and n_x and n_y are the numbers of partition bins on x and y, respectively. The MIC score for two independent variables calculated by ApproxMaxMI depends on the ratio between B(n) and n [24], and it is close to 0 only when n approaches infinity. For two independent variables under finite samples (especially small sample sizes), ApproxMaxMI produces a large deviation between the calculated MIC score and 0, meaning that MIC will capture false associations. To address this drawback, Chen et al. [25] proposed an improved algorithm, ChiMIC [25], which uses a local chi-square test to determine the optimal bin size for calculating the MIC score. For two independent variables with 100 sample points, ApproxMaxMI tends to reach the maximal grid size (100^0.6 ≈ 16), with a corresponding 2 × 8 grid partition and a MIC score of 0.24.
With ChiMIC, the MIC score is only 0.06, and the corresponding grid partition is a 2 × 2 or 2 × 3 grid. This shows that the grid partition found by ChiMIC is more reasonable and that compressing a 2 × 8 grid into a 2 × 2 or 2 × 3 grid is sensible.
Similarly, for each position in succinylation site-containing sequences, we can construct a 2 × 20 contingency table by counting the occurrence frequencies of the 20 standard amino acids in the positive and negative samples, respectively. For instance, Fig. 2 gives the 2 × 20 table of position − 10 in Tr_data. What we need to investigate is whether the 2 × 20 table (a 2 × 20 grid) is reasonable, or whether it could be compressed into a 2 × 10, or even a 2 × 2, table. A similar attempt was made in donor splice site prediction. For each position in donor site-containing sequences, a 2 × 4 contingency table can be built by counting the frequencies of the 4 bases in the positive and negative samples. Following ChiMIC, Zeng et al. [26] compressed the 2 × 4 table of each position into a 2 × l (2 ≤ l ≤ 4) table using local chi-square tests, and developed a high-performance approach to predict donor splice sites based on this compression strategy.
Encouraged by the successful application of the compression strategy on nucleotide sequences, we applied it to protein sequences. For the 2 × 20 contingency table for each position in succinylation site-containing sequences, the compression procedure is described below.
Step 1: Set the initial value of r (r is an integer) to 20.
Step 2: Compress the 2 × r contingency table by merging two columns corresponding to two different residues. Each possible merge yields a 2 × (r − 1) contingency table; among these, select the 2 × (r − 1) table with the maximum chi-square value, denoted max2×(r−1).
Step 3: Construct a local 2 × 2 contingency table from the two columns merged in max2×(r−1) and perform a chi-square test on it. If the p-value is lower than a given threshold, max2×(r−1) is unreasonable; backtrack to the 2 × r contingency table and terminate the compression procedure. If the p-value is greater than the threshold, max2×(r−1) is reasonable, and a further compression of max2×(r−1) is attempted as follows: 1) set r = r − 1; 2) if r ≥ 3, repeat Steps 2~3; otherwise, terminate compression.
Taking position − 10 in Tr_data as an example, its 2 × 20 contingency table was finally compressed into a 2 × 3 table (Fig. 2). The 20 original status values of position − 10 were therefore reduced to 3 status values, i.e., "ARDGTV", "NQEHILKM", and "CFPSWY", where "ARDGTV" indicates that A, R, D, G, T, and V at position − 10 are regarded as the same status value, and similarly for the others.
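The three steps above can be sketched in Python. This is an illustrative sketch, not the authors' implementation: the function names (`compress_table`, `chi_square`, `merge_cols`) and the 0.05 significance threshold are our own assumptions, and the 2 × 2 local test p-value is obtained from the chi-square survival function with one degree of freedom, p = erfc(√(χ²/2)).

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a 2 x r table of counts."""
    rows = [sum(row) for row in table]
    cols = [table[0][j] + table[1][j] for j in range(len(table[0]))]
    n = float(sum(rows))
    stat = 0.0
    for i in (0, 1):
        for j in range(len(cols)):
            expected = rows[i] * cols[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def merge_cols(table, a, b):
    """Copy of a 2 x r table with columns a and b (a < b) merged."""
    out = []
    for row in table:
        new_row = [v for j, v in enumerate(row) if j not in (a, b)]
        new_row.insert(a, row[a] + row[b])
        out.append(new_row)
    return out

def compress_table(table, labels, p_threshold=0.05):
    """Greedy local-chi-square compression of a 2 x r table (Steps 1-3)."""
    table = [list(table[0]), list(table[1])]
    labels = list(labels)
    while len(labels) >= 3:
        r = len(labels)
        pairs = [(a, b) for a in range(r) for b in range(a + 1, r)]
        # Step 2: pick the merge whose 2 x (r-1) table has maximal chi-square.
        a, b = max(pairs, key=lambda p: chi_square(merge_cols(table, *p)))
        # Step 3: local 2 x 2 chi-square test on just the two merged columns;
        # with 1 degree of freedom, p = erfc(sqrt(chi2 / 2)).
        local = [[table[0][a], table[0][b]], [table[1][a], table[1][b]]]
        p_local = math.erfc(math.sqrt(chi_square(local) / 2.0))
        if p_local < p_threshold:
            break  # merge unreasonable: backtrack and stop compressing
        merged_label = labels[a] + labels[b]
        table = merge_cols(table, a, b)
        labels = [lab for j, lab in enumerate(labels) if j not in (a, b)]
        labels.insert(a, merged_label)
    return table, labels
```

On a toy 2 × 3 table such as [[10, 12, 50], [11, 10, 20]], the two columns with similar positive/negative ratios are merged and the result is a 2 × 2 table, mirroring how "ARDGTV" and the other residue groups arise.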

Key positions selection and window size determination
For each position in the sequences of 51 residues, a 2 × r (2 ≤ r ≤ 20) contingency table can be obtained after compression based on the training set. A chi-square test was then performed on each compressed 2 × r contingency table and the corresponding chi-square value was calculated. Higher chi-square values indicate that the corresponding positions discriminate more strongly between true and false sites. On this basis, the 9 key positions (− 8, − 4, − 3, − 2, − 1, 1, 2, 5, and 7) were selected, and a window size of 16 residues (− 8 ~ + 7) was determined.

Positional feature extraction
A new position-based amino acid encoding scheme, chi-square statistical difference table encoding, was developed for position characterization. For the 9 key positions in each sequence sample, we extracted 9 positional features based on chi-square statistical difference table encoding, denoted as P − 8 , P − 4 , P − 3 , P − 2 , P − 1 , P 1 , P 2 , P 5 and P 7 respectively, where, P − 8 represents the positional feature of position − 8, P − 4 represents the positional feature of position − 4, and so forth.
In the training set, the occurrence frequencies of the 20 standard amino acids were counted at the i th (i = 1, 2, …, 9) position in the positive and negative samples, and then a 2 × 20 contingency table was built (Table 1).
In Table 1, f⁺_i,j represents the frequency of the j-th (j = 1, 2, …, 20) residue at the i-th position in the positive samples, f⁻_i,j represents the corresponding frequency in the negative samples, f⁺_i and f⁻_i represent the total numbers of positive and negative samples, and N represents the total number of samples. The chi-square value corresponding to the i-th position is calculated by:

χ²_i = Σ_{j=1..20} [ (f⁺_i,j − e⁺_i,j)² / e⁺_i,j + (f⁻_i,j − e⁻_i,j)² / e⁻_i,j ],  (1)

where e⁺_i,j = f⁺_i × (f⁺_i,j + f⁻_i,j) / N and e⁻_i,j = f⁻_i × (f⁺_i,j + f⁻_i,j) / N are the expected frequencies.

If a new training sample is added and the j-th residue appears at the i-th position, first assume the sample is positive, replace f⁺_i,j with f⁺_i,j + 1, and calculate a chi-square value χ²⁺_i,j using formula (1); then assume the sample is negative, replace f⁻_i,j with f⁻_i,j + 1, and calculate a chi-square value χ²⁻_i,j using formula (1). The score in the chi-square statistical difference table for the j-th residue at the i-th position is defined as:

Δχ²_i,j = χ²⁺_i,j − χ²⁻_i,j.  (2)

Next, a 20 × 9 chi-square statistical difference table is built (Table 2), giving the scores of the 20 amino acid residues at each key position. If the j-th residue appears at the i-th position of a sample, the i-th positional feature is assigned the value Δχ²_i,j. Table S1 (Additional file 1) shows the 20 × 9 chi-square statistical difference table constructed from the 9 key positions in Tr_data.
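The difference-table scoring for one position can be sketched as follows. This is our own illustrative sketch: the Δχ² score is computed as the difference between the "assume-positive" and "assume-negative" chi-square values described above, and the function names and the alphabet ordering are assumptions.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # 20 standard residues

def chi_square_2xk(pos_counts, neg_counts):
    """Pearson chi-square for a 2 x k table given as two count rows."""
    n = float(sum(pos_counts) + sum(neg_counts))
    row_pos, row_neg = sum(pos_counts), sum(neg_counts)
    stat = 0.0
    for p, q in zip(pos_counts, neg_counts):
        col = p + q
        for obs, row_total in ((p, row_pos), (q, row_neg)):
            expected = row_total * col / n
            if expected > 0:
                stat += (obs - expected) ** 2 / expected
    return stat

def difference_scores(pos_counts, neg_counts):
    """Delta-chi-square score of each residue at one position.

    pos_counts / neg_counts: per-residue occurrence counts at position i
    in the positive and negative training samples (length-20 lists,
    ordered as AMINO_ACIDS).
    """
    scores = {}
    for j, aa in enumerate(AMINO_ACIDS):
        plus = list(pos_counts)
        plus[j] += 1                      # assume the new sample is positive
        minus = list(neg_counts)
        minus[j] += 1                     # assume the new sample is negative
        chi_plus = chi_square_2xk(plus, neg_counts)
        chi_minus = chi_square_2xk(pos_counts, minus)
        scores[aa] = chi_plus - chi_minus
    return scores
```

A residue enriched among positive samples should receive a positive score (its appearance strengthens the "positive" hypothesis), while a residue enriched among negatives should score negative.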

Compositional feature extraction
Although positional features can distinguish highly similar positive and negative samples, they lack fault tolerance when there are residue insertions or deletions in protein sequences. Compositional features can capture context correlation while reflecting sequence composition, and they are more fault-tolerant. For example, consider a mutated sequence segment produced by inserting residue E at position 4 of an original segment: the positional features change considerably after the insertion, but the sequence composition changes little. Therefore, we supplement compositional features in the hope of improving the algorithm's robustness to residue insertions and deletions.
For each sequence sample with a window size of 16 residues, 230 compositional features were extracted, including 20 AAC features and 210 undirected-PCAAC features.  The AAC features are defined as the occurrence frequencies of the 20 standard amino acids in the sequence, respectively denoted as f A , f R , …, f V , where, f A represents the frequency of alanine (A), f R represents the frequency of arginine (R), and so forth.
The individual amino acid components are independent of each other, so the AAC features cannot reflect any correlation between amino acids. The pair-coupled amino acid composition [27] (PCAAC) features are the occurrence frequencies of pairwise couplings between two adjacent residues, which reflect both sequence composition and the most elementary association effect. To reduce the feature dimension and alleviate the sparsity of the feature matrix, we assume that pairwise coupling has no direction. For example, A-R coupling is treated the same as R-A coupling, and the corresponding pair-coupled component is expressed by either f_AR or f_RA, where f_AR (f_RA) is the sum of the AR-pair and RA-pair occurrence frequencies found in a sequence.
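A minimal sketch of extracting the 20 AAC and 210 undirected-PCAAC features from one window follows. Normalizing AAC by window length and pair counts by the number of adjacent pairs (n − 1) is our assumption for illustration; the paper does not spell out the normalization.

```python
from collections import Counter
from itertools import combinations_with_replacement

# Alphabetical order so that sorted residue pairs match the generated keys.
AMINO_ACIDS = sorted("ARNDCQEGHILKMFPSTWYV")

def composition_features(seq):
    """20 AAC + 210 undirected-PCAAC frequencies for one sequence window."""
    n = len(seq)
    # AAC: occurrence frequency of each standard residue.
    aac = {aa: seq.count(aa) / n for aa in AMINO_ACIDS}
    # Undirected pairs: sort each adjacent pair so AR and RA coincide.
    adjacent = Counter(tuple(sorted(p)) for p in zip(seq, seq[1:]))
    # 210 = C(20, 2) + 20 unordered pairs, including identical pairs (AA, ...).
    pcaac = {a + b: adjacent.get((a, b), 0) / (n - 1)
             for a, b in combinations_with_replacement(AMINO_ACIDS, 2)}
    return aac, pcaac
```

For the toy window "ARKA", the adjacent pairs AR, RK, and KA collapse to the undirected keys AR, KR, and AK, each with frequency 1/3, while all other pair features are 0.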

Feature selection based on chi-MIC-share
To eliminate irrelevant and redundant features from the original feature set and reduce the number of features, we performed feature selection. Minimum redundancy maximum relevance (mRMR) [28] is a popular feature selection method. However, the relevance and redundancy measures in mRMR are not directly comparable, mRMR only gives the order of feature introduction, and performing cross-validation on training sets to obtain the optimal feature subset is time-consuming. To address this, Li et al. [18] used ChiMIC as a unified measure of relevance and redundancy and designed a redundancy-sharing strategy, yielding a novel feature selection method, Chi-MIC-share. Here, we applied Chi-MIC-share for feature selection.
Given an original feature set Ω = {X 1 , X 2 , …, X i , …, X n }, |Ω| is the number of elements in Ω, and |Ω| = n. If the introduced feature set is represented by S, the complement of S is represented as Ω S = Ω-S. Denoting the response variable as Y, the Chi-MIC-share algorithm is described as follows.
For an introduced feature X_i in S, a score after redundancy sharing is calculated, and the scores of all features in S are summed to give the total Chi-MIC-share score. If the next candidate feature is X_next, set E = S + {X_next}, so |E| = |S| + 1; the next optimal feature is the candidate that maximizes the total score of E. If a newly introduced feature no longer increases the total Chi-MIC-share score, it is discarded and feature selection terminates automatically. In addition, feature introduction can be forced to terminate earlier according to a stopping criterion.

Classifier construction
To efficiently achieve imbalanced classification, a classifier called ChiDT is designed as follows.
Compress the 2 × m contingency table of each retained feature
Based on the training set, the 2 × m contingency table of each Chi-MIC-share retained feature is compressed into a 2 × r (2 ≤ r ≤ m) table using local chi-square tests, as described above.

Introduce the retained features one by one
Supposing the proportion of the k-th class of samples in sample set D is p_k (k = 1, 2), the information entropy of D is defined as:

H(D) = − Σ_{k=1..2} p_k log2 p_k.

Given a Chi-MIC-share retained feature X_i with r new status values {s_1, s_2, …, s_j, …, s_r} after compression, the information gain that X_i brings for D is:

Gain(D, X_i) = H(D) − Σ_{j=1..r} (|D_j| / |D|) H(D_j),

where D_j represents the samples in D whose X_i takes the status value s_j (1 ≤ j ≤ r), and H(D_j) is the information entropy of D_j. From the features whose information gains are above the average, the one with the highest gain ratio is picked as the first introduced feature. Here, the gain ratio of X_i is defined as:

Gain_ratio(D, X_i) = Gain(D, X_i) / IV(X_i),

where

IV(X_i) = − Σ_{j=1..r} (|D_j| / |D|) log2(|D_j| / |D|)

is the intrinsic value of X_i. Next, the remaining features are introduced one by one with the following steps.
Step 1: Under the conditions in which the introduced features have existed, the 2 × r contingency table of each remaining feature is further compressed. If the r columns of the 2 × r contingency table are compressed into one column, the remaining feature cannot be introduced; if the r columns are not compressed into one column, the remaining feature will be considered as a candidate feature to be introduced.
Step 2: Calculate the information gain of every candidate feature. From the candidate features whose information gains are above the average, the one with the highest gain ratio is selected to be the next introduced feature.
Step 3: Repeat Step 1~2 until no further features can be introduced. After this, the introduced features with their status values generate various rules. Taking Tr_data as an example, 10 Chi-MIC-share retained features were finally introduced and 137 rules were generated (see Additional file 2: Table S2).
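The selection criterion used in Steps 1~2 (above-average information gain, then highest gain ratio) is the standard C4.5 heuristic, and can be sketched as follows; function names and the dict-based representation of discrete features are our own illustrative choices.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_and_ratio(feature_values, labels):
    """Information gain and gain ratio of one discrete feature."""
    n = len(labels)
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    gain = entropy(labels) - sum(len(g) / n * entropy(g)
                                 for g in groups.values())
    # Intrinsic value IV(X): entropy of the feature's own partition.
    iv = -sum((len(g) / n) * math.log2(len(g) / n) for g in groups.values())
    return gain, (gain / iv if iv > 0 else 0.0)

def pick_next_feature(candidates, labels):
    """From candidates with above-average gain, pick the best gain ratio."""
    stats = {name: info_gain_and_ratio(values, labels)
             for name, values in candidates.items()}
    avg_gain = sum(g for g, _ in stats.values()) / len(stats)
    eligible = [name for name, (g, _) in stats.items() if g >= avg_gain]
    return max(eligible, key=lambda name: stats[name][1])
```

A feature whose status values perfectly separate the classes has gain ratio 1 and is picked over an uninformative one.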

Construct a balanced decision table for decision-making
We counted the number of positive and negative training samples conforming to each rule then constructed a 2 × 137 imbalanced decision table (Table 3).
The number of negative samples far exceeds that of positives. To resolve the imbalanced classification problem, based on cost-sensitive learning [29], we adjust the decision weight of the negative samples by multiplying the number of negative samples in each column of Table 3 by θ, where θ is the ratio of the total number of positive training samples to that of negative training samples; here, θ = 4748/50,551. A 2 × 137 balanced decision table is thereby obtained (Table 4).
The balanced decision table (Table 4) is then used for decision-making. Suppose that a testing sample meets the rule "(P − 1 = − 2.028) ∧ (− 0.907 ≤ P 2 ≤ 0.501) ∧ (− 0.715 ≤ P − 8 ≤ 0.066)". First, we assume that it is positive and replace 23 with 23 + 1, then calculate the corresponding chi-square value χ²⁺. We then assume that it is negative and replace 273.04 with 273.04 + 1, then calculate the corresponding chi-square value χ²⁻. If χ²⁺ > χ²⁻, the testing sample is predicted to be positive; otherwise, it is negative.
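The weighting and decision steps can be sketched as follows. This is our own sketch under the description above: `balance_table` applies the θ correction, and `classify` compares the two "what-if" chi-square values for the matched rule column; the function names and toy table are assumptions.

```python
def chi_square(table):
    """Pearson chi-square statistic for a 2 x R (possibly weighted) table."""
    rows = [sum(row) for row in table]
    cols = [table[0][j] + table[1][j] for j in range(len(table[0]))]
    n = float(sum(rows))
    stat = 0.0
    for i in (0, 1):
        for j in range(len(cols)):
            expected = rows[i] * cols[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat

def balance_table(pos_counts, neg_counts):
    """Scale the negative row by theta = total positives / total negatives."""
    theta = sum(pos_counts) / sum(neg_counts)
    return [list(pos_counts), [c * theta for c in neg_counts]]

def classify(balanced_table, rule_index):
    """ChiDT decision for a sample matching the rule at rule_index:
    assume positive, then negative, and keep whichever assumption
    yields the larger chi-square value for the whole table."""
    as_pos = [list(balanced_table[0]), list(balanced_table[1])]
    as_pos[0][rule_index] += 1
    as_neg = [list(balanced_table[0]), list(balanced_table[1])]
    as_neg[1][rule_index] += 1
    return "positive" if chi_square(as_pos) > chi_square(as_neg) else "negative"
```

Intuitively, adding the sample to the row that already dominates its rule column strengthens the table's overall association, so a positive-dominated column yields a "positive" call.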

Performance evaluation
Sensitivity (SN), specificity (SP), and the Matthews correlation coefficient (MCC), the common indexes for evaluating binary classification, are defined as follows:

SN = TP / (TP + FN),
SP = TN / (TN + FP),
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives. MCC is a balanced statistical index that considers both SN and SP, but it is sensitive to the class distribution of the testing set. As shown in Table 5, when a prediction model has an SN of 93% and an SP of 95%, the MCC value declines as the imbalance of the testing set grows. Thus a low MCC value does not always indicate poor prediction performance; it may be caused by a highly imbalanced testing set.
The content-balancing accuracy index Q9 [30] is independent of the class distribution of the dataset and has been widely used to evaluate the performance of many prediction programs, including gene finding, splice site prediction, and protein secondary structure prediction [31–33]. As shown in Table 5, the value of Q9 remains unchanged across different ratios of positives to negatives. In this study, we adopted Q9 as the measure of global accuracy to evaluate the prediction performance of models on an imbalanced testing set. Q9 is defined as:

Q9 = (1 + q9) / 2,

where, for TP + FN ≠ 0 and TN + FP ≠ 0,

q9 = 1 − √2 × √((FN / (TP + FN))² + (FP / (TN + FP))²).

The value of Q9 ranges from 0 to 1, and the larger the Q9 value, the better the prediction performance.
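The four indexes can be computed together as below. The function name is our own; the Q9 expression uses the general case above (both TP + FN and TN + FP nonzero), and it reproduces the paper's reported values, e.g. SN = 70.47% and SP = 66.27% give Q9 ≈ 0.683.

```python
import math

def evaluation_metrics(tp, fp, tn, fn):
    """SN, SP, MCC, and the content-balancing Q9 for a binary confusion matrix."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Q9 = (1 + q9) / 2, general case with both class totals nonzero.
    q9 = 1 - math.sqrt(2) * math.sqrt((fn / (tp + fn)) ** 2
                                      + (fp / (tn + fp)) ** 2)
    return sn, sp, mcc, (1 + q9) / 2
```

Note that Q9 depends only on the two error *rates*, not on how many positives and negatives the testing set contains, which is why it is stable under class imbalance while MCC is not.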

Results and discussion
Features retained by chi-MIC-share
Based on Tr_data, Chi-MIC-share feature selection was performed on the 239 original input features (9 positional features and 230 compositional features). As shown in Fig. 4, when the 37th feature was introduced, the Chi-MIC-share score peaked (0.12544), after which it began to decline and feature selection was automatically terminated. To improve computational efficiency, the forced termination criterion was adopted and 10 features were retained (see the red line in Fig. 4). Table 6 describes the retained features in detail. Positional features account for 80% of the retained features, indicating that they make an important contribution to succinylation site prediction.
No. denotes the order of feature introduction.

Comparison of different classifiers
Based on the same input features (the 10 features retained by Chi-MIC-share), the ChiDT classifier was compared with traditional classifiers including RF, the artificial neural network (ANN), and the relaxed variable kernel density estimator (RVKDE) [34]. We chose RVKDE for comparison because it delivers the same level of accuracy as SVM when the number of training samples exceeds 10,000, with a significantly lower average time complexity of O(n log n) [34], where n denotes the number of training samples. The RF and ANN classifiers were built with Weka 3.8.1 and the neural network toolbox in Matlab R2015a, respectively, with all parameters at their default values. Independent tests based on Tr_data and Te_data were performed for comparison (Table 7). The results show that: 1) ChiDT achieved a significantly higher predictive accuracy and effectively realized imbalanced classification. When the training set was imbalanced (4748 positives/50,551 negatives), the predictions of RF, ANN, and RVKDE were biased toward negative samples, resulting in poor sensitivities (SNs < 10%). With ANN, although specificity reached 99.9%, sensitivity was 0, meaning that all positive samples were predicted to be negative; thus the global accuracy of ANN was the lowest (Q9 = 0.293). ChiDT built a balanced decision table through a weighted correction strategy to perform imbalanced classification and obtained the highest accuracy (Q9 = 0.683). 2) ChiDT has a satisfactory calculation speed and can be applied to large sample sets. All simulations were run on an Intel Core i5-3320M 2.6 GHz/8 GB RAM system; the elapsed times of ChiDT and RVKDE were 17 s and 18 min, respectively. ChiDT's speed stems from the absence of parameter optimization.

Comparison of different position-based encoding schemes
For the 9 key positions in Tr_data (here using 4748 positive samples and 4748 negative samples), we used binary encoding, physicochemical property encoding (with either 531 [7] or 10 [9] physicochemical properties), and chi-square statistical difference table encoding to extract positional features, and then employed the ChiDT classifier for prediction. The results of 5-fold cross-validation showed that chi-square statistical difference table encoding achieved the highest predictive accuracy with the fewest features (Table 8). In binary encoding, each position is represented by 20 0/1-features, so the corresponding feature matrix is very sparse. Moreover, the encodings of the same residue at different positions are identical, which does not reflect positional differences, and for different residues at the same position, binary encoding does not reflect the degree of difference between residues. For example, the amino acid polarity indexes of residues S, T, and R are 1.67, 1.66, and 52, respectively, indicating that the polarity difference between S and T is small while that between T and R is large; yet the Hamming distances of both S-T and S-R are equal to 2 under binary encoding. As for physicochemical property encoding, when all 531 amino acid indices in AAindex are used for sequence characterization, the number of features reaches 531 × 9 = 4779 (Table 8), introducing many irrelevant and redundant features. Ning et al. [9] ranked the 531 physicochemical properties according to their ability to distinguish between true and false succinylation sites and chose the top 10 for sequence encoding, greatly reducing the feature dimension.
However, as shown in Table 8, no matter whether 531 or 10 physicochemical properties are used, the predictive accuracy is always lower than that of chi-square statistical difference table encoding. Chi-square statistical difference table encoding reflects the difference of the same residue at different positions, as well as the degree of difference between different residues at the same position, thus, it could differentiate between the highly similar positive and negative samples. Another benefit of chi-square statistical difference table encoding is that it has a low feature dimension, low redundancy, and a non-sparse feature matrix.

Comparison of different window sizes
Based on Tr_data and Te_data, independent tests were performed to compare the prediction performance of the determined window size (− 8 ~ + 7) with longer and shorter window sizes. The results (Table 9) show that the proposed model with a window size of 16 residues (− 8 ~ + 7) achieves higher independent test accuracy than the other window sizes. This indicates that an overly long window may introduce irrelevant information, while an overly short window may collect insufficient information, both of which reduce predictive accuracy, and it confirms that our window size determination is reliable.

Necessity of chi-MIC-share feature selection
Based on Tr_data and Te_data, the independent test results with and without Chi-MIC-share feature selection are shown in Table 10. They show that feature selection based on Chi-MIC-share can: 1) improve predictive accuracy, with the Q9 value rising from 0.663 to 0.683; and 2) reduce the feature dimension and save computational time. After feature selection, the number of original input features was reduced from 239 to 10, and the elapsed time of ChiDT was reduced by 95%. Therefore, Chi-MIC-share feature selection is both necessary and beneficial.

Comparison with existing methods
To further evaluate the performance of our method (iSuc-ChiDT), we compared it with existing succinylation site predictors, SucPred, iSuc-PseAAC [10], SuccFind [11], SuccinSite, iSuc-PseOpt [35], pSuc-Lys, Success [13], and PSuccE, using the same independent testing set (Te_data). The results show that iSuc-ChiDT had superior overall accuracy (Q9 = 0.683) and sensitivity (70.47% vs. 12.20%~37.50%) (Table 11). Positional information of amino acids is valuable for succinylation site prediction. Most compared methods used binary encoding or physicochemical property encoding to extract positional features; iSuc-ChiDT used chi-square statistical difference table encoding, which our experiments showed to be superior to both (see Table 8). Moreover, iSuc-ChiDT combined positional and compositional features to characterize samples. In independent tests based on Tr_data and Te_data, the MCC values of models based on the 9 positional features, the 230 compositional features, and the 239 combined features were 0.167, 0.099, and 0.182, respectively, confirming that feature fusion improved predictive accuracy. The ChiDT classifier also outperformed traditional classifiers on imbalanced datasets (see Table 7), further supporting iSuc-ChiDT's better prediction performance.

Conclusion
Accurate and rapid prediction of succinylation sites helps researchers to understand the molecular mechanism of succinylation. In this study, we proposed a novel method, iSuc-ChiDT, to identify succinylation sites by incorporating chi-square statistical difference table encoding and the ChiDT classifier. Chi-square statistical difference table encoding is superior to binary encoding and physicochemical property encoding in terms of predictive accuracy and feature dimension. The ChiDT classifier achieves efficient prediction on a highly imbalanced dataset. iSuc-ChiDT greatly improved sensitivity and overall accuracy compared with previous predictors, and it will serve as a useful complementary tool for detecting potential succinylation sites in proteins. In future studies, we aim to explore more informative features (e.g. evolutionary information, structural information) for characterizing succinylation sites, in pursuit of better prediction performance.