Feature selection for gene prediction in metagenomic fragments

Background Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms to distinguish between coding and non-coding sequences. Results In this study, we apply a filter method to select relevant features from a large set of known features instead of combining them using linear classifiers or ignoring their individual coding potential. We use minimum redundancy maximum relevance (mRMR) to select the most relevant features. Support vector machines (SVM) are trained using these features, and the classification score is transformed into the posterior probability of the coding class. A greedy algorithm uses the probability of overlapped candidate genes to select the final genes. Instead of using one model for all sequences, we train an ensemble of SVM models on mutually exclusive datasets based on GC content and use the appropriated model to classify candidate genes based on their read’s GC content. Conclusion Our proposed algorithm achieves an improvement over some existing algorithms. mRMR produces promising results in gene prediction. It improves classification performance and feature interpretation. Our research serves as a basis for future studies on feature selection for gene prediction.

studies identified genes through reliable experiments on living cells and organisms. However, it is usually an expensive and time-consuming task [8]. Computational approaches are the most commonly used method for finding genes as they have proven their effectiveness in identifying genes in both genomes and metagenomes at a fraction of the cost and time of experimental approaches. Computational approaches are divided into similarity-based and content-based approaches [9,10]. Similarity-based methods identify genes by searching for similar existing sequences. Basic Local Alignment Search Tool (BLAST) [11] is used to search for similarities between a candidate gene and existing known genes. However, this approach is expensive and cannot be used to discover novel genes or species. Content-based methods try to overcome these limitations using statistical approaches to detect variations between coding and non-coding regions [1,8]. While, these approaches are very successful in genomic sequences, there is still work to be done for metagenomics due to the nature of the data [6,12]. The greatest challenges for gene prediction algorithms in metagenomics are the short read-length and the incomplete and fragmented nature of the data [1,13].
Machine learning is widely and successfully used in metagenomics analysis [14]. Various methods for predicting genes based on machine learning algorithms were developed. Orphelia [15], Metagenomic Gene Caller (MGC) [16], and MetaGUN [17] are examples of such tools. Orphelia is a web-based program designed to predict genes in short DNA sequences that have unknown phylogenetic origins. First, Orphelia extracts all open reading frames (ORFs) from input DNA sequences. Then, all ORFs are scored using the Orphelia gene prediction model, which consists of a two-stage machine learning approach. In the first stage, some features from the ORF are extracted using monocodon usage, dicodon usage and translation initiation sites [15]. Then, linear discriminants are employed to reduce the feature space. In the second stage, a neural network is used to combine the features from the previous step with the ORF length and GC content of the read. The neural network approximates the probability that an ORF encodes a protein.
Finally, a greedy method is used to select the most likely genes from the ORFs that overlap by at most 60 bases [15]. MGC [16] is an improvement over Orphelia. The MGC algorithm uses the same two-stage machine learning approach but creates separate classification models based on several pre-defined GC-content ranges. It uses the appropriate model to predict genes in a fragment based on its GC content. Moreover, MGC uses two new features based on amino-acid usage in order to improve the overall gene prediction accuracy [16]. The MGC method shows that the use of separate learning models instead of a single model improves gene prediction performance. Both Orphelia [15] and MGC [16] use a linear discriminant classifier as a feature selection method that combines a large number of features to produce new features.
Feature selection can be considered as a preprocessing technique that aims to improve the performance of the classification, reduce training and build time and help to understand the domain [18][19][20]. Feature selection methods can be classified as wrapper, filter, embedded and hybrid methods according to the way that learning algorithms select features [20,21]. Wrapping methods use supervised learning approaches to validate feature sets. Therefore, wrapper methods are computationally expensive and do not scale well to high-dimensional data [20,22,23]. In addition, search overhead, overfitting and low generality are other disadvantages of wrapper methods [20]. Filter methods use general characteristics of the dataset without the involvement of supervised learning algorithms [20,23,24]. Filter methods have more generality, require less computation and scale well to high-dimensional data [19,20]. Hybrid methods combine filter and wrapper methods. For example, filter methods are used to select a specific number of features. Then, wrapper methods are applied to choose the final best features [23]. Filter methods is more suitable in our problem, because there are large number of features.
In this paper, we introduce a content-based approach that uses machine learning techniques to predict genes in metagenomic samples. We introduce a new method that use recent feature selection technique mRMR instead of combining features from single source into a new feature.

Dataset
To evaluate the effectiveness of our approach, we use two datasets, including one for training and the other for testing the models. The training data were used by Orphelia [15] and the MGC [16]. The training data contain 7 million ORFs in 700 bp fragments that were excised from 131 fully sequenced prokaryotic genomes (bacterial and archaeal) [15] and their gene annotations that obtained from GenBank [25]. The 700 bp fragments are randomly excised to create 1-fold genome coverage from each training genome and 5-fold coverage for each genome in the testing dataset. Previous research prove the effectiveness of the use of separate learning model for several pre-defined GC ranges in increasing the prediction accuracy [16]. The training dataset was split into 10 mutually exclusive parts based on pre-defined GC ranges as shown in Table 1. Each GC range contains around 700,000 sequences (ORFs). We used 100,000 sequences for feature selection and 600,000 sequences were used to build the classification model. The testing data contain three archaeal genomes and eight bacterial genomes. Table 2 lists the genomes that were used in testing with the number of ORFs in each genome. ORFs are extracted from fragments and classified into coding and non-coding based on annotations in the genomes. We refer to both complete and incomplete ORFs simply as ORFs. Complete ORFs are sequences that contain both start and stop codons, while incomplete ORFs are either missing upstream codons, downstream codons or both [26].

The proposed method
Our proposed method consists of four phases: feature extraction, feature selection, training, classification and a post processing phase, as shown in Fig. 1. First, we extract a large

Features extraction
In order to distinguish coding from non-coding sequences, we extract commonly used features in gene prediction: mainly codon and amino acid usages [7,15,16]. In addition to combining these usages into small set of features, gene finders also use features related to the translation initiation sites (TIS) such as the position weight matrices (PWM) around candidate sites. However, since not all our candidate ORFs are complete, we will In addition to features based on usage, we also consider the following three features: ORF length ratios: The ratio between the length of the candidate ORF and its read length. Since we have complete and incomplete ORFs, we compute two features (complete length ratio and incomplete length ratio). If the candidate ORF is complete, then its incomplete length feature is set to zero and vice versa.
GC content: The percentage of cytosine and guanine in the read is assigned a feature for all candidate ORFs extracted from the particular read. Usually, coding regions have a higher GC content than non-coding regions [28].

Feature selection
Our method uses minimum redundancy maximum relevance (mRMR) [29] as it scales well to high-dimensional data and has promising results under different applications [29,30]. In 2005, Peng et al. proposed the mRMR filter-method which aims to select features that are maximally dissimilar to each other but as similar as possible to the classification variable [30]. Since our data are continuous, we use the F-test as a relevance measure and the Pearson correlation among variables as a redundancy measure. The maximum relevance of feature set S for class c is defined by the average value of all F-test values between the individual feature i and the class c as follows: The minimum redundancy of all features in feature set S is defined by the average value of all Pearson correlations between the feature i and the feature j as follows: where c(i, j) represents the Pearson correlation coefficient. We use an F-test with a correlation quotient (FCQ) as the mRMR optimization condition that combines the two above criteria of maximal relevance and minimal redundancy as follows: We explore different mRMR feature sizes and compute k-nearest neighbor classification error rates for each feature set. Table 3 shows the error rate for different number of features. The results are based on sequences with GC content between 50.40 and 55.90. We select the top 500 features for the next phase of our algorithm since the classification gain from 300 to 500 features is not significant (less than .1%). For each GC range, we repeated the same experiment and selected the best 500 features to train SVM models.

Training models
We train support vector machines to produce the posterior probabilities of the coding class. The posterior probability P(class/input) is the probability of a class given evidence input. This technique was first introduced by Platt in 1999 [31]. Platt proposed a method to extract posterior probability from the SVM output. In binary classification with two classes +1 and -1, for input x, the posterior probability is calculated by the following formula: where A and B are two scalar parameters that are learned by the algorithm and function f is the output from the SVM. Moreover, Platt [31] suggests transforming the label y to target probabilities t + for positive samples and t − for negative samples: where N + and N − are the number of positive and negative samples, respectively. In order to train the SVM, we first partition the training data to 10 mutually exclusive partitions using pre-defined GC ranges. The GC ranges were decided by simply dividing the training data into 10 partitions. Previous research shows that using smaller ranges does not improve the prediction [16]. Thus, an ensemble of SVM models based on GC content is used for classification. Each SVM model will produce the probability that a given ORF is a gene, as shown in Fig. 1. Prior to training the SVM models we use a grid search approach in order to tune the parameters for the RBF kernel of the SVM classifier. Table 4 presents the best cost and gamma for each GC range. The best features from mRMR and the best parameters from the grid search are used to create the final SVM models.

Classification and post-processing
In this stage, all complete and incomplete ORFs are extracted from each input fragment. Based on the GC content of the fragment, appropriately 500 features are extracted from each ORF. These features are the same features that were used to build the model. Additionally, we extract three more features: GC content of the fragment, complete length, and incomplete length of the ORF. Then, the appropriate SVM model based on the GC content of the fragment is selected to score the ORF. The output from the SVM is the probability that a given ORF is a gene. We consider ORFs with probability greater than 0.5 as candidate genes. However, some of candidate genes can be overlapped and only one of them can be a gene. Genes in prokaryotes can maximally overlap by 45 bp [32]. Thus, a greedy algorithm [15,16] is used as a post-processing step to solve the overlap between candidate genes and select the final gene list. The candidate gene with highest probability is more likely to be a gene. Algorithm 1 describes the final candidate selection where g is the final gene list for a particular fragment and C contains the candidate list. To allow for direct comparison with other algorithms, we set the maximum overlap o max to be the minimum gene length which is 60 bp. The last step is to run the post-processing tool to correct the TIS, such as MetaTISA [27].

Algorithm 1
The final gene candidate selection 1: Input: Candidate gene list C for a particular fragment 2: Output: The final gene list g for a particular fragment 3: while C is non-empty do 4: find i max =argmax i P i with respect to all ORFs i in C 5: move ORF i max from C to g 6: remove all ORFs in C that overlap with ORF i max by more than o max bp 7: end while

Performance measures
Gene prediction performance is measured by comparing the model prediction with the true gene annotation in fragments that were obtained from GenBank [25]. Then, we count the number of true positives, false positives, and false negatives. True positive (TP) means the ORF correctly matched at least 60 bp in the same reading frame of annotated gene. False positive (FP) means the predicted ORF is incorrectly identified as a gene. False negative (FN) means an overlooked gene is incorrectly identified as non-coding. Then, we compute sensitivity, specificity, and harmonic means:

Results
In this section, we compare the performance of the proposed mRMR-SVM algorithm with the mRMR-neural network algorithm for all 11 datasets presented in Table 2. Table 5 presents the comparison of the SVM and neural networks using the testing data. The results, as shown in Table 5, indicate that the SVM has a higher average harmonic mean than the neural networks: 92.17% and 90.33%, respectively. Based on these results, we select the mRMR-SVM algorithm for comparison with state-of-the-art algorithms. We compare our algorithm with state-of-the-art algorithms, namely, Orphelia [15], MGC [16], and Prodigal [33], using the testing data. We test 10 random replicas per genome and then, we compute the mean and standard deviation for the specificity, sensitivity, and harmonic mean for each genome. Table 6 and Fig. 2 present the comparison results. Our method achieves an average specificity of 96.53%, a sensitivity of 88.31%, and a harmonic mean of 92.17%. Our algorithm outperforms Prodigal in terms of specificity, but Prodigal outperforms our algorithm in terms of sensitivity and harmonic mean. In addition, our algorithm outperforms Orphelia and MGC in terms of specificity, sensitivity, and harmonic mean. Meanwhile, our method outperforms Orphelia by an average of 11% and MGC by an average of 1%. Our experiments confirm the MGC hypothesis that building several models based on several GC content is better than building a single model.

Discussion
The aim of our study is to apply feature selection techniques to metagenomics gene prediction. The motivation for applying feature selection is to improve gene prediction,  reduce computational time, and increase domain understanding. Overall, the results provide important insights into using feature selection techniques in gene prediction. The experiments show the power of the mRMR-SVM framework. Furthermore, our experiments show that only a small number of features among thousands contribute to accurate gene prediction. mRMR selects the top 500 features and creates a balance between prediction accuracy and computational cost. Our method outperforms Prodigal in terms of specificity, and the overall performance of our method is higher than some prominent gene prediction programs, such as Orphelia and MGC. Additionally, our method and MGC achieve better results than Orphelia because both methods use several pre-defined GC classification models instead of a single model. There are some differences between our method and MGC. First, MGC uses a linear discriminant classifier method. Our method uses mRMR, which selects the features that correlate the strongest with a classification variable and that are mutually different from one another. Second, our method uses the SVM classifier, while MGC uses neural networks. Third, MGC has a feature called the Translation Initiation Site (TIS) score. In our study, we pick the leftmost TIS of each ORF-set, because the next step is to use the MetaTISA program [27] to correct the TIS.

Conclusion
We investigate the use of feature selection in gene prediction for metagenomics fragments. This is an important step toward enhancing the gene prediction process. We use filter feature selection methods because they scale well for high-dimensional data. We propose applying the mRMR algorithm to our data to reduce features and then apply the SVM to find the gene probability. Future work will investigate the use of deep learning to predict genes in metagenomics fragments. Deep learning is successfully used in bioinformatics and is able to handle a large number of features.