- Open Access
- Open Peer Review
Feature selection for gene prediction in metagenomic fragments
BioData Mining volume 11, Article number: 9 (2018)
Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms to distinguish between coding and non-coding sequences.
In this study, we apply a filter method to select relevant features from a large set of known features instead of combining them using linear classifiers or ignoring their individual coding potential. We use minimum redundancy maximum relevance (mRMR) to select the most relevant features. Support vector machines (SVM) are trained using these features, and the classification score is transformed into the posterior probability of the coding class. A greedy algorithm uses the probability of overlapped candidate genes to select the final genes. Instead of using one model for all sequences, we train an ensemble of SVM models on mutually exclusive datasets based on GC content and use the appropriated model to classify candidate genes based on their read’s GC content.
Our proposed algorithm achieves an improvement over some existing algorithms. mRMR produces promising results in gene prediction. It improves classification performance and feature interpretation. Our research serves as a basis for future studies on feature selection for gene prediction.
Metagenomics is the study of genetic information in uncultured organisms obtained directly from the environment [1–3]. The term metagenomics was coined in 1998 by Handelsman et al. as, the total genetic information of microbiota found in an environmental sample [4, 5]. Studies have shown that the number of species present in a metagenome can reach thousands of different species . Metagenomics analysis rely on different analysis pipelines in order to answer many questions such as identifying the organisms present in a given sample and what are these organisms doing.
Gene prediction is a fundamental step in most metagenomics analysis pipelines . Gene prediction is the process of locating genes in genomic sequences [6, 7]. Initially, studies identified genes through reliable experiments on living cells and organisms. However, it is usually an expensive and time-consuming task . Computational approaches are the most commonly used method for finding genes as they have proven their effectiveness in identifying genes in both genomes and metagenomes at a fraction of the cost and time of experimental approaches. Computational approaches are divided into similarity-based and content-based approaches [9, 10]. Similarity-based methods identify genes by searching for similar existing sequences. Basic Local Alignment Search Tool (BLAST)  is used to search for similarities between a candidate gene and existing known genes. However, this approach is expensive and cannot be used to discover novel genes or species. Content-based methods try to overcome these limitations using statistical approaches to detect variations between coding and non-coding regions [1, 8]. While, these approaches are very successful in genomic sequences, there is still work to be done for metagenomics due to the nature of the data [6, 12]. The greatest challenges for gene prediction algorithms in metagenomics are the short read-length and the incomplete and fragmented nature of the data [1, 13].
Machine learning is widely and successfully used in metagenomics analysis . Various methods for predicting genes based on machine learning algorithms were developed. Orphelia , Metagenomic Gene Caller (MGC) , and MetaGUN  are examples of such tools. Orphelia is a web-based program designed to predict genes in short DNA sequences that have unknown phylogenetic origins. First, Orphelia extracts all open reading frames (ORFs) from input DNA sequences. Then, all ORFs are scored using the Orphelia gene prediction model, which consists of a two-stage machine learning approach. In the first stage, some features from the ORF are extracted using monocodon usage, dicodon usage and translation initiation sites . Then, linear discriminants are employed to reduce the feature space. In the second stage, a neural network is used to combine the features from the previous step with the ORF length and GC content of the read. The neural network approximates the probability that an ORF encodes a protein. Finally, a greedy method is used to select the most likely genes from the ORFs that overlap by at most 60 bases . MGC  is an improvement over Orphelia. The MGC algorithm uses the same two-stage machine learning approach but creates separate classification models based on several pre-defined GC-content ranges. It uses the appropriate model to predict genes in a fragment based on its GC content. Moreover, MGC uses two new features based on amino-acid usage in order to improve the overall gene prediction accuracy . The MGC method shows that the use of separate learning models instead of a single model improves gene prediction performance. Both Orphelia  and MGC  use a linear discriminant classifier as a feature selection method that combines a large number of features to produce new features.
Feature selection can be considered as a preprocessing technique that aims to improve the performance of the classification, reduce training and build time and help to understand the domain [18–20]. Feature selection methods can be classified as wrapper, filter, embedded and hybrid methods according to the way that learning algorithms select features [20, 21]. Wrapping methods use supervised learning approaches to validate feature sets. Therefore, wrapper methods are computationally expensive and do not scale well to high-dimensional data [20, 22, 23]. In addition, search overhead, overfitting and low generality are other disadvantages of wrapper methods . Filter methods use general characteristics of the dataset without the involvement of supervised learning algorithms [20, 23, 24]. Filter methods have more generality, require less computation and scale well to high-dimensional data [19, 20]. Hybrid methods combine filter and wrapper methods. For example, filter methods are used to select a specific number of features. Then, wrapper methods are applied to choose the final best features . Filter methods is more suitable in our problem, because there are large number of features.
In this paper, we introduce a content-based approach that uses machine learning techniques to predict genes in metagenomic samples. We introduce a new method that use recent feature selection technique mRMR instead of combining features from single source into a new feature.
To evaluate the effectiveness of our approach, we use two datasets, including one for training and the other for testing the models. The training data were used by Orphelia  and the MGC . The training data contain 7 million ORFs in 700 bp fragments that were excised from 131 fully sequenced prokaryotic genomes (bacterial and archaeal)  and their gene annotations that obtained from GenBank . The 700 bp fragments are randomly excised to create 1-fold genome coverage from each training genome and 5-fold coverage for each genome in the testing dataset. Previous research prove the effectiveness of the use of separate learning model for several pre-defined GC ranges in increasing the prediction accuracy . The training dataset was split into 10 mutually exclusive parts based on pre-defined GC ranges as shown in Table 1. Each GC range contains around 700,000 sequences (ORFs). We used 100,000 sequences for feature selection and 600,000 sequences were used to build the classification model. The testing data contain three archaeal genomes and eight bacterial genomes. Table 2 lists the genomes that were used in testing with the number of ORFs in each genome. ORFs are extracted from fragments and classified into coding and non-coding based on annotations in the genomes. We refer to both complete and incomplete ORFs simply as ORFs. Complete ORFs are sequences that contain both start and stop codons, while incomplete ORFs are either missing upstream codons, downstream codons or both .
The proposed method
Our proposed method consists of four phases: feature extraction, feature selection, training, classification and a post processing phase, as shown in Fig. 1. First, we extract a large number of features from each candidate ORF. Using mRMR, we select the effective and relevant features which are then used by SVM classifier to approximate the posterior probability of each ORF coding for gene. Finally, we use a greedy approach to select the final gene list.
In order to distinguish coding from non-coding sequences, we extract commonly used features in gene prediction: mainly codon and amino acid usages [7, 15, 16]. In addition to combining these usages into small set of features, gene finders also use features related to the translation initiation sites (TIS) such as the position weight matrices (PWM) around candidate sites. However, since not all our candidate ORFs are complete, we will not extract any TIS related features but rather rely on post-processing techniques to correct the TIS in our predictions . The following shows the different categorizations of our features:
monocodon usage: The frequency of occurrences of each codon. Since there are 64 different codons, the monocodon usage produces 64 features.
dicodon usage: The frequency of pairs of successive half-overlapping codons. Dicodon usage produces 4,096 features.
monoamino acid usage: The frequency of occurrences of each amino acid . Since there are 20 amino acids, this usage produces 20 features.
diamino acid usage: The frequency of pairs of successive half-overlapping amino acids. Diamino acid usage produces 441 features.
In addition to features based on usage, we also consider the following three features:
ORF length ratios: The ratio between the length of the candidate ORF and its read length. Since we have complete and incomplete ORFs, we compute two features (complete length ratio and incomplete length ratio). If the candidate ORF is complete, then its incomplete length feature is set to zero and vice versa.
GC content: The percentage of cytosine and guanine in the read is assigned a feature for all candidate ORFs extracted from the particular read. Usually, coding regions have a higher GC content than non-coding regions .
Our method uses minimum redundancy maximum relevance (mRMR)  as it scales well to high-dimensional data and has promising results under different applications [29, 30]. In 2005, Peng et al. proposed the mRMR filter-method which aims to select features that are maximally dissimilar to each other but as similar as possible to the classification variable . Since our data are continuous, we use the F-test as a relevance measure and the Pearson correlation among variables as a redundancy measure. The maximum relevance of feature set S for class c is defined by the average value of all F-test values between the individual feature i and the class c as follows:
The minimum redundancy of all features in feature set S is defined by the average value of all Pearson correlations between the feature i and the feature j as follows:
where c(i,j) represents the Pearson correlation coefficient. We use an F-test with a correlation quotient (FCQ) as the mRMR optimization condition that combines the two above criteria of maximal relevance and minimal redundancy as follows:
We explore different mRMR feature sizes and compute k-nearest neighbor classification error rates for each feature set. Table 3 shows the error rate for different number of features. The results are based on sequences with GC content between 50.40 and 55.90. We select the top 500 features for the next phase of our algorithm since the classification gain from 300 to 500 features is not significant (less than.1%). For each GC range, we repeated the same experiment and selected the best 500 features to train SVM models.
We train support vector machines to produce the posterior probabilities of the coding class. The posterior probability P(class/input) is the probability of a class given evidence input. This technique was first introduced by Platt in 1999 . Platt proposed a method to extract posterior probability from the SVM output. In binary classification with two classes +1 and -1, for input x, the posterior probability is calculated by the following formula:
where A and B are two scalar parameters that are learned by the algorithm and function f is the output from the SVM. Moreover, Platt  suggests transforming the label y to target probabilities t+ for positive samples and t− for negative samples:
where N+ and N− are the number of positive and negative samples, respectively.
In order to train the SVM, we first partition the training data to 10 mutually exclusive partitions using pre-defined GC ranges. The GC ranges were decided by simply dividing the training data into 10 partitions. Previous research shows that using smaller ranges does not improve the prediction . Thus, an ensemble of SVM models based on GC content is used for classification. Each SVM model will produce the probability that a given ORF is a gene, as shown in Fig. 1. Prior to training the SVM models we use a grid search approach in order to tune the parameters for the RBF kernel of the SVM classifier. Table 4 presents the best cost and gamma for each GC range. The best features from mRMR and the best parameters from the grid search are used to create the final SVM models.
Classification and post-processing
In this stage, all complete and incomplete ORFs are extracted from each input fragment. Based on the GC content of the fragment, appropriately 500 features are extracted from each ORF. These features are the same features that were used to build the model. Additionally, we extract three more features: GC content of the fragment, complete length, and incomplete length of the ORF. Then, the appropriate SVM model based on the GC content of the fragment is selected to score the ORF. The output from the SVM is the probability that a given ORF is a gene. We consider ORFs with probability greater than 0.5 as candidate genes. However, some of candidate genes can be overlapped and only one of them can be a gene. Genes in prokaryotes can maximally overlap by 45 bp . Thus, a greedy algorithm [15, 16] is used as a post-processing step to solve the overlap between candidate genes and select the final gene list. The candidate gene with highest probability is more likely to be a gene. Algorithm 1 describes the final candidate selection where g is the final gene list for a particular fragment and C contains the candidate list. To allow for direct comparison with other algorithms, we set the maximum overlap o max to be the minimum gene length which is 60 bp. The last step is to run the post-processing tool to correct the TIS, such as MetaTISA .
Results and discussion
Gene prediction performance is measured by comparing the model prediction with the true gene annotation in fragments that were obtained from GenBank . Then, we count the number of true positives, false positives, and false negatives. True positive (TP) means the ORF correctly matched at least 60 bp in the same reading frame of annotated gene. False positive (FP) means the predicted ORF is incorrectly identified as a gene. False negative (FN) means an overlooked gene is incorrectly identified as non-coding. Then, we compute sensitivity, specificity, and harmonic means:
In this section, we compare the performance of the proposed mRMR-SVM algorithm with the mRMR-neural network algorithm for all 11 datasets presented in Table 2. Table 5 presents the comparison of the SVM and neural networks using the testing data. The results, as shown in Table 5, indicate that the SVM has a higher average harmonic mean than the neural networks: 92.17% and 90.33%, respectively. Based on these results, we select the mRMR-SVM algorithm for comparison with state-of-the-art algorithms.
We compare our algorithm with state-of-the-art algorithms, namely, Orphelia , MGC , and Prodigal , using the testing data. We test 10 random replicas per genome and then, we compute the mean and standard deviation for the specificity, sensitivity, and harmonic mean for each genome. Table 6 and Fig. 2 present the comparison results. Our method achieves an average specificity of 96.53%, a sensitivity of 88.31%, and a harmonic mean of 92.17%. Our algorithm outperforms Prodigal in terms of specificity, but Prodigal outperforms our algorithm in terms of sensitivity and harmonic mean. In addition, our algorithm outperforms Orphelia and MGC in terms of specificity, sensitivity, and harmonic mean. Meanwhile, our method outperforms Orphelia by an average of 11% and MGC by an average of 1%. Our experiments confirm the MGC hypothesis that building several models based on several GC content is better than building a single model.
The aim of our study is to apply feature selection techniques to metagenomics gene prediction. The motivation for applying feature selection is to improve gene prediction, reduce computational time, and increase domain understanding. Overall, the results provide important insights into using feature selection techniques in gene prediction. The experiments show the power of the mRMR-SVM framework. Furthermore, our experiments show that only a small number of features among thousands contribute to accurate gene prediction. mRMR selects the top 500 features and creates a balance between prediction accuracy and computational cost. Our method outperforms Prodigal in terms of specificity, and the overall performance of our method is higher than some prominent gene prediction programs, such as Orphelia and MGC. Additionally, our method and MGC achieve better results than Orphelia because both methods use several pre-defined GC classification models instead of a single model. There are some differences between our method and MGC. First, MGC uses a linear discriminant classifier method. Our method uses mRMR, which selects the features that correlate the strongest with a classification variable and that are mutually different from one another. Second, our method uses the SVM classifier, while MGC uses neural networks. Third, MGC has a feature called the Translation Initiation Site (TIS) score. In our study, we pick the leftmost TIS of each ORF-set, because the next step is to use the MetaTISA program  to correct the TIS.
We investigate the use of feature selection in gene prediction for metagenomics fragments. This is an important step toward enhancing the gene prediction process. We use filter feature selection methods because they scale well for high-dimensional data. We propose applying the mRMR algorithm to our data to reduce features and then apply the SVM to find the gene probability. Future work will investigate the use of deep learning to predict genes in metagenomics fragments. Deep learning is successfully used in bioinformatics and is able to handle a large number of features.
Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010; 6(2):1000667.
Thomas T, Gilbert J, Meyer F. Metagenomics-a guide from sampling to data analysis. Microb Inform Experimentation. 2012; 2(1):3.
Bashir Y, Pradeep Singh S, Kumar Konwar B. Metagenomics: An application based perspective. Chin J Biol. 2014; 2014.
Di Bella JM, Bao Y, Gloor GB, Burton JP, Reid G. High throughput sequencing methods and analysis for microbiome research. J Microbiol Meth. 2013; 95(3):401–14.
Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004; 68(4):669–85.
Sharpton TJ. An introduction to the analysis of shotgun metagenomic data. Front Plant Sci. 2014; 5.
Jones NC, Pevzner P. An Introduction to Bioinformatics Algorithms, 1st edn; 2004.
Angelova M, Kalajdziski S, Kocarev L. Computational methods for gene finding in prokaryotes. ICT Innovations. 2010:11–20.
Mathé C, Sagot M-F, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 2002; 30(19):4103–17.
Wang Z, Chen Y, Li Y. A brief review of computational gene prediction methods. Genomics Proteomics Bioinform. 2004; 2(4):216–21.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403–10.
Rangwala H, Charuvaka A, Rasheed Z. Machine learning approaches for metagenomics. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer: 2014. p. 512–5.
Hyatt D, LoCascio PF, Hauser LJ, Uberbacher EC. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics. 2012; 28(17):2223–30.
Soueidan H, Nikolski M. Machine learning for metagenomics: methods and tools. Metagenomics. 2017; 1(1).
Hoff KJ, Tech M, Lingner T, Daniel R, Morgenstern B, Meinicke P. Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics. 2008; 9(1):217.
El Allali A, Rose JR. Mgc: a metagenomic gene caller. BMC Bioinformatics. 2013; 14(9):6.
Liu Y, Guo J, Hu G, Zhu H. Gene prediction in metagenomic fragments based on the svm algorithm. BMC Bioinformatics. 2013; 14(5):12.
Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014; 40(1):16–28.
Das S. Filters, wrappers and a boosting-based hybrid for feature selection. In: ICML, vol. 1: 2001. p. 74–81.
Asir D, Appavu S, Jebamalar E. Literature review on feature selection methods for high-dimensional data. Int J Comput Appl. 2016; 136(1):9–17.
Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007; 23(19):2507–17.
Saeys Y, Degroeve S, Aeyels D, Rouzé P, Van de Peer Y. Selecting relevant features for gene structure prediction. In: Proceedings of Benelearn 2004. VUB Press: 2004. p. 103–9.
Yu L, Liu H. Feature selection for high-dimensional data: A fast correlation-based filter solution. In: ICML, vol. 3: 2003. p. 856–63.
Sánchez-Maroño N, Alonso-Betanzos A, Tombilla-Sanromán M. Filter methods for feature selection–a comparative study. In: Intelligent Data Engineering and Automated Learning-IDEAL 2007: 2007. p. 178–87.
Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. Genbank. Nucleic Acids Res. 2013; 41(D1):36–42.
Hoff KJ, Lingner T, Meinicke P, Tech M. Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res. 2009; 37(suppl 2):101–5.
Hu G-Q, Guo J-T, Liu Y-C, Zhu H. Metatisa: metagenomic translation initiation site annotator for improving gene start prediction. Bioinformatics. 2009; 25(14):1843–5.
Goés F, Alves R, Corrêa L, Chaparro C, Thom L. A comparison of classification methods for gene prediction in metagenomics. In: The International Workshop on New Frontiers in Mining Complex Patterns (NFmcp). The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD). Nancy: 2014.
Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005; 27(8):1226–38.
Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol. 2005; 3(02):185–205.
Platt J, et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classifiers. 1999; 10(3):61–74.
Warren AS, Setubal JC. The genome reverse compiler: an explorative annotation tool. BMC Bioinformatics. 2009; 10(1):35.
Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010; 11(1):119.
The authors gratefully acknowledge use of the service of "SANAM" supercomputer at "King Abdulaziz City for Science and Technology" (KACST), Saudi Arabia.
This research project is supported by a grant from the "King Abdulaziz City for Science and Technology" (KACST), Saudi Arabia (Grant No. 1-17-02-001-0025).
Availability of data and materials
The datasets for training and testing are available in the Orphelia website http://orphelia.gobics.de/datasets.jsp.
Ethics approval and consent to participate
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.