Feature selection for gene prediction in metagenomic fragments
© The Author(s) 2018
Received: 14 January 2018
Accepted: 1 May 2018
Published: 7 June 2018
Computational approaches, specifically machine-learning techniques, play an important role in many metagenomic analysis algorithms, such as gene prediction. Due to the large feature space, current de novo gene prediction algorithms use different combinations of classification algorithms to distinguish between coding and non-coding sequences.
In this study, we apply a filter method to select relevant features from a large set of known features instead of combining them using linear classifiers or ignoring their individual coding potential. We use minimum redundancy maximum relevance (mRMR) to select the most relevant features. Support vector machines (SVM) are trained using these features, and the classification score is transformed into the posterior probability of the coding class. A greedy algorithm uses the probabilities of overlapping candidate genes to select the final genes. Instead of using one model for all sequences, we train an ensemble of SVM models on mutually exclusive datasets based on GC content and use the appropriate model to classify each candidate gene based on the GC content of its read.
Our proposed algorithm achieves an improvement over some existing algorithms. mRMR produces promising results in gene prediction. It improves classification performance and feature interpretation. Our research serves as a basis for future studies on feature selection for gene prediction.
Metagenomics is the study of genetic information in uncultured organisms obtained directly from the environment [1–3]. The term metagenomics was coined in 1998 by Handelsman et al. to denote the total genetic information of the microbiota found in an environmental sample [4, 5]. Studies have shown that a single metagenome can contain thousands of different species. Metagenomic analysis relies on different analysis pipelines to answer questions such as which organisms are present in a given sample and what these organisms are doing.
Gene prediction is a fundamental step in most metagenomics analysis pipelines. Gene prediction is the process of locating genes in genomic sequences [6, 7]. Initially, studies identified genes through reliable experiments on living cells and organisms. However, this is usually expensive and time-consuming. Computational approaches are the most commonly used method for finding genes, as they have proven effective in identifying genes in both genomes and metagenomes at a fraction of the cost and time of experimental approaches. Computational approaches are divided into similarity-based and content-based approaches [9, 10]. Similarity-based methods identify genes by searching for similar existing sequences. For example, the Basic Local Alignment Search Tool (BLAST) is used to search for similarities between a candidate gene and known genes. However, this approach is computationally expensive and cannot discover novel genes or species. Content-based methods try to overcome these limitations by using statistical approaches to detect variations between coding and non-coding regions [1, 8]. While these approaches are very successful on genomic sequences, there is still work to be done for metagenomics due to the nature of the data [6, 12]. The greatest challenges for gene prediction algorithms in metagenomics are the short read length and the incomplete and fragmented nature of the data [1, 13].
Machine learning is widely and successfully used in metagenomics analysis. Various methods for predicting genes based on machine learning algorithms have been developed. Orphelia, Metagenomic Gene Caller (MGC), and MetaGUN are examples of such tools. Orphelia is a web-based program designed to predict genes in short DNA sequences of unknown phylogenetic origin. First, Orphelia extracts all open reading frames (ORFs) from the input DNA sequences. Then, all ORFs are scored using the Orphelia gene prediction model, which consists of a two-stage machine learning approach. In the first stage, features are extracted from each ORF using monocodon usage, dicodon usage and translation initiation sites, and linear discriminants are employed to reduce the feature space. In the second stage, a neural network combines the features from the first stage with the ORF length and the GC content of the read. The neural network approximates the probability that an ORF encodes a protein. Finally, a greedy method is used to select the most likely genes from the ORFs that overlap by at most 60 bases. MGC is an improvement over Orphelia. The MGC algorithm uses the same two-stage machine learning approach but creates separate classification models for several pre-defined GC-content ranges. It uses the appropriate model to predict genes in a fragment based on its GC content. Moreover, MGC uses two new features based on amino-acid usage to improve the overall gene prediction accuracy. The MGC method shows that using separate learning models instead of a single model improves gene prediction performance. Both Orphelia and MGC use a linear discriminant classifier as a feature selection method that combines a large number of features to produce new features.
Feature selection can be considered a preprocessing technique that aims to improve classification performance, reduce training and model-building time, and help in understanding the domain [18–20]. Feature selection methods can be classified as wrapper, filter, embedded and hybrid methods according to the way that learning algorithms select features [20, 21]. Wrapper methods use supervised learning approaches to validate feature sets. Therefore, wrapper methods are computationally expensive and do not scale well to high-dimensional data [20, 22, 23]. In addition, search overhead, overfitting and low generality are other disadvantages of wrapper methods. Filter methods use general characteristics of the dataset without involving supervised learning algorithms [20, 23, 24]. Filter methods are more general, require less computation and scale well to high-dimensional data [19, 20]. Hybrid methods combine filter and wrapper methods. For example, a filter method is first used to select a specific number of features, and a wrapper method is then applied to choose the final best features. Filter methods are more suitable for our problem because of the large number of features.
In this paper, we introduce a content-based approach that uses machine learning techniques to predict genes in metagenomic samples. We introduce a new method that uses a recent feature selection technique, mRMR, instead of combining features from a single source into a new feature.
Table columns: GC content ranges; Number of ORFs
Table columns: GenBank accession no.; Number of ORFs
The proposed method
To distinguish coding from non-coding sequences, we extract features commonly used in gene prediction, mainly codon and amino acid usages [7, 15, 16]. In addition to combining these usages into a small set of features, gene finders also use features related to the translation initiation sites (TIS), such as the position weight matrices (PWM) around candidate sites. However, since not all of our candidate ORFs are complete, we do not extract any TIS-related features but instead rely on post-processing techniques to correct the TIS in our predictions. Our features fall into the following categories:
monocodon usage: The frequency of occurrences of each codon. Since there are 64 different codons, the monocodon usage produces 64 features.
dicodon usage: The frequency of pairs of successive half-overlapping codons. Dicodon usage produces 4,096 features.
monoamino acid usage: The frequency of occurrences of each amino acid. Since there are 20 amino acids, this usage produces 20 features.
diamino acid usage: The frequency of pairs of successive half-overlapping amino acids. Diamino acid usage produces 441 features.
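The two codon-based usages can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the ORF is assumed to be given in frame, and a dicodon is taken to be a pair of consecutive codons, so that successive pairs share one codon. The same counting helper could be applied to a translated sequence to obtain the amino acid usages.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]   # 64 codons
DICODONS = [a + b for a in CODONS for b in CODONS]        # 4,096 codon pairs


def split_codons(orf):
    """Split an in-frame ORF into codons, dropping a trailing partial codon."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def usage_vector(items, alphabet):
    """Relative frequency of each alphabet symbol among the observed items.

    Items outside the alphabet (e.g. codons containing N) are ignored.
    """
    counts = Counter(items)
    total = sum(counts[a] for a in alphabet)
    if total == 0:
        return [0.0] * len(alphabet)
    return [counts[a] / total for a in alphabet]


def monocodon_usage(orf):
    """64 features: frequency of each codon in the ORF."""
    return usage_vector(split_codons(orf), CODONS)


def dicodon_usage(orf):
    """4,096 features: frequency of each pair of consecutive codons."""
    codons = split_codons(orf)
    pairs = [codons[i] + codons[i + 1] for i in range(len(codons) - 1)]
    return usage_vector(pairs, DICODONS)
```

For the toy ORF `ATGAAAATGTAA`, the codons are ATG, AAA, ATG and TAA, so ATG receives a monocodon frequency of 0.5 and each of the three observed codon pairs receives a dicodon frequency of 1/3.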
In addition to features based on usage, we also consider the following three features:
ORF length ratios: The ratio between the length of the candidate ORF and its read length. Since we have complete and incomplete ORFs, we compute two features (complete length ratio and incomplete length ratio). If the candidate ORF is complete, then its incomplete length feature is set to zero and vice versa.
GC content: The percentage of cytosine and guanine in the read is assigned as a feature to all candidate ORFs extracted from that read. Usually, coding regions have a higher GC content than non-coding regions.
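These three features are simple to compute. The sketch below is illustrative only: the function names are hypothetical, and GC content is expressed on a 0 to 100 scale because the text describes it as a percentage.

```python
def gc_content(read):
    """Percentage of G and C bases in a read (0.0 for an empty read)."""
    read = read.upper()
    if not read:
        return 0.0
    return 100.0 * sum(1 for base in read if base in "GC") / len(read)


def length_features(orf_length, read_length, is_complete):
    """Return (complete length ratio, incomplete length ratio).

    Only the entry matching the ORF type is non-zero; the other is set to zero.
    """
    ratio = orf_length / read_length
    return (ratio, 0.0) if is_complete else (0.0, ratio)
```

For example, a complete 300 bp ORF in a 600 bp read yields `(0.5, 0.0)`, while an incomplete ORF of the same length yields `(0.0, 0.5)`.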
Figure: Classification error rates vs. number of features (x-axis: mRMR feature-set size)
where N+ and N− are the number of positive and negative samples, respectively.
Best RBF parameters for each GC range
Classification and post-processing
In this stage, all complete and incomplete ORFs are extracted from each input fragment. Based on the GC content of the fragment, the appropriate 500 features are extracted from each ORF. These are the same features that were used to build the model. Additionally, we extract three more features: the GC content of the fragment, and the complete length and incomplete length of the ORF. Then, the appropriate SVM model is selected based on the GC content of the fragment and used to score the ORF. The output from the SVM is the probability that a given ORF is a gene. We consider ORFs with a probability greater than 0.5 as candidate genes. However, some candidate genes may overlap, in which case only one of them can be a gene. Genes in prokaryotes can overlap by at most 45 bp. Thus, a greedy algorithm [15, 16] is used as a post-processing step to resolve overlaps between candidate genes and select the final gene list, on the basis that the candidate with the highest probability is the most likely to be a gene. Algorithm 1 describes the final candidate selection, where g is the final gene list for a particular fragment and C contains the candidate list. To allow for direct comparison with other algorithms, we set the maximum overlap o_max to the minimum gene length, which is 60 bp. The last step is to run a post-processing tool, such as MetaTISA, to correct the TIS.
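The greedy selection step can be sketched as follows. This is a hedged illustration of the procedure described in the text, not a reproduction of Algorithm 1: the tuple layout `(start, end, probability)`, the inclusive coordinates and the function names are assumptions, and strand information is ignored for brevity.

```python
def overlap(a, b):
    """Number of positions shared by two inclusive intervals (start, end, ...)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def select_genes(candidates, o_max=60, threshold=0.5):
    """Greedily pick the final genes for one fragment.

    candidates: iterable of (start, end, probability) tuples.
    ORFs scoring above `threshold` are visited from highest to lowest
    probability; an ORF is kept only if it overlaps every gene already
    kept by at most `o_max` bases.
    """
    pool = sorted((c for c in candidates if c[2] > threshold),
                  key=lambda c: c[2], reverse=True)
    genes = []
    for cand in pool:
        if all(overlap(cand, g) <= o_max for g in genes):
            genes.append(cand)
    return sorted(genes)  # report genes in positional order
```

With candidates `(1, 300, 0.9)`, `(200, 500, 0.8)` and `(260, 600, 0.7)`, the first is kept, the second is rejected because it overlaps the first by 101 bases, and the third is kept because it overlaps the first by only 41 bases.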
Results and discussion
Comparison of SVM and neural network on testing data
Comparison of our method, Orphelia, MGC and Prodigal on testing data
Our method (SVM)
The aim of our study is to apply feature selection techniques to metagenomic gene prediction. The motivation for applying feature selection is to improve gene prediction, reduce computational time, and increase domain understanding. Overall, the results provide important insights into using feature selection techniques in gene prediction. The experiments show the power of the mRMR-SVM framework. Furthermore, our experiments show that only a small number of features among thousands contribute to accurate gene prediction. Selecting the top 500 features with mRMR strikes a balance between prediction accuracy and computational cost. Our method outperforms Prodigal in terms of specificity, and its overall performance is higher than that of some prominent gene prediction programs, such as Orphelia and MGC. Additionally, our method and MGC achieve better results than Orphelia because both methods use several pre-defined GC classification models instead of a single model. There are three differences between our method and MGC. First, MGC uses a linear discriminant classifier to combine features, whereas our method uses mRMR, which selects the features that correlate most strongly with the classification variable and that are mutually different from one another. Second, our method uses an SVM classifier, while MGC uses neural networks. Third, MGC has a feature called the Translation Initiation Site (TIS) score. In our study, we pick the leftmost TIS of each ORF-set, because the next step is to use the MetaTISA program to correct the TIS.
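The selection principle behind mRMR, high relevance to the class and low redundancy among the chosen features, can be illustrated with a small greedy sketch. Several points here are assumptions rather than details from this article: the difference form of the criterion (relevance minus mean redundancy) is used, features are assumed to be already discretized, and the function names are hypothetical. Continuous usage frequencies would have to be binned before applying it.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mrmr(features, labels, k):
    """Greedy mRMR selection (difference form; an illustrative assumption).

    features: dict mapping feature name -> list of discrete values.
    Returns up to k feature names in the order they were selected.
    """
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected, remaining = [], list(features)

    def score(f):
        # Relevance to the class minus mean redundancy with selected features.
        if not selected:
            return relevance[f]
        redundancy = sum(mutual_information(features[f], features[s])
                         for s in selected) / len(selected)
        return relevance[f] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In a toy dataset where feature `b` is an exact copy of feature `a`, the sketch picks `a` first and then prefers a weaker but non-redundant feature over `b`, which is the behaviour that distinguishes mRMR from ranking by relevance alone.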
We investigate the use of feature selection in gene prediction for metagenomic fragments. This is an important step toward enhancing the gene prediction process. We use filter feature selection methods because they scale well to high-dimensional data. We propose applying the mRMR algorithm to our data to reduce the number of features and then applying an SVM to estimate the gene probability. Future work will investigate the use of deep learning to predict genes in metagenomic fragments. Deep learning has been used successfully in bioinformatics and is able to handle a large number of features.
The authors gratefully acknowledge use of the service of "SANAM" supercomputer at "King Abdulaziz City for Science and Technology" (KACST), Saudi Arabia.
This research project is supported by a grant from the "King Abdulaziz City for Science and Technology" (KACST), Saudi Arabia (Grant No. 1-17-02-001-0025).
Availability of data and materials
The datasets for training and testing are available on the Orphelia website: http://orphelia.gobics.de/datasets.jsp.
AA and AE conceived of the project. AA designed and implemented the work. AE helped in the design and provided expert input. Both authors read and approved the final manuscript.
Ethics approval and consent to participate
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010; 6(2):1000667.
- Thomas T, Gilbert J, Meyer F. Metagenomics-a guide from sampling to data analysis. Microb Inform Experimentation. 2012; 2(1):3.
- Bashir Y, Pradeep Singh S, Kumar Konwar B. Metagenomics: An application based perspective. Chin J Biol. 2014; 2014.
- Di Bella JM, Bao Y, Gloor GB, Burton JP, Reid G. High throughput sequencing methods and analysis for microbiome research. J Microbiol Meth. 2013; 95(3):401–14.
- Handelsman J. Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev. 2004; 68(4):669–85.
- Sharpton TJ. An introduction to the analysis of shotgun metagenomic data. Front Plant Sci. 2014; 5.
- Jones NC, Pevzner P. An Introduction to Bioinformatics Algorithms, 1st edn; 2004.
- Angelova M, Kalajdziski S, Kocarev L. Computational methods for gene finding in prokaryotes. ICT Innovations. 2010:11–20.
- Mathé C, Sagot M-F, Schiex T, Rouzé P. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 2002; 30(19):4103–17.
- Wang Z, Chen Y, Li Y. A brief review of computational gene prediction methods. Genomics Proteomics Bioinform. 2004; 2(4):216–21.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403–10.
- Rangwala H, Charuvaka A, Rasheed Z. Machine learning approaches for metagenomics. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer: 2014. p. 512–5.
- Hyatt D, LoCascio PF, Hauser LJ, Uberbacher EC. Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics. 2012; 28(17):2223–30.
- Soueidan H, Nikolski M. Machine learning for metagenomics: methods and tools. Metagenomics. 2017; 1(1).
- Hoff KJ, Tech M, Lingner T, Daniel R, Morgenstern B, Meinicke P. Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinformatics. 2008; 9(1):217.
- El Allali A, Rose JR. Mgc: a metagenomic gene caller. BMC Bioinformatics. 2013; 14(9):6.
- Liu Y, Guo J, Hu G, Zhu H. Gene prediction in metagenomic fragments based on the svm algorithm. BMC Bioinformatics. 2013; 14(5):12.
- Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng. 2014; 40(1):16–28.
- Das S. Filters, wrappers and a boosting-based hybrid for feature selection. In: ICML, vol. 1: 2001. p. 74–81.
- Asir D, Appavu S, Jebamalar E. Literature review on feature selection methods for high-dimensional data. Int J Comput Appl. 2016; 136(1):9–17.
- Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007; 23(19):2507–17.
- Saeys Y, Degroeve S, Aeyels D, Rouzé P, Van de Peer Y. Selecting relevant features for gene structure prediction. In: Proceedings of Benelearn 2004. VUB Press: 2004. p. 103–9.
- Yu L, Liu H. Feature selection for high-dimensional data: A fast correlation-based filter solution. In: ICML, vol. 3: 2003. p. 856–63.
- Sánchez-Maroño N, Alonso-Betanzos A, Tombilla-Sanromán M. Filter methods for feature selection–a comparative study. In: Intelligent Data Engineering and Automated Learning-IDEAL 2007: 2007. p. 178–87.
- Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. Genbank. Nucleic Acids Res. 2013; 41(D1):36–42.
- Hoff KJ, Lingner T, Meinicke P, Tech M. Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res. 2009; 37(suppl 2):101–5.
- Hu G-Q, Guo J-T, Liu Y-C, Zhu H. Metatisa: metagenomic translation initiation site annotator for improving gene start prediction. Bioinformatics. 2009; 25(14):1843–5.
- Goés F, Alves R, Corrêa L, Chaparro C, Thom L. A comparison of classification methods for gene prediction in metagenomics. In: The International Workshop on New Frontiers in Mining Complex Patterns (NFmcp). The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD). Nancy: 2014.
- Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005; 27(8):1226–38.
- Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol. 2005; 3(02):185–205.
- Platt J, et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classifiers. 1999; 10(3):61–74.
- Warren AS, Setubal JC. The genome reverse compiler: an explorative annotation tool. BMC Bioinformatics. 2009; 10(1):35.
- Hyatt D, Chen G-L, LoCascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010; 11(1):119.