Supervised DNA Barcodes species classification: analysis, comparisons and results
© Weitschek et al.; licensee BioMed Central Ltd. 2014
Received: 18 November 2013
Accepted: 5 April 2014
Published: 11 April 2014
Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcode: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms.
In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with respect to ad-hoc and well-established DNA Barcode classification methods.
A software that converts the DNA Barcode FASTA sequences to the Weka format is released, to adapt different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform on average the other considered classifiers, although they do not provide a human interpretable classification model. Rule-based methods have slightly inferior classification performances, but deliver the species specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performances with respect to the traditional DNA Barcode classification methods. On empirical data their classification performances are at a comparable level to the other methods.
The classification analysis shows that supervised machine learning methods are promising candidates for handling with success the DNA Barcoding species classification problem, obtaining excellent performances. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community.
Keywords: DNA Barcoding, Supervised classification methods, Species identification
In 2003 Hebert et al. proposed DNA Barcoding as a technique to identify species. Specific fragments, coming from short portions of mitochondrial, nuclear and plastid DNA, have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. The following gene regions are chosen as Barcodes: cytochrome C Oxidase subunit I (COI) for animals, rbcL and matK for plants, and the Internal Transcribed Spacer (ITS) for fungi.
Taxonomists identify biological specimens by morphological features; in some difficult cases, however, the identification becomes complex even for experts. DNA Barcoding addresses this problem, because it is able to distinguish species and identify specimens (including incomplete, damaged or immature ones) using a very short gene sequence that can be easily obtained from tiny amounts of tissue.
It is now recognized that a DNA Barcode provides sufficient information to classify a specimen to species, showing a high variability even among closely related species [5, 6]. Thus, since 2004 the International Barcode Of Life project (IBOL) and the Consortium for the Barcode Of Life (CBOL) have promoted international initiatives devoted to the development of DNA Barcoding as a global standard for the identification of biological species, aiming to build up an online freely available sequence database (http://www.barcodinglife.org).
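DNA Barcode classification methods can be grouped into three main families: tree-based methods, similarity-based methods, and character-based methods (also called “diagnostic methods”).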
Tree-based methods assign unidentified Barcodes (query) to species based on their membership of clusters in a DNA Barcode tree. This approach can be achieved, for example, with Parsimony (PAR), Neighbor Joining (NJ), or Bayesian Inference. Similarity-based methods (e.g., BLAST, NN, and TaxonDNA) assign query Barcodes to species based on how many DNA Barcode characters they have in common. Character-based methods (e.g., DNA-BAR, BLOG, CAOS, BRONX [17, 18], PTIGS-IdIt, Linker, Alignment-free analytics) rely on the presence or absence of particular characters in DNA Barcode sequences for identification, instead of using them all.
The DNA Barcode classification problem may be approached as a supervised machine learning problem in the following way: given a reference library composed of DNA Barcode specimen sequences of known species and a collection of unknown DNA Barcode sequences (query set), assign each of the latter to one of the species present in the library.
This formulation requires (i) a set of training examples (referred to in the following as the reference set) containing specimens with a priori known species membership, and (ii) a test set (referred to in the following as the query set) containing specimens that require classification.
The paper  includes a high level description of some supervised machine learning methods (Nearest Neighbor, CART, Random Forest and Kernel Functions), but an analysis framework and software are not provided.
In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown through a performance comparison with ad-hoc DNA Barcode analysis methods. The Weka machine learning software, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Different types of classifiers (trees, rules, lazy learners, Bayesian and functions) are tested on publicly available synthetic and empirical datasets belonging to the animal, plant, and fungus kingdoms. In particular, the function-based method Support Vector Machines (SMO), the rule-based RIPPER (Jrip), the decision tree C4.5 (J48), and the Bayesian-based method Naïve Bayes are considered.
The supervised machine learning algorithms
The Weka tools collection for Machine Learning and Data Mining analysis is used to approach the species classification problem with DNA Barcode sequences. Weka (Waikato Environment for Knowledge Analysis) is a Java open source package that collects the most popular algorithms to handle classification, numeric prediction, or clustering problems. Among the several packages collected in Weka, the “weka.classifiers” package includes the implementation of classification and prediction algorithms, and contains the central “Classifier” class. The latter defines the structure of any classification or prediction scheme and is built around two methods, buildClassifier() and classifyInstance(), which every supervised machine learning algorithm has to implement.
Weka algorithms collection
Support Vector Machines
Model tree generators
Regression via classification
Locally weighted regression
Classification via regression
Cost sensitive classification
Kind of classification
Bayesian network (e.g., Naïve Bayes)
Linear regression, neural networks, support vector machine
Instance-based similarity (e.g., Nearest neighbor algorithm)
Bagging, boosting, stacking, regression through classification, classification through regression, cost sensitive classification
Tree classifier (e.g., decision tree)
Algorithms that handle multi-instance data
Various classifiers that do not fit in any other category
Among the Weka classifiers the following methods are tested on DNA Barcode sequences: (i) the function-based method Support Vector Machines (SMO); (ii) the rule-based RIPPER (Jrip); (iii) the decision tree C4.5 (J48); and (iv) the Bayesian-based method Naïve Bayes.
SMO is the Weka implementation of the supervised learning function-based method Support Vector Machines (SVM). SMO is a discriminative classifier that converts the reference data objects into multi-dimensional vectors and defines a separating hyperplane among the objects belonging to different classes, i.e., given labeled training data, the algorithm outputs an optimal hyperplane that separates the classes with the largest minimum distance. After a proper vector transformation, new objects from the query set are evaluated according to this separating hyperplane. For example, for a linearly separable set of 2D-points which belong to one of two classes, the SVM finds a separating line such that points of the same class lie in the same half-plane. One of the most relevant features of the SVM is that a non-linear transformation of the input space can be applied very efficiently through a Kernel function. SMO usually achieves high classification accuracy, but its main drawback is that no human readable classification model is provided as output.
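The notion of "largest minimum distance" can be made concrete with a small sketch. The code below is not the SMO algorithm: it only measures the margin of a candidate separating line on the 2D two-class example mentioned above. The points, labels and candidate lines are made up for illustration.

```python
import math

def margin(points, labels, w, b):
    """Smallest signed distance from any point to the line w.x + b = 0.

    Labels are +1 or -1; the result is negative if some point is on the wrong side.
    """
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm
               for x, y in zip(points, labels))

# Toy, linearly separable data (hypothetical values).
points = [(0, 0), (0, 1), (2, 0), (2, 1)]
labels = [-1, -1, 1, 1]

margin_centered = margin(points, labels, (1, 0), -1.0)   # line x = 1
margin_shifted = margin(points, labels, (1, 0), -0.5)    # line x = 0.5
```

Both lines separate the two classes, but the centered one leaves a larger minimum distance to the closest points, which is the criterion an SVM optimizes.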
Jrip (RIPPER) implements a propositional rule learner, Repeated Incremental Pruning to Produce Error Reduction, which was proposed by William W. Cohen. The algorithm performs two main phases: the first one builds an initial set of rules and the second one optimizes the rule set k times (typically k is set to 2). Specifically, the classes are examined in order of increasing size and an initial set of rules for each class is generated using incremental reduced error pruning. Then, all the examples of a particular judgment in the training data are treated as a class, and a set of rules that covers all the members of that class is found. Thereafter, the algorithm proceeds to the next class, repeating the same procedure until all classes have been covered. This method is a good candidate for DNA Barcoding as it provides a classification model composed of logic rules for each species in the dataset, which can be used to compactly characterize the analyzed specimens.
J48 is a supervised classification method belonging to the decision trees family. In particular, it is the Weka implementation of the decision tree algorithm C4.5, which greedily looks for the best split and the best feature at each node in terms of the information gain measure. A decision tree is a simple tree structure whose non-terminal vertices represent tests on one or more attributes, while the terminal ones reflect the results of the decision. The key advantages of decision trees are the following: (i) they are simple and easily convertible into a set of rules; (ii) both numerical and categorical data can be classified (even if the output attribute must be categorical); (iii) there are no a priori assumptions about the nature of the features (e.g., qualitative, quantitative, ordinal data). However, decision trees are unstable (i.e., variations in the training data can cause different sets of attributes to be chosen) and generally multiple output attributes are not allowed. This method also gives a classification model as output (the decision tree), which can be easily read as a set of logic rules composed of sequence positions and nucleotide assignments.
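As an illustration of the information gain measure named above, the following minimal sketch scores each sequence position of a toy aligned dataset. It is not the J48 implementation; the sequences and species labels are made up, and only the choice of the first split is shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(sequences, labels, position):
    """Entropy reduction obtained by splitting the specimens on one position."""
    groups = {}
    for seq, label in zip(sequences, labels):
        groups.setdefault(seq[position], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy aligned barcodes (hypothetical data).
sequences = ["ACGT", "ACGA", "TCGT", "TCGA"]
labels = ["sp1", "sp1", "sp2", "sp2"]

gains = [information_gain(sequences, labels, p) for p in range(4)]
best_position = max(range(4), key=gains.__getitem__)
```

Here the first position separates the two species perfectly, so it receives the highest gain and would be chosen as the root test of the tree.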
Naïve Bayes  is a Bayesian-based classifier using estimator classes. It is one of the most practical learning methods often used when a large reference set is available.
A Bayesian Network (BN) represents the joint probability distribution of a set of variables: based on the state of the observable variables and the a priori probabilities represented by the edges that relate the variables, the a posteriori probabilities of the unknown states are evaluated. In this way, a BN can be considered as a tool of investigation and forecasting. Mathematically, a BN is a directed acyclic graph whose vertices are variables or states, while the edges encode statistical dependencies between the variables, together with local probability distributions of the child vertices given the values of their parents. The absence of an edge between two vertices reflects their conditional independence. Conversely, the presence of an edge from a vertex X_i to a vertex X_j can be read as X_i being a direct cause of X_j. The critical assumption of a Naïve Bayes classifier is the conditional independence, given the class, of the attributes that describe each instance x ∈ X of the target function f: X → Y. As in the SVM method, no clear classification model is provided to the investigator, who can only perform a “blind” assignment of specimen to species.
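The conditional independence assumption means that the score of a species is the product of its prior and of one factor per sequence position. A minimal sketch follows, assuming aligned sequences over the alphabet A, C, G, T and add-one (Laplace) smoothing. The smoothing choice and the toy data are assumptions made for illustration and are not taken from the Weka implementation.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def train_naive_bayes(sequences, species):
    """Count, per species, how often each nucleotide occurs at each position."""
    class_counts = Counter(species)
    site_counts = defaultdict(Counter)  # (species, position) -> nucleotide counts
    for seq, sp in zip(sequences, species):
        for pos, nuc in enumerate(seq):
            site_counts[(sp, pos)][nuc] += 1
    return class_counts, site_counts

def classify(seq, class_counts, site_counts):
    """Return the species with the highest log-posterior score."""
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for sp, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for pos, nuc in enumerate(seq):
            # One independent factor per position, with add-one smoothing.
            score += math.log((site_counts[(sp, pos)][nuc] + 1) / (n + len(ALPHABET)))
        if score > best_score:
            best, best_score = sp, score
    return best

# Toy reference set (hypothetical data).
reference = ["ACGT", "ACGA", "TCGT", "TTGT"]
labels = ["sp1", "sp1", "sp2", "sp2"]
model = train_naive_bayes(reference, labels)
prediction = classify("ACGT", *model)
```

The model consists only of count tables, which is why no compact, human readable rule is obtained from it.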
Input, sequences conversion and output
Note that for supervised machine learning methods, the sequences have to be of the same region or pre-aligned to the same region before being processed (e.g., sub-segments of COI or rbcL coding genes).
The software converts the FASTA format into the Weka ARFF format. An ARFF file is composed of two parts. The first part includes the name of the dataset (a line starting with “@relation”), one heading line (starting with “@attribute”) for each attribute (i.e., sequence position), which specifies the attribute type (e.g., numeric for a number, or categorical for a string of characters), and finally the complete list of the species enclosed in curly brackets. The second part (starting with “@data”) comprises one line for each specimen, storing its attribute values separated by commas.
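The conversion can be sketched as follows. This is not the released converter: it assumes a hypothetical FASTA header of the form ">specimen_id|species name", names the attributes pos1, pos2, and so on, calls the class attribute "species", and replaces spaces in species names with underscores. All of these are illustrative choices.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def fasta_to_arff(text, relation="barcodes"):
    """Build ARFF text with one nominal attribute per aligned position."""
    records = parse_fasta(text)
    # Assumption: the species name follows the first "|" in the header.
    species = [h.split("|", 1)[1].strip().replace(" ", "_") for h, _ in records]
    seqs = [s for _, s in records]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    symbols = sorted(set("".join(seqs)))
    lines = [f"@relation {relation}"]
    for i in range(1, length + 1):
        lines.append(f"@attribute pos{i} {{{','.join(symbols)}}}")
    lines.append(f"@attribute species {{{','.join(sorted(set(species)))}}}")
    lines.append("@data")
    for seq, sp in zip(seqs, species):
        lines.append(",".join(seq) + "," + sp)
    return "\n".join(lines) + "\n"

example = ">s1|Ompok bimaculatus\nACGT\n>s2|Ompok pabda\nAC\nGA\n"
arff = fasta_to_arff(example)
```

The length check reflects the requirement stated above that all sequences be aligned to the same region before processing.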
Weka supervised machine learning outputs are the classification accuracy rates of query and reference sequences, the classification models, e.g., decision trees, logic rules, etc., and the specimens to species assignments. Additional outputs can be obtained by setting specific Weka flags, see  and the user manual for further details.
Among the ad-hoc DNA Barcodes classification tools, a supervised machine learning method is called BLOG (Barcoding with LOGic) . It is a character-based method whose aim is to classify specimens to species using classification rules that compactly characterize species in terms of DNA Barcode locations of key diagnostic nucleotides. BLOG computes for each species in the reference set the distinctive nucleotide positions of the DNA Barcode sequences and the logic classification rules in the form of “if-then” that are able to characterize a species in a compact way. The classification rules can then be applied to a query set. An example of classification rule is “if pos40 = T and pos265 = T then the specimen is classified as Ompok bimaculatus”. For further details on BLOG the reader may refer to [7, 27, 28].
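Applying such “if-then” rules to a query sequence is straightforward. The sketch below is not BLOG code: it represents the example rule as a mapping from positions to nucleotides, and it assumes that positions are counted from 1. The query sequences are made up.

```python
def matches(sequence, conditions):
    """True if the sequence carries every required nucleotide (1-based positions)."""
    return all(len(sequence) >= pos and sequence[pos - 1] == nuc
               for pos, nuc in conditions.items())

def classify_with_rules(sequence, rules):
    """Return the species of the first matching rule, or None if no rule fires."""
    for species, conditions in rules:
        if matches(sequence, conditions):
            return species
    return None

# The rule quoted in the text: pos40 = T and pos265 = T.
rules = [("Ompok bimaculatus", {40: "T", 265: "T"})]

# Hypothetical query: all A except for T at positions 40 and 265.
bases = ["A"] * 300
bases[39] = "T"
bases[264] = "T"
query = "".join(bases)

assigned = classify_with_rules(query, rules)
unassigned = classify_with_rules("A" * 300, rules)
```

A query that satisfies no rule is left unassigned (None) in this sketch; how a real tool handles that case is outside what is shown here.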
Limits of supervised methods
The following limits are identified when using supervised methods for species classification with DNA Barcode sequences:
a full reference set of specimens and species is necessary; at least 4 specimens per species are suggested for building a reference library, and the sequences of each species should include, where possible, all the nucleotide polymorphisms (variations); the more specimens are available, the more accurate the classification models and, consequently, the results;
when an adequate reference library is not used, under-fitting or over-fitting may occur (under-fitting when the reference library contains too few specimens per species; over-fitting when too many sequences of one or more species are present in the library and sampling is poor, i.e., specimens are not equally distributed among species);
scalability of the algorithms is not guaranteed when dealing with thousands of species and millions of specimens; this problem may be mitigated by sampling, i.e., selecting only representative sequences for each species;
no support is provided for multi-locus DNA Barcode sequences.
Results and discussion
The classification comparative analysis is performed using a selection of published empirical datasets and synthetic DNA Barcode datasets taken from [7, 8, 27] and available for download at dmb.iasi.cnr.it/supbarcodes.php.
Public empirical datasets (available at GenBank Nucleotide Database) have been chosen with the following properties: (i) sequences with high phylogenetic diversity; (ii) identification complexity due to the absence of large inter-specific sequence differences; and (iii) selection of different genomic compartments in the sequences.
The eight selected empirical datasets, summarized in Table 3, are the following.
Cypraeidae: Cypraeidae (Mollusca) are taxonomically one of the most extensively studied marine gastropods. The dataset comprises 2,008 DNA Barcode sequences with a length of 618 bases and from 211 species, where 112 species are represented by 4 or more sequences.
Drosophila: Drosophila is a thoroughly studied dataset characterized by a high within-species divergence. The dataset is composed of 615 DNA Barcode sequences from 19 species; the sequence length is 663 bases and 15 species are represented by more than five sequences.
Inga: Inga (Fabaceae) is a large genus of tropical leguminous trees. Many Inga species collected in the southwestern Amazon are sorted in an incomplete DNA Barcode tree. The dataset is made up of 913 DNA Barcodes of length 1,838. These sequences come from 56 species, 35 of which are represented by more than five sequences.
Bats: The Bats dataset is composed of 826 barcode sequences from specimens belonging to 82 different species. The sequences are taken from BOLD (Barcode Of Life Database)  and come from the Kingdom Animalia, the Phylum Chordata, the Class Mammalia, the Infraclass Eutheria, the Superorder Laurasiatheria and the Order Chiroptera.
Fishes: The Fishes dataset is composed of 626 recent barcode sequences from specimens belonging to 82 different species. The Barcode sequences are obtained from GenBank Nucleotide Database and mainly taken from the Kingdom Animalia, the Phylum Chordata belonging to the commonly known paraphyletic group of the fishes.
Birds: The Birds dataset is composed of 1,700 Barcode sequences from individuals that belong to 150 different species. Each fragment contains between 648 and 690 nucleotides. It was provided by the CBOL in the 2007 Conference (http://dimacs.rutgers.edu/Workshops/BarcodeResearchChallenges2007).
Summary of the empirical datasets
The sequences of the selected empirical datasets are divided into a reference set (80% per species), including the sequences with a priori assigned species membership, and a query set (20% per species), whose DNA Barcode sequences also carry an a priori assigned species label that is hidden from the algorithm and used only to evaluate the classification success. The synthetic DNA Barcode sequences are likewise divided into a reference dataset and a query dataset, which include 16 and 4 sequences per species, respectively. It is worth noting that, since the query dataset is simulated together with the reference dataset, its species memberships are also known, allowing an a posteriori evaluation of the identification accuracy.
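A per-species 80%/20% division can be sketched as below. The splits used in this work follow previously published data splits, so the random shuffling, the seed and the function name here are assumptions made purely for illustration.

```python
import random
from collections import defaultdict

def split_per_species(specimens, reference_fraction=0.8, seed=0):
    """Split (sequence, species) pairs into reference and query sets, species by species."""
    by_species = defaultdict(list)
    for seq, sp in specimens:
        by_species[sp].append((seq, sp))
    rng = random.Random(seed)
    reference, query = [], []
    for sp in sorted(by_species):
        group = by_species[sp]
        rng.shuffle(group)
        cut = round(len(group) * reference_fraction)
        reference.extend(group[:cut])
        query.extend(group[cut:])
    return reference, query

# Toy input: 3 species with 20 placeholder specimens each (hypothetical data).
specimens = [(f"seq{i}", f"sp{i % 3}") for i in range(60)]
reference, query = split_per_species(specimens)
```

With 20 specimens per species this yields 16 reference and 4 query sequences per species, the same proportions reported for the synthetic datasets.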
The samplings, i.e., the divisions of reference and query set, are performed according to the same data splits present in previous works [7, 8, 27] for allowing a comparison of the classification results. These data splits were performed by biologists in , following specific sequence compositions (e.g., polymorphism) and challenges (e.g., low species divergences, not equal-distributed specimen for each species, and high intra-species variability). Moreover, when possible each dataset is composed of species with 5 or more representing sequences in the reference library.
A typical experimentation procedure is described in this section. Moreover, a comprehensive tutorial that guides users through the download and set up of the software package and the execution of the experiments on their own datasets is provided as Additional file 1.
the sequences are acquired from dmb.iasi.cnr.it/supbarcodes.php;
each dataset (reference and query) is converted into the Weka ARFF format with the special converter described previously in the Input, sequences conversion and output section;
the supervised machine learning algorithms C4.5, Naïve Bayes, RIPPER, and SVM are run in Weka;
the specimen to species classification accuracies and the classification models are evaluated.
the sequences are acquired from dmb.iasi.cnr.it/supbarcodes.php;
each dataset (reference and query) is converted into the Weka ARFF format with the special converter described previously in the Input, sequences conversion and output section;
the supervised machine learning algorithms C4.5, Naïve Bayes, RIPPER, and SVM are run in Weka 100 times on different reference – query splits; special scripts for performing a batch classification analysis in Weka have been implemented and are available upon request;
the specimen to species classification accuracies and the classification models are evaluated;
the average classification accuracies of the 100 runs are computed.
Moreover, the Multi-Layer Perceptron method has been tested; however, it required a very long running time and did not provide the requested output even after hours of computation. Therefore, its results have not been considered in the comparison.
To evaluate the performances of the algorithms, accuracy and standard deviation, both weighted by the number of samples for each dataset, are considered. In addition, as statistical test of differences among algorithms, the pairwise Wilcoxon signed rank test based on paired observations  has been performed.
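One way to compute a sample-weighted accuracy and standard deviation is sketched below. The exact weighting formula is not spelled out in the text, so the definitions used here (weights proportional to the number of samples, population-style weighted variance) are an assumption, and the numbers are made up.

```python
import math

def weighted_mean_and_std(accuracies, sizes):
    """Mean and standard deviation of accuracies, weighted by dataset size."""
    total = sum(sizes)
    mean = sum(a * n for a, n in zip(accuracies, sizes)) / total
    variance = sum(n * (a - mean) ** 2 for a, n in zip(accuracies, sizes)) / total
    return mean, math.sqrt(variance)

# Hypothetical accuracies [%] of one method on two datasets of 100 and 300 samples.
mean, std = weighted_mean_and_std([90.0, 98.0], [100, 300])
```

Weighting by size makes larger datasets count more, so the mean in this toy case lies closer to the accuracy of the 300-sample dataset.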
The supervised classification algorithms are tested using both the standard configuration and a comprehensive parameter tuning (see the following Comparative Analysis subsection for the obtained results). Specifically, the standard parameters for each analyzed method are listed in Additional file 2: Table S1.
Empirical sequences: classification analysis and results
Eight empirical DNA Barcode sequence datasets have been analyzed for classification according to the steps described in the previous section.
Accuracies for the empirical datasets [%]
The detailed results of the supervised machine learning tested methods are shown for the eight empirical datasets and the performances on query set and reference set for each selected empirical dataset are drawn in Additional file 2: Figures S1-S8. Each figure depicts results on empirical data through histograms that provide the accuracy rate for all analyzed methods on the query set (panel (a) of each picture) and on the reference set (panel (b) of each picture).
Synthetic sequences: classification analysis and results
Three synthetic DNA Barcode sequence datasets have been analyzed for the classification according to the steps described in section Experimental settings.
Accuracies for the synthetic datasets [%]
The results on the synthetic data are largely consistent with the results on the empirical ones: SVM and Naïve Bayes outperform the other methods. The statistical significance (p-value ≤ 0.001) is assessed by performing the pairwise Wilcoxon test between SVM (or Naïve Bayes) and each of the other algorithms, with a Bonferroni correction to account for the high number of comparisons. In this case, the performance difference between SVM and Naïve Bayes is also statistically significant (p-value ≤ 0.001).
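The Bonferroni correction multiplies each p-value by the number of comparisons before it is compared with the significance threshold. A minimal sketch follows. The p-values are made up, and applying the 0.001 threshold to the adjusted values is an assumption about how the reported threshold was used.

```python
def bonferroni(p_values, alpha=0.001):
    """Return (adjusted p-value, significant?) for each raw p-value."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj <= alpha) for p_adj in adjusted]

# Hypothetical raw p-values from three pairwise tests.
results = bonferroni([0.0001, 0.0004, 0.02])
significant = [flag for _, flag in results]
```

In this toy case only the first comparison remains significant after correction, although the second raw p-value is also below 0.001.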
The detailed performances are reported in Additional file 2: Figure S9, S10 and S11. Each figure depicts results on synthetic data through histograms and bar-plots, in order to highlight the averaged performances (panels (b) and (d) of each picture) together with the standard deviation (panels (a) and (c) of each picture).
A comparative evaluation of the classification results is performed (i) using several machine learning algorithms from the collection of Weka classifiers; (ii) using these algorithms with different parameter configurations; and (iii) comparing the results with ad-hoc and well-established DNA Barcode classification techniques, such as phylogenetic trees (NJ, PAR), similarity-based (BLAST), and character-based (DNA-BAR, BLOG) methods. The results are compared by evaluating accuracy and standard deviation, both weighted by the number of samples for each dataset.
Supervised machine learning algorithm comparisons
The different Weka supervised machine learning algorithms are run on empirical and synthetic data according to the steps previously described in section Experimental setting.
Default versus different parameter configurations of Weka classifiers
Different parameter settings of the supervised machine learning algorithms in Weka have been tested on empirical data according to the steps described in section Experimental settings. The standard classification performances of machine learning methods on three selected empirical datasets (i.e., Cypraeidae, Drosophila and Inga) are compared with respect to the ones obtained using other parameter configurations (listed in Additional file 2: Table S2, S3, S4 for Cypraeidae, Drosophila and Inga, respectively). The results of the comparative analysis for the three empirical datasets are shown in Additional file 2: Figure S12-S14. No relevant differences among the analyzed configurations appear, except for the configuration of Drosophila and Inga when SVM uses a Logistic Model. Only three datasets are taken as representative samples and analyzed using different parameters, as the classification results do not substantially change when performing parameters tuning.
Weka algorithms versus DNA Barcodes ad-hoc classification methods
In this experimentation the empirical and synthetic datasets (Cypraeidae, Drosophila, and Inga) have been analyzed with Weka supervised machine learning algorithms according to the steps described in section Experimental settings and their accuracy has been compared to previous results presented in .
In summary, on synthetic data the supervised machine learning methods outperform the ad-hoc DNA Barcode classification methods (Figure 6), although not all of the differences are statistically significant according to the Wilcoxon test. On empirical data the classification performances are comparable to those of the ad-hoc methods (Figure 5). The empirical datasets taken into account for this comparison are only the Cypraeidae, Drosophila, and Inga sequences, as tested in previous studies. It is not surprising that ad-hoc DNA Barcode classification methods have slightly weaker performances on synthetic data, as the sequences are generated to challenge these methods.
This paper provides a comprehensive approach to the problem of assigning an unknown specimen to a known species by analyzing its DNA Barcode. Such a task was addressed using supervised classification algorithms implemented by the software tool Weka. In particular, specific classifiers like the function-based method Support Vector Machines (SVM), the rule-based RIPPER (Jrip), the decision tree C4.5, and the Bayesian-based method Naïve Bayes were tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. Additionally, an integrated tool that converts the DNA Barcode FASTA sequences to the Weka format was developed in order to adapt different input formats and hence to allow the experiments execution.
Furthermore, the classification results were compared with ad-hoc and well-established DNA Barcode classification techniques, such as phylogenetic trees (NJ, PAR), similarity-based (BLAST), and character-based (DNA-BAR, BLOG) methods. The classification analysis shows that supervised machine learning methods are promising candidates for successfully handling the DNA Barcode species classification problem, obtaining excellent classification performances. On empirical data the classification performances were comparable to the traditional DNA Barcode classification methods, while on synthetic data higher classification performances were obtained. The results presented in this paper and those available in previous literature support the validity of applying supervised learning methods to species classification with DNA Barcodes, across different methods and different dataset types. Finally, a powerful tool and pipeline to perform species classification are provided to the DNA Barcoding community.
An extension of the supervised classification procedure is planned as future work, where the issue of specimen to species assignments with multi-locus DNA Barcode sequences will be analyzed and addressed.
The authors gratefully thank Paola Bertolazzi and Paolo Atzeni for permitting this work, Robin Van Velzen for generating the synthetic sequences in previous studies, and the organizing committee of the 5th international Barcode of Life conference in Kunming (Yunnan, China).
This work is partially supported by the FLAGSHIP “InterOmics” (PB.P05), the “Epigen” project funded by the Italian MIUR and CNR institutions and by the GenData 2020 PRIN project.
- Hebert PDN, Cywinska A, Ball SL, DeWaard J: Biological identifications through DNA barcodes. Proc R Soc B. 2003, 270: 313-321. 10.1098/rspb.2002.2218.View ArticlePubMedPubMed CentralGoogle Scholar
- Hebert PDN, Ratnasingham S, de Waard J: Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proc R Soc B. 2003, 270 (Suppl 1): S96-S99.View ArticlePubMedPubMed CentralGoogle Scholar
- CBOL Plant Working Group: A DNA barcode for land plants. Proc Natl Acad Sci U S A. 2009, 106 (31): 12794-12797.View ArticlePubMed CentralGoogle Scholar
- Schoch CL, Seifert KA, Huhndorf S, Robert V, Spouge JL, Levesque CA, Chen W, Fungal Barcoding Consortium: Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. Proc Natl Acad Sci USA. 2012, 109 (16): 6241-6246. 10.1073/pnas.1117018109.View ArticlePubMedPubMed CentralGoogle Scholar
- Hebert PDN, Gregory T: The promise of DNA barcoding for taxonomy. Syst Biol. 2005, 54: 852-859. 10.1080/10635150500354886.View ArticlePubMedGoogle Scholar
- Schindel D, Miller S: DNA barcoding a useful tool for taxonomists. Nature. 2005, 435: 17-17.View ArticlePubMedGoogle Scholar
- Weitschek E, van Velzen R, Felici G, Bertolazzi P: BLOG 2.0: a software system for character‒based species classification with DNA Barcode sequences: what it does, how to use it. Mol Ecol Resour. 2013, 13 (6): 1043-1046.PubMedGoogle Scholar
- Van Velzen R, Weitschek E, Felici G, Bakker FT: DNA Barcoding of recently diverged species: relative performance of matching methods. PLoS One. 2012, 7 (1): e30490-10.1371/journal.pone.0030490.View ArticlePubMedPubMed CentralGoogle Scholar
- Farris JS: Estimating phylogenetic trees from distance matrices. Am Nat. 1972, 106 (951): 645-668. 10.1086/282802.View ArticleGoogle Scholar
- Saitou N, Nei M: The neighbour-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406-425.PubMedGoogle Scholar
- Munch K, Boomsma W, Huelsenbeck JP, Willerslev E, Nielsen R: Statistical assignment of DNA sequences using Bayesian phylogenetics. Syst Biol. 2008, 57 (5): 750-757. 10.1080/10635150802422316.View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.View ArticlePubMedPubMed CentralGoogle Scholar
- Austerlitz F, David O, Schaeffer B, Bleakley K, Olteanu M, Leblois R, Veuille M, Laredo C: DNA barcode analysis: a comparison of phylogenetic and statistical classification methods. BMC Bioinforma. 2009, 14 (Suppl 10): S10View ArticleGoogle Scholar
- Meier R, Kwong S, Vaidya G, Ng PKL: DNA barcoding and taxonomy in Diptera: a tale of high intraspecific variability and low identification success. Syst Biol. 2006, 55: 715-728. 10.1080/10635150600969864.
- DasGupta B, Konwar KM, Măndoiu II, Shvartsman AA: DNA-BAR: distinguisher selection for DNA barcoding. Bioinformatics. 2005, 21 (16): 3424-3426. 10.1093/bioinformatics/bti547.
- Sarkar IN, Planet PJ, DeSalle R: CAOS software for use in character-based DNA barcoding. Mol Ecol Resour. 2008, 8 (6): 1256-1259. 10.1111/j.1755-0998.2008.02235.x.
- Little DP: DNA barcode sequence identification incorporating taxonomic hierarchy and within taxon variability. PLoS One. 2011, 6 (8): e20552. 10.1371/journal.pone.0020552.
- Little DP: BRONX2: Barcode Recognition Obtained with Nucleotide eXposés 2.0. 2012, Program distributed by the author, http://www.nybg.org/files/scientists/dlittle/BRONX2.html.
- Liu C, Liang D, Gao T, Pang X, Song J, Yao H, Chen S: PTIGS-IdIt, a system for species identification by DNA sequences of the psbA-trnH intergenic spacer region. BMC Bioinforma. 2011, 12 (Suppl 13): S4. 10.1186/1471-2105-12-S13-S4.
- Albu M, Nikbakht H, Hajibabaei M, Hickey DA: The DNA barcode linker. Mol Ecol Resour. 2011, 11: 84-88. 10.1111/j.1755-0998.2010.02901.x.
- Kuksa P, Pavlovic V: Efficient alignment-free DNA barcode analytics. BMC Bioinforma. 2009, 10 (Suppl 14): S9. 10.1186/1471-2105-10-S14-S9.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH: The WEKA data mining software: an update. SIGKDD Explorations. 2009, 11 (1): 10-18. 10.1145/1656274.1656278.
- Platt JC: Fast Training of Support Vector Machines using Sequential Minimal Optimization. Advances in Kernel Methods - Support Vector Learning. Edited by: Schölkopf B, Burges C, Platt JC, Smola AJ. 1998, Cambridge MA: MIT Press, 185-208.
- Cohen WW: Fast effective rule induction. Twelfth International Conference on Machine Learning (ICML). 1995, 95: 115-123.
- Quinlan R: C4.5: Programs for Machine Learning. 1993, San Mateo CA: Morgan Kaufmann.
- John GH, Langley P: Estimating Continuous Distributions in Bayesian Classifiers. Eleventh Conference on Uncertainty in Artificial Intelligence. 1995, San Mateo, CA: Morgan Kaufmann, 338-345.
- Bertolazzi P, Felici G, Weitschek E: Learning to classify species with barcodes. BMC Bioinforma. 2009, 10 (Suppl 14): S7. 10.1186/1471-2105-10-S14-S7.
- Felici G, Truemper K: A MINSAT approach for learning in logic domains. Informs J Comput. 2002, 14: 20-36. 10.1287/ijoc.22.214.171.12409.
- Meyer CP, Paulay G: DNA barcoding: Error rates based on comprehensive sampling. PLoS Biol. 2005, 3 (12): 2229-2238.
- Lou M, Golding GB: Assigning sequences to species in the absence of large interspecific differences. Mol Phylogenet Evol. 2010, 56: 187-194. 10.1016/j.ympev.2010.01.002.
- Dexter KG, Pennington TD, Cunningham CW: Using DNA to assess errors in tropical tree identifications: how often are ecologists wrong and when does it matter?. Ecol Monogr. 2010, 80: 267-286. 10.1890/09-0267.1.
- Ratnasingham S, Hebert PDN: BOLD: the barcode of life data system. Mol Ecol Notes. 2007, 7: 355-364. 10.1111/j.1471-8286.2007.01678.x.
- Hebert PDN, Stoeckle MY, Zemlak TS, Francis CM: Identification of birds through COI DNA barcodes. PLoS Biol. 2004, 2: 1-7. 10.1371/journal.pbio.0020001.
- Bishop CM: Neural Networks for Pattern Recognition. 1995, Walton Street, Oxford: Oxford University Press.
- Wilcoxon F: Probability tables for individual comparisons by ranking methods. Biometrics. 1947, 3 (3): 119-122. 10.2307/3001946.
- Bonferroni C: Studi in Onore del Professore Salvatore Ortu Carboni. Il calcolo delle assicurazioni su gruppi di teste. 1935, Rome: Tipografi del Senato, 13-60.
- Lehr T, Yuan J, Zeumer D, Jayadev S, Ritchie MD: Rule-based classifier for the analysis of gene-gene and gene-environment interactions in genetic association studies. BioData Mining. 2010, 4 (1): 4.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.