This article has Open Peer Review reports available.
LAF: Logic Alignment Free and its application to bacterial genomes classification
 Emanuel Weitschek^{1, 3}Email author,
 Fabio Cunial^{2} and
 Giovanni Felici^{3}
Received: 30 March 2015
Accepted: 30 November 2015
Published: 8 December 2015
Abstract
Alignmentfree algorithms can be used to estimate the similarity of biological sequences and hence are often applied to the phylogenetic reconstruction of genomes. Most of these algorithms rely on comparing the frequency of all the distinct substrings of fixed length (kmers) that occur in the analyzed sequences.
In this paper, we present Logic Alignment Free (LAF), a method that combines alignmentfree techniques and rulebased classification algorithms in order to assign biological samples to their taxa. This method searches for a minimal subset of kmers whose relative frequencies are used to build classification models as disjunctivenormalform logic formulas (ifthen rules).
We apply LAF successfully to the classification of bacterial genomes to their corresponding taxonomy. In particular, we succeed in obtaining reliable classification at different taxonomic levels by extracting a handful of rules, each one based on the frequency of just few kmers.
State of the art methods to adjust the frequency of kmers to the character distribution of the underlying genomes have negligible impact on classification performance, suggesting that the signal of each class is strong and that LAF is effective in identifying it.
Keywords
Background
The field of biological sequence analysis relies on mathematical, statistical, and computer science methods for discovering similarities among different organisms, understanding their features and their structure, detecting ancestry, relatedness, evolution, and common functions.
Several wellestablished sequence comparison algorithms are based on sequence alignment: they compute sequence similarity by aligning portions of sequences (e.g., subsequences) that have common nucleotide assignments. The alignments of two or more sequences are scored according to the number of common nucleotides. Such methods can be exact or heuristic. Among exact methods, SmithWaterman [1] and NeedlemanWunsch [2] use dynamic programming techniques. The first performs local sequence alignment: it detects the common regions between two sequences by comparing segments of all possible lengths. The second is a global alignment algorithm, designed to align entire sequences. In order to reduce the computational burden of exact methods, several heuristic algorithms have been designed, the most renowned being FASTA [3] and BLAST [4]. For the comparisons of more than two sequences, there are adhoc algorithms like Muscle [5], ClustalW [6], Motalign [7], and Mafft [8]. Alignmentbased sequence analysis algorithms have a very high computational cost, especially when applied to a large set of sequences [9]. Other problems may also be encountered when performing alignment on genome sequences, related with the presence of noncoding subsequences, or simply with the computational burden associated with the alignment of whole genomes [10].
In order to address these issues, alignmentfree sequence analysis methods can be considered. Such algorithms are mainly classified in two groups: methods based on sequence compression and methods that rely on the frequencies of the subsequences (oligomers) [9].
The first class of methods compute a model that succinctly describes the sequence, and assess the similarity of the sequences by analyzing their compressed representations, e.g., Kolomogorov complexity [11] or Universal Sequence Maps [12].
In this work we focus on the second class of methods, alignmentfree algorithms that rely on oligomer frequencies and map two strings X and Y onto corresponding multidimensional vectors X and Y; these vectors are indexed by a number of substrings in the given alphabet (a typical case is when all possible substrings of a predefined length k are used). X[W] and Y[W] – the element of X and Y associated with substring W – contain the number of occurrences of W in X and Y respectively. Often the number of occurrences is normalized and converted into a measure of statistical surprise using the length and distribution of characters in each string. Standard distance functions on vectors are then applied to X and Y, allowing the original strings to be compared by classical distancebased algorithms.
Alignmentfree algorithms are currently the most scalable class of methods for reconstructing phylogenetic trees from thousands of large, distantlyrelated genomes and proteomes [13, 14].
The success of alignmentfree methods rests on extensive information on the substring composition of genomes and on codonusage biases, cumulated over approximately fifty years, with particular emphasis on prokaryotes: from the first studies of GC content [15], to the first detection of biases in the composition of pairs and quadruples of adjacent nucleotides [15–21], to the discovery of speciesspecific frequencies of 4mers and 8mers preserved in DNA fragments ranging from 40 kilobases to 400 bases [22–26], to more recent, unsupervised classifications [27–29] and more complex protein motifs [30].
Since the very beginning, most such studies have relied on some form of noise filtration, either assuming an independent and identically distributed source or a Markov source of low order (i.e., normalizing the raw frequencies using their expectation and or variance according to the specified sources). Markov chains inferred from genomes have indeed been shown to reproduce large fractions of the frequency distribution of kmers in the original genomes [23, 31, 32].
So far, classification has always relied on the frequency of allkmers [27, 33], and minimality in phylogenetic signal has been investigated with respect to the length of the strings from which kmers are extracted, rather than to the space of features used for classification. This trend continues in modern applications of kmer composition to annotating and binning metagenomic reads [34]: increasingly more sophisticated heuristics have allowed to reliably classify reads ranging from one kilobase to 75 bases, under a variety of species abundance scenarios [35–40]. However, fundamental questions on the distribution and concentration of phylogenetic signal in the space of all kmers are still open and scarcely investigated. Among the few attempts in this direction, we mention the use of singular value decomposition (SVD [41, 42]) and of irredundant shared substrings [43] in phylogeny reconstruction, the use of few selected kmers in barcoding genes [44], and early attempts at classifying protein families using the frequency of a small set of dipeptides [45].
In this paper, we experiment with four rulebased algorithms [51] that extract classification models in the form of logic formulas and we compare them with other stateoftheart classifiers, such as Support Vector Machines [58, 69] and Nearest Neighbor [70]. Surprisingly, it turns out that we can reliably classify genomes at multiple taxonomic levels using a limited number of formulas, each involving few, short kmers. Moreover, standard noise filtration methods have minimum impact on classification performance, suggesting that noise is automatically dampened by the formulaextraction algorithms.
Methods
In this section, we present the Logic Alignment Free (LAF) technique and software package. The aim of LAF is to classify biological sequences and assign them to their taxonomic unit with the aid of a supervised machine learning paradigm [51] (see subsection Supervised machine learning and rulebased classification algorithms for more details). LAF uses a feature vector representation of the biological sequences, and gives them as input to rulebased classification algorithms (for a detailed analysis of rulebased classification methods, see [52]).
In [53], LAF has been already successfully applied to the classification of selectively constrained DNA elements, which are not alignable and do not come from the same gene regions.
Conversely, here we present the method in detail, provide the scripts and the software, and describe its application to bacterial genomes. In the following subsections, we illustrate the feature vector representation technique, the rulebased classification algorithms, and their integration in the LAF framework.
Representing the sequences as feature vectors with alignmentfree methods
The most widespread alignmentfree methods compute the frequencies of the substrings in the biological sequences, called kmers (where k is the length of the substring). For each sequence, the substring frequencies are then represented in a vector, called frequency vector [12, 54–57]. Each element of this vector expresses the frequency of a given kmer, computed by scanning a sliding window of length k over the sequence.
More formally [9], let S be a sequence of n characters over an alphabet Σ, e.g. Σ={A,C,G,T}, and let k∈ [ 1…n]. If K is a generic substring of S of length k, K is called a kmer. Let the set V={K_{1},K_{2},…,K_{ m }} be all possible kmers over Σ, and define m=Σ^{ k } to be the size of set V. The kmers are computed by counting the occurrences of the substrings in S with a sliding window of length k over S, starting at position 1 and ending at position n−k+1. A vector F contains for each kmer the corresponding counts F=c_{1},c_{2},…,c_{ m }. The frequencies are then computed accordingly and stored in a vector F^{′}=f_{1},f_{2},…,f_{ m }; for a kmer K_{ i }, the frequency is defined as \(f_{i}=\frac {c_{i}}{nk+1}\).
These numerical representations of the sequences allow the use of statistical and mathematical techniques; indeed, the most used approach for sequence comparisons in alignmentfree vector representations are distance measures, such as the Euclidean distance and the d2 distance [9]. While the authors of [56] use feature vector representation in combination with supervised machine learning methods, specifically Support Vector Machines [69] for biological and text sequences, here we propose to analyze the frequency vectors with rulebased supervised machine learning algorithms. The effectiveness of this technique is investigated and tested on bacterial sequences.
Supervised machine learning and rulebased classification algorithms
The aim of this step is to classify the biological sequences into their taxonomic unit. Once the sequences are represented in a vector space, it is possible to analyze them by adopting a supervised machine learning approach, sketched in the following.
Given a set B of biological sequences, each assigned to a taxon (training set), a classifier is trained with these sequences in order to compute a classification model that predicts the taxon of each sequence from the values of its vector space representation. An additional set of sequences with known taxa is used to evaluate whether the model computed on the training set is able to predict correctly the taxa (the latter is called test set). For assessing the performance of the classifier we adopt the accuracy measure (A), also called correct rate \(A=\frac {c}{t}\), where c is the number of correct classified sequences in the test set and t is the number of total sequences in the test set.
We focus on a particular type of classification methods  rulebased classifiers  which express the classification model in propositional logic form (e.g., ifthen rules). Rulebased classifiers have the main advantage of being able to control their dimension (in this case, the number of kmers used), they are easily interpretable, and can straightforwardly be integrated with other contextual knowledge. Several rulebased classification methods are proposed in the literature; in LAF we adopt the following ones: Data Mining Big (DMB) [59, 60], RIDOR [61], PART [62], and RIPPER [63]. All these methods use distinct rule extraction approaches, but – as we will see later – perform very well on the analyzed data sets of bacterial sequences. We report a brief description of these methods in the following.
 1.
discretization: conversion of numeric attributes into nominal (discrete);
 2.
discrete cluster analysis: samples that are similar in the discretized space are clustered and dimensionreduced accordingly;
 3.
feature selection: the most relevant attributes for classification purpose are selected;
 4.
rule extraction: small and effective rules are extracted from training data and verified on test data;
 5.
classification: the extracted rules are used to classify new samples.
RIDOR[61] performs rule extraction directly from the training data set. The first step is the computation of a default rule for the most frequent class (e.g., “all sequences are E. coli”). Then, it computes exception rules that represent the other classes (e.g., “except if freq(ACGT)<0.45 then the sequences are S. aureus”).
PART [62] performs rules extraction with an indirect method. It uses the C4.5 decision tree based classification algorithm [66], which computes a pruned decision tree for a given number of iterations. The best performing tree in terms of classification performances is chosen by PART and converted to rules for every species.
RIPPER [63] is a direct rule extraction method based on a pruning procedure, whose aim is to minimize the error on the training set; it performs the following steps: i) growth of the rules; ii) pruning of the rules; iii) optimization of the model; iv) selection of the model. In the first step, thanks to a greedy procedure, RIPPER extracts many classification rules. Then, the rules are simplified and optimized in step two and three, respectively. Finally, the best model (i.e., set of rules) is selected.
Logic Alignment Free (LAF) method
Rulebased classifiers have been successfully used in the analysis of aligned sequences, e.g., in [59] and [60], where the classification of biological sequences to their species is performed by considering only sequences from the same gene region. In this case the ruleextraction procedure identifies exact gene regions and nucleotide assignments that are specific to a species; an example of such a rule could be ‘’if pos354 = T of gene 16S then the sequence belongs to E. Coli”.

The genome g is reverse complemented, the kmers with k∈ [ 3…6] are counted and stored in a frequency vector F^{′};

A matrix that contains all frequency vectors is created; the rows of the matrix are associated to the kmers and the columns to the sequences (an example is given in Table 1);Table 1
Example of frequencies vectors matrix extracted by LAF and provided as input to rulebased classifiers
S e q _{1}
S e q _{2}
…
S e q _{n−1}
S e q _{ n }
E. Coli
E. Coli
…
S. Aureus
S. Aureus
AAA
0.46
0.26
…
0.24
0.26
AAC
0.12
0.16
…
0.23
0.24
AAG
0.13
0.23
…
0.23
0.22
…
…
…
…
…
…

The frequencies are discretized with the MDL procedure [67] before applying RIDOR, PART and RIPPER, while DMB provides its own builtin discretization method;

A set of four rulebased classifiers (e.g., DMB, RIDOR, PART and RIPPER) take the matrix as input and extract the classification models and specimen to taxonomic unit assignments;

The above is repeated for different combinations of training / test sets.
Data sets of bacterial genomes
In order to prove the validity of the LAF technique, we chose to test the method for the classification of biological sequences belonging to the bactria domain. We downloaded 1964 bacterial genomes from the NCBI genomes database (www.ncbi.nlm.nih.gov/genome/browse/). For every downloaded sequence, we query the NCBI taxonomy service (scripts are available at dmb.iasi.cnr.it/laf.php) to retrieve the full lineage, i.e., Species, Genus, Order, Class, Phylum. In order to perform an effective classification, we do not take into consideration underrepresented species and therefore we filter out sequences with less than nine specimens. This step is necessary to perform a proper training of the classifiers. The final filtered data set is composed of 413 sequences with 25 species, 21 genera, 14 orders, 9 classes, and 6 phyla. Additionally, we also report the performances on the original data set (1964 bacterial genomes, 1157 species, 590 genera, 120 orders, 57 classes, and 36 phyla).
Results and discussion
We apply LAF to the previously described filtered data set of bacterial genomes, setting k∈ [ 3…6] and using the four already mentioned rulebased classification algorithms by adopting a 10fold cross validation sampling scheme. We show also the results on the original data set composed of 1964 sequences. Additionally, we compare the results of LAF with respect to the Support Vector Machine (SVM) classifier [69] and with respect to a Nearest Neighbor approach [70].
Percent accuracy of the rulebased classifiers for each taxonomic unit (10fold cross validation) on the filtered data set
Level  RIPPER  RIDOR  PART  DMB  Avg ±std.dev 

Species  93.21  97.33  96.36  97.61  96.13 ±2.0 
Genus  93.98  98.79  97.10  98.44  97.08 ±2.2 
Order  98.79  99.27  98.31  98.58  98.74 ±0.4 
Class  96.50  97.81  98.79  97.06  97.79 ±0.9 
Phylum  96.88  98.78  98.07  98.53  98.06 ±0.8 
Avg ±std.dev  95.87 ±2.2  98.40 ±0.8  97.72 ±1.0  98.24 ±0.4  97.55 ±1.0 
Accuracy (ACC) [%] and computational times (T) [sec] on the order level with different values of K
Data set  Classifier  K=3  K=4  K=5  K=6  

ACC  T  ACC  T [s]  ACC  T  ACC  T  
Original  RIPPER  64.50  37.08  69.82  83.53  69.76  203.53  69.92  765.34 
Original  RIDOR  61.63  71.17  62.25  320.72  64.19  1509.75  64.75  10320.40 
Original  PART  65.37  12.67  67.05  24.58  67.77  70.13  70.02  280.23 
Original  SVM  70.69  605.55  85.37  937.32  88.59  1312.52  89.56  2020.60 
Original  NN  83.27  9.56  85.67  12.13  86.49  19.34  87.06  114.48 
Filtered  RIPPER  98.79  0.82  98.79  1.55  99.27  4.56  98.79  27.76 
Filtered  RIDOR  96.12  1.58  99.27  3.05  96.36  26.16  97.33  34.31 
Filtered  PART  97.34  0.51  98.31  1.00  97.58  2.28  97.09  23.11 
Filtered  SVM  99.56  10.62  99.87  11.58  99.65  13.10  99.68  14.71 
Filtered  NN  99.45  1.99  99.93  3.30  99.34  3.70  99.63  4.18 
Average    83.67  75.2  86.63  139.88  86.90  316.51  87.38  1360.51 
In Table 2, we report the average accuracy over all classification algorithms on the filtered data set. We note that the best results (98 % accuracy) are obtained for the phylum level – the highest in the taxonomy. Accuracy remains greater than 96 % at lower levels as well. According to the average over all taxonomic levels, RIDOR exhibits the best performance.
Moreover, we compare LAF with respect to the Support Vector Machine (SVM) classifier. We adopt the Weka implementation of SVM (called SMO) with a linear kernel and a soft margin. We obtain an accuracy of 99 % on the filtered data sets with a 10fold cross validation sampling scheme, which slightly outperforms LAF. But we remark that SVM outputs just a single classification model that cannot be easily interpreted by human experts.
Finally, we evaluate also the performances of the Nearest Neighbour (NN) classifier by using the Weka implementation of NN (called IBk) and by setting the number of neighbours to 1, the NN search algorithm to linear, and by adopting the Euclidean distance. Also in this case we obtain an accuracy of 99 % on all filtered data sets with a 10fold cross validation sampling scheme, but no human readable classification model.
A sample of classification rules at the species level extracted by the DMB software. f(W) represents the relative frequency of substring W in a genome, multiplied by 10^{5} for readability
A. baumannii  f(GTAC)≥229.10∧f(TGCA)≥515.63 
B. cereus  384.04≤f(CTCA)<490.11∧819.04≤f(TCCA)<875.80 
B. animalis  762.28≤f(TCCA)<819.04∧469.35≤f(TGCA)<515.63 
B. longum  f(GTAC)≥229.10∧330.52≤f(TGCA)<376.80 
B. aphidicola  57.77≤f(AGGC)<182.81 
C. jejuni  490.11≤f(CTCA)<596.17∧353.97≤f(CTGA)<451.85 
C. trachomatis  305.55≤f(GGAC)<393.10∧875.80≤f(TCCA)<932.56 
C. botulinum  371.77≤f(ACTC)<434.37∧112.00≤f(GCAC)<261.71 
C. diphtheriae  819.04≤f(TCCA)<875.80∧423.07≤f(TGCA)<469.35 
C. pseudotuberculosis  875.80≤f(TCCA)<932.56∧423.07≤f(TGCA)<469.35 
E. coli  710.86≤f(GCAC)<860.58∧415.84≤f(GCTA)<525.98 
F. tularensis  592.00≤f(TCCA)<648.76∧330.52≤f(TGCA)<376.80 
H. influenzae  549.73≤f(CTGA)<647.60∧130.47≤f(GGAC)<218.01 
H. pylori  5.56≤f(GTAC)<42.82 
L. monocytogenes  411.43≤f(GCAC)<561.15∧305.55≤f(GGAC)<393.10 
M. tuberculosis  649.71≤f(ATCA)<772.78 
N. meningitidis  590.29≤f(GATA)<754.27∧376.80≤f(TGCA)<423.07 
P. marinus  (f(AGGA)<602.46∨f(AGGA)≥706.28)∧f(GCTA)<856.37 
∧117.33≤f(GTAC)<154.58  
S. enterica  525.98≤f(GCTA)<636.11∧393.10≤f(GGAC)<480.64 
S. aureus  1082.23≤f(GATA)<1246.22∧f(GTAC)≥229.10 
S. pneumoniae  393.10≤f(GGAC)<480.64∧154.58≤f(GTAC)<191.84 
S. pyogenes  596.06≤f(AGTA)<733.86∧1082.23≤f(GATA)<1246.22 
S. suis  918.25≤f(GATA)<1082.23∧330.52≤f(TGCA)<376.80 
S. islandicus  218.01≤f(GGAC)<305.55∧284.24≤f(TGCA)<330.52 
Y. pestis  596.17≤f(CTCA)<702.24∧f(CTGA)≥941.24 
We observe that the same 4mer is able to distinguish 3 and 2 bacterial species with different frequency values, respectively, and that twenty 4mers suffice to separate all the 25 species. The classification rules are also very concise, since most of them are composed only by the conjuction of the conditions on two 4mers (in the logic jargon, such rules are conjunctive clauses composed of two literals). In general, the rules computed for distinct species do not seem to use disjoint, speciesspecific sets of kmers, suggesting that discrimination critically depends on the frequency of a kmer rather than on its simple presence or absence in a species. Additional considerations derive from the granularity of the adopted discretization. The method allows to specify upfront the number of intervals used to discretize the frequency values of each kmer, and then searches for an optimal discretization under this condition. From the experimental results we conclude that the number of intervals in which frequencies are discretized has minimal effects on classification quality, provided that at least 3 intervals are used (results not reported).
Percent accuracy of the classifiers for each taxonomic unit (10fold cross validation) on the original data set
Level  RIPPER  RIDOR  PART  DMB  SVM  NN  Avg ±std.dev  

Species                
Genus  54.17  47.67  50.17  48.54    73.04  45.60 ±24.2  
Order  69.82  62.25  67.05  63.78  85.37  85.68  72.32 ±10.5  
Class  75.08  69.92  71.76  72.05  88.43  89.10  77.72 ±8.7  
Phylum  75.85  70.99  56.77  71.45  85.93  86.08  74.51 ±8.2  
Avg ±std.dev  68.73 ±10.0  62.71 ±10.7  61.44 ±9.7  63.96 ±11  64.93 ±43.3  83.48 ±7.1  67.54 ±14.8 

The first type consists in excluding all highfrequency and lowcomplexity substrings [74] of a genome from its kmer counts, using the DUST software implementation provided by NCBI [78];

A second type of preprocessing consists in replacing the frequency f_{ T }(W) of a kmer W in a string T with a measure of the statistical significance of the event that W has f_{ T }(W) occurrences in T. Specifically, we assigned to a kmer W the score \(z_{T}(W) = \left (p_{T}(W)\tilde {p}_{T}(W)\right)\!/\tilde {p}_{T}(W)\), where p_{ T }(W)=f_{ T }(W)/(T−k+1), and where \(\tilde {p}_{T}(W) = p_{T}(W[\!1..k1]) \cdot p_{T}(W[\!2..k]) / p_{T}(W[\!2..k1])\) is the expected value of p_{ T }(W) under the assumption that T was generated by a Markov process of order k−2 or smaller. This score has been shown to be critical in building accurate phylogenies of distantlyrelated prokaryotes [75];

We experimented with the estimator \(\tilde {p}_{T}(W) = \left (f_{T}(W[\!1]) \cdot f_{T}(W[\!2..k]) + f_{T}(W[\!1..k1]) \cdot f_{T}(W[\!k]) \right)/2\), derived under the assumption that W[ 2..k−1], W[ 1] and W[ k] occur independently in T [76];

We also adopted an even simpler estimator, based on singlenucleotide frequencies (see [9, 77] and references therein for alternative ways to compute \(\tilde {p}_{T}(W)\)).
In our experiments, none of these preprocessing methods yielded a visible improvement on classification quality, suggesting that noise is automatically dampened by the formulaextraction algorithms run on raw frequencies. Nonetheless, we include in our LAF package an implementation of all such filters, since they could be useful in other data sets.
Conclusions and future work
The LAF method combines kmer composition vectors and rulebased classification algorithms to classify biological sequences. Such sequences do not need to be aligned or to belong to the same gene. The method was applied to bacterial whole genomes, and it was able to perform with accurate classification results and to identify common subsequences (kmers) in each taxon (class) of the data set.
We compared our method with other stateofthe art classification methods and provided experimental results that show promising performance of LAF in particular in the classification model extraction (i.e., specific kmers for each taxon).
Several directions for future research stem from the results obtained in this paper: further reducing the size of the classification models, analyzing more deeply the kmers selected by our models; and measuring how classification performance degenerates when moving from whole genomes to short fragments.
Another possible way to further reduce the size of our models consists in building hierarchical classification rules by extracting logic formulas that best discriminate between elements in a taxonomic unit \(\mathcal {T}\) and elements in \(\text {\texttt {parent}}(\mathcal {T}) \backslash \mathcal {T}\), where \(\text {\texttt {parent}}(\mathcal {T})\) is the parent of \(\mathcal {T}\) in the taxonomic tree. Such result would look very similar to a decision tree, and the corresponding kmers could be related to the notion of crowns (see [79]).
Analyzing the actual kmers selected by our models is another obvious open direction, for example in terms of syntactic similarity and positional correlations between the kmers that appear in the same formula, or in terms of enrichment of such kmers in regulatory regions or in gene families devoted to specific cellular processes.
It is also of interest the understanding of how the classification performance degenerates when moving from whole genomes to short fragments, for example by determining how small a fragment we can classify correctly using the formulas learned from entire genomes, or using new formulas learned from fragments. Abundance estimation in metagenomic samples is also a natural application for the strong biases in the relative frequency of kmers that we report here: given a set of observed kmer frequencies in a sample, and a set of logic rules in sequenced genomes, the problem would then amount to compute the most probable abundance of known species in the sample.
Declarations
Acknowledgements
The authors are grateful to the organizing committee of the 5th Biological Discovery Workshop (Biokdd 2014) for inviting them to write and publish the manuscript in Biodata Mining. The authors would like to thank Giulia Fiscon for the precious advices and for revising the paper and prof. Paola Bertolazzi for providing a stimulating research environment and fruitful scientific discussions. This paper is dedicated to prof. Alberto Apostolico. The authors have been supported by the Italian PRIN “GenData 2020” (2010RTFWBH), the FLAGSHIP “InterOmics” project (PB.P05), and by Academy of Finland under grant 250345 (Center of Excellence in Cancer Genetics Research).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Pearson WR. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the smithwaterman and fasta algorithms. Genomics. 1991; 11(3):635–50.PubMedView ArticleGoogle Scholar
 Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970; 48(3):443–53.PubMedView ArticleGoogle Scholar
 Pearson WR. Rapid and sensitive sequence comparison with fastp and fasta. Methods Enzymol. 1990; 183:63–98.PubMedView ArticleGoogle Scholar
 Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al.Gapped blast and psiblast: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25(17):3389–402.PubMedPubMed CentralView ArticleGoogle Scholar
 Edgar RC. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004; 32(5):1792–7.PubMedPubMed CentralView ArticleGoogle Scholar
 Thompson JD, Gibson T, Higgins DG. Multiple sequence alignment using clustalw and clustalx. Curr Protocol Bioinformatics. 2002; 00:2.3:2.3.1–2.3.22.Google Scholar
 Mokaddem A, Elloumi M. Motalign: A multiple sequence alignment algorithm based on a new distance and a new score function. In: DEXA Workshops. Los Alamitos, CA, USA: IEEE Computer Society: 2013. p. 81–4.Google Scholar
 Katoh K, Misawa K, Kuma Ki, Miyata T. Mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic Acids Res. 2002; 30(14):3059–66.PubMedPubMed CentralView ArticleGoogle Scholar
 Vinga S, Almeida J. Alignmentfree sequence comparisona review. Bioinformatics. 2003; 19(4):513–23.PubMedView ArticleGoogle Scholar
 Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL. Alignment of whole genomes. Nucleic Acids Res. 1999; 27(11):2369–76.PubMedPubMed CentralView ArticleGoogle Scholar
 Li M, Vitnyi PMB. An Introduction to Kolmogorov Complexity and Its Applications, 3rd ed. New York, USA: Springer; 2008.View ArticleGoogle Scholar
 Almeida JS, Vinga S. Universal sequence map (usm) of arbitrary discrete sequences. BMC Bioinformatics. 2002; 3:6.PubMedPubMed CentralView ArticleGoogle Scholar
 Vinga S. Biological sequence analysis by vectorvalued functions: revisiting alignmentfree methodologies for DNA and protein classification In: Pham TD, Yan H, Crane DI, editors. Advanced Computational Methods for Biocomputing and Bioimaging. New York: Nova Science Publishers: 2007.Google Scholar
 Vinga S, Almeida J. Alignmentfree sequence comparison – a review. Bioinformatics. 2003; 19(4):513–23.PubMedView ArticleGoogle Scholar
 Bentley SD, Parkhill J. Comparative genomic structure of prokaryotes. Annu Rev Genet. 2004; 38:771–91.PubMedView ArticleGoogle Scholar
 Josse J, Kaiser A, Kornberg A. Enzymatic synthesis of deoxyribonucleic acid. J Biol Chem. 1961; 236:864–75.PubMedGoogle Scholar
 Trautner T, Swartz M, Kornberg A. Enzymatic synthesis of deoxyribonucleic acid, x. influence of bromouracil substitutions on replication. Proc Natl Acad Sci U S A. 1962; 48(3):449.PubMedPubMed CentralView ArticleGoogle Scholar
 Russell G, Walker P, Elton R, SubakSharpe J. Doublet frequency analysis of fractionated vertebrate nuclear DNA. J Mol Biol. 1976; 108(1):1–20.PubMedView ArticleGoogle Scholar
 Russell G, SubakSharpe J. Similarity of the general designs of protochordates and invertebrates. Nature. 1977; 266(5602):533–6.PubMedView ArticleGoogle Scholar
 Karlin S, Burge C. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 1995; 11(7):283–90.PubMedView ArticleGoogle Scholar
 Karlin S, Mrázek J. Compositional differences within and between eukaryotic genomes. Proc Natl Acad Sci. 1997; 94(19):10227–32.PubMedPubMed CentralView ArticleGoogle Scholar
 Teeling H, Meyerdierks A, Bauer M, Amann R, Glöckner FO. Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ Microbiol. 2004; 6(9):938–47.PubMedView ArticleGoogle Scholar
 Zhou F, Olman V, Xu Y. Barcodes for genomes and applications. BMC Bioinformatics. 2008; 9(1):546.PubMedPubMed CentralView ArticleGoogle Scholar
 Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol. 1999; 16(10):1391–9.PubMedView ArticleGoogle Scholar
 Sandberg R, Winberg G, Bränden CI, Kaske A, Ernberg I, Cöster J. Capturing wholegenome characteristics in short sequences using a naive bayesian classifier. Genome Res. 2001; 11(8):1404–9.PubMedPubMed CentralView ArticleGoogle Scholar
 Pride DT, Meinersmann RJ, Wassenaar TM, Blaser MJ. Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res. 2003; 13(2):145–58.PubMedPubMed CentralView ArticleGoogle Scholar
 Gatherer D. Genome signatures, selforganizing maps and higher order phylogenies: A parametric analysis. Evol Bioinformatics Online. 2007; 3:211.Google Scholar
 Takahashi M, Kryukov K, Saitou N. Estimation of bacterial species phylogeny through oligonucleotide frequency distances. Genomics. 2009; 93(6):525–33.PubMedView ArticleGoogle Scholar
 Teeling H, Waldmann J, Lombardot T, Bauer M, Glockner FO. Tetra: a webservice and a standalone program for the analysis and comparison of tetranucleotide usage patterns in dna sequences. BMC Bioinformatics. 2004; 5(1):163.PubMedPubMed CentralView ArticleGoogle Scholar
 Rigoutsos I, Floratos A, Ouzounis C, Gao Y, Parida L. Dictionary building via unsupervised hierarchical motif discovery in the sequence space of natural proteins. Proteins. 1999; 37(2):264–77.PubMedView ArticleGoogle Scholar
 Chor B, Horn D, Goldman N, Levy Y, Massingham T. Genomic DNA kmer spectra: models and modalities. Genome Biol. 2009; 10(10):108.View ArticleGoogle Scholar
 Oğul H, Mumcuoğlu EÜ. Svmbased detection of distant protein structural relationships using pairwise probabilistic suffix trees. Comput Biol Chem. 2006; 30(4):292–9.PubMedView ArticleGoogle Scholar
 Karlin S, Mrazek J, Campbell AM. Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol. 1997; 179(12):3899–913.PubMedPubMed CentralView ArticleGoogle Scholar
 Foerstner KU, von Mering C, Hooper SD, Bork P. Environments shape the nucleotide composition of genomes. EMBO Rep. 2005; 6(12):1208–13.PubMedPubMed CentralView ArticleGoogle Scholar
 McHardy AC, Martín HG, Tsirigos A, Hugenholtz P, Rigoutsos I. Accurate phylogenetic classification of variablelength DNA fragments. Nat Methods. 2007; 4(1):63–72.PubMedView ArticleGoogle Scholar
 Chatterji S, Yamazaki I, Bai Z, Eisen JA. Compostbin: A dna compositionbased algorithm for binning environmental shotgun reads. In: Research in Computational Molecular Biology. Berlin: Springer: 2008. p. 17–28.Google Scholar
 Leung HC, Yiu S, Yang B, Peng Y, Wang Y, Liu Z, et al.A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics. 2011; 27(11):1489–95.PubMedView ArticleGoogle Scholar
 Wang Y, Leung HC, Yiu S, Chin FY. Metacluster 4.0: a novel binning algorithm for ngs reads and huge number of species. J Comput Biol. 2012; 19(2):241–9.PubMedView ArticleGoogle Scholar
 Tanaseichuk O, Borneman J, Jiang T. Separating metagenomic short reads into genomes via clustering. In: Algorithms in Bioinformatics. New York, NY, USA: Springer: 2011. p. 298–313.Google Scholar
 Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F. Alignmentfree sequence comparison based on next generation sequencing reads. In: Research in Computational Molecular Biology. Berlin: Springer: 2012. p. 272–85.Google Scholar
 Stuart GW, Moffett K, Baker S. Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics. 2002; 18(1):100–8.PubMedView ArticleGoogle Scholar
 Stuart GW, Moffett K, Leader JJ. A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes. Mol Biol Evol. 2002; 19(4):554–62.PubMedView ArticleGoogle Scholar
 Comin M, Verzotto D. Wholegenome phylogeny by virtue of unic subwords. In: Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop On. Los Alamitos, CA, USA: IEEE Computer Society: 2012. p. 190–4.Google Scholar
 Kuksa P, Pavlovic V. Efficient alignmentfree DNA barcode analytics. BMC Bioinformatics. 2009; 10(Suppl. 14):9.View ArticleGoogle Scholar
 Solovyev VV, Makarova KS. A novel method of protein sequence classification based on oligopeptide frequency analysis and its application to search for functional sites and to domain localization. Comput Appl Biosci: CABIOS. 1993; 9(1):17–24.PubMedGoogle Scholar
 Ratnasingham S, Hebert PDN. BOLD: the barcode of life data system. Mol Ecol Notes. 2007; 7:355–64.PubMedPubMed CentralView ArticleGoogle Scholar
 Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics. 2011; 12(Suppl 2):4.View ArticleGoogle Scholar
 Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique cladespecific marker genes. Nat Methods. 2012; 9(8):811–4.PubMedPubMed CentralView ArticleGoogle Scholar
 Edwards RA, Olson R, Disz T, Pusch GD, Vonstein V, Stevens R, et al.Real time metagenomics: Using kmers to annotate metagenomes. Bioinformatics. 2012; 28(24):3316–17.PubMedPubMed CentralView ArticleGoogle Scholar
 Seth S, Välimäki N, Kaski S, Honkela A. Exploration and retrieval of wholemetagenome sequencing samples. Bioinformatics. 2014; 30(17):2471–9.PubMedPubMed CentralView ArticleGoogle Scholar
 Weitschek E, Fiscon G, Felici G. Supervised dna barcodes species classification: analysis, comparisons and results. BioData Mining. 2014; 7:4.PubMedPubMed CentralView ArticleGoogle Scholar
 Lehr T, Yuan J, Zeumer D, Jayadev S, Ritchie M. Rule based classifier for the analysis of genegene and geneenvironment interactions in genetic association studies. BioData Mining. 2011; 4(1):4. doi:http://dx.doi.org/10.1186/1756038144.PubMedPubMed CentralView ArticleGoogle Scholar
 Polychronopoulos D, Weitschek E, Dimitrieva S, Bucher P, Felici G, Almirantis Y. Classification of selectively constrained dna elements using feature vectors and rulebased classifiers. Genomics. 2014; 104(2):79–86.PubMedView ArticleGoogle Scholar
 Kudenko D, Hirsh H. Feature generation for sequence categorization. In: AAAI/IAAI. Cambridge, USA: The MIT Press: 1998. p. 733–8.Google Scholar
 BenHur A, Brutlag D. Remote homology detection: a motif based approach. Bioinformatics. 2003; 19(suppl 1):26–33.View ArticleGoogle Scholar
 Xing Z, Pei J, Keogh E. A brief survey on sequence classification. ACM SIGKDD Explorations Newslett. 2010; 12(1):40–8.View ArticleGoogle Scholar
 Kuksa P, Pavlovic V. Efficient alignmentfree dna barcode analytics. BMC Bioinformatics. 2009; 10 Suppl 14:9. doi:http://dx.doi.org/10.1186/1471210510S14S9.View ArticleGoogle Scholar
 Vapnik VN, Vapnik V. Statistical Learning Theory. New York, NY, USA: Wiley; 1998.Google Scholar
 Bertolazzi P, Felici G, Weitschek E. Learning to classify species with barcodes. BMC Bioinformatics. 2009; 10(S14):7.View ArticleGoogle Scholar
 Weitschek E, Lo Presti A, Drovandi G, Felici G, Ciccozzi M, Ciotti M, et al.Human polyomaviruses identification by logic mining techniques. BMC Virol J. 2012; 58(9):1–6.Google Scholar
 Gaines BR, Compton P. Induction of rippledown rules applied to modeling large databases. J Intell Inf Syst. 1995; 5(3):211–28.View ArticleGoogle Scholar
 Frank E, Witten IH. Generating accurate rule sets without global optimization. In: Proc. of the 15th Int. Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann: 1998.Google Scholar
 Cohen WW. Fast effective rule induction. In: Proceedings of the Twelfth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann: 1995. p. 115–23.Google Scholar
 Felici G, Truemper K. A minsat approach for learning in logic domains. INFORMS J Comput. 2002; 13(3):1–17.Google Scholar
 Bertolazzi P, Felici G, Weitschek E. Learning to classify species with barcodes. BMC Bioinformatics. 2009; 10(S14):7.View ArticleGoogle Scholar
 Quinlan JR. Improved use of continuous attributes in C4.5. J Artif Intell Res. 1996; 4:77–90.Google Scholar
 Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The weka data mining software: an update. SIGKDD Explor Newsl. 2009; 11(1):10–18. doi:http://dx.doi.org/10.1145/1656274.1656278.View ArticleGoogle Scholar
 Marcais G, Kingsford C. A fast, lockfree approach for efficient parallel counting of occurrences of kmers. Bioinformatics. 2011; 27(6):764–70. doi:http://dx.doi.org/10.1093/bioinformatics/btr011.PubMedPubMed CentralView ArticleGoogle Scholar
 An Introduction to Support Vector Machines and Other Kernelbased Learning Methods. Cambridge, UK: Cambridge University Press.Google Scholar
 Dasarathy BV. Nearest Neighbor NN Norms: NN Pattern Classification Techniques. Los Alamitos, CA, USA: IEEE Computer Society Press; 1991.Google Scholar
 Teeling H, Meyerdiekers A, Bauer M, Glockner FO. Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ Microbiol. 2004; 6(9):938–47.PubMedView ArticleGoogle Scholar
 Pride DT, Meinersmann RJ, Wassenaar TM, Blaser MJ. Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res. 2003; 13:145–58.PubMedPubMed CentralView ArticleGoogle Scholar
 Teeling H, Waldmann J, Lombardot T, Bauer M, Glockner FO. Tetra: a webservice and a standalone program for the analysis and comparison of tetranucleotide usage patterns in dna sequences. BMC Bioinformatics. 2004; 5:163.PubMedPubMed CentralView ArticleGoogle Scholar
 Chan RH, Chan TH, Yeung HM, Wang RW. Composition vector method based on maximum entropy principle for sequence comparison. Comput Biol Bioinform IEEE/ACM Trans. 2012; 9(1):79–87.View ArticleGoogle Scholar
 Qi J, Wang B, Hao BI. Whole proteome prokaryote phylogeny without sequence alignment: a kstring composition approach. J Mol Evol. 2004; 58(1):1–11.PubMedView ArticleGoogle Scholar
 Yu ZG, Zhou LQ, Anh VV, Chu KH, Long SC, Deng JQ. Phylogeny of prokaryotes and chloroplasts revealed by a simple composition approach on all protein sequences from complete genomes without sequence alignment. J Mol Evol. 2005; 60(4):538–45.PubMedView ArticleGoogle Scholar
 Song K, Ren J, Reinert G, Deng M, Waterman MS, Sun F. New developments of alignmentfree sequence comparison: measures, statistics and nextgeneration sequencing. Brief Bioinform. 2014; 15(3):343–53.PubMedView ArticleGoogle Scholar
 Blast Package Version 2.2.257. http://packages.ubuntu.com/precise/ncbiblast+. Accessed Dec 2015.
 Huang K, Brady A, Mahurkar A, White O, Gevers D, Huttenhower C, et al.Metaref: a pangenomic database for comparative and community microbial genomics. Nucleic Acids Res. 2014; 42:617–24.View ArticleGoogle Scholar