 Methodology
 Open Access
 Published:
A new pipeline for structural characterization and classification of RNASeq microbiome data
BioData Mining volume 14, Article number: 31 (2021)
Abstract
Background
Highthroughput sequencing enables the analysis of the composition of numerous biological systems, such as microbial communities. The identification of dependencies within these systems requires the analysis and assimilation of the underlying interaction patterns between all the variables that make up that system. However, this task poses a challenge when considering the compositional nature of the data coming from DNAsequencing experiments because traditional interaction metrics (e.g., correlation) produce unreliable results when analyzing relative fractions instead of absolute abundances. The compositionalityassociated challenges extend to the classification task, as it usually involves the characterization of the interactions between the principal descriptive variables of the datasets. The classification of new samples/patients into binary categories corresponding to dissimilar biological settings or phenotypes (e.g., control and cases) could help researchers in the development of treatments/drugs.
Results
Here, we develop and exemplify a new approach, applicable to compositional data, for the classification of new samples into two groups with different biological settings. We propose a new metric to characterize and quantify the overall correlation structure deviation between these groups and a technique for dimensionality reduction to facilitate graphical representation. We conduct simulation experiments with synthetic data to assess the proposed method’s classification accuracy. Moreover, we illustrate the performance of the proposed approach using Operational Taxonomic Unit (OTU) count tables obtained through 16S rRNA gene sequencing data from two microbiota experiments. Also, compare our method’s performance with that of two stateoftheart methods.
Conclusions
Simulation experiments show that our method achieves a classification accuracy equal to or greater than 98% when using synthetic data. Finally, our method outperforms the other classification methods with real datasets from gene sequencing experiments.
Background
Microorganisms living inside and on humans are known as the microbiota. When integrated with their genes’ information, it is known as the microbiome. The Human Microbiome Project (HMP) was an endeavor for the characterization of the human microbiota to further understanding its impact on human health and diseases [1].
In recent years, biological sciences have experienced substantial technological advances that have led to the rediscovery of systems biology [2,3,4]. These advances were possible thanks to the technological ability to completely sequence the genome from any organism at a low cost [5, 6]. Such advances triggered the development of various analytic approaches and technologies to simultaneously monitoring all the components within cells (e.g., genes and proteins). With the genome information and analytic technologies, the mining and exploration of the resulting data opened up the possibility to better understand biological systems, such as microbial populations, and their complexity. The network structure of such biological systems can give insight into the underlying interactions taking place within those systems [7,8,9,10]. Furthermore, the understanding of these interactions can lead to the discovery of new methods that can help physicians, biologists, scientists, and healthcare workers with disease diagnosis, gene identification, classification of new data, and many other tasks [11].
We initially conducted a literature search in different medical, biological, and engineering databases as well as academic sites prestigious journals such as BMC Bioinformatics, PLOS ONE, ScienceDirect, and IEEE Xplore using the queries “correlation structure for gene expression classifications,” “classifiers for compositional data,” and “classifiers based on correlation structures” in order to identify papers in English using procedures for sample classification based on correlation structures in the 2009–2019 time window. Figure 1 shows the evolution of the number of publications retrieved when the keywords “correlation structure for gene expression classifications” are used. Publications were retrieved from several academic sites, namely BMC Bioinformatics, PLOS One, ScienceDirect, and Scopus. Figure 2 summarizes the current principal stages of gene expression analysis for sample classification.
Operational Taxonomic Unit (OTU) count tables are the usual output when processing the 16S rRNA sequences of microbiota samples [12]. These tables show the relative abundances of the bacteria that make a microbiota population (e.g., the human gut microbiota). OTUbased data have a compositional nature, which makes them difficult to work with [13, 14]. Thus, data transformation is required prior to any further analysis.
Aitchison [15] proposed two transformations to compensate for the data’s compositionality, thus allowing the use of standard metrics in further analysis. The first transformation is the additive logratio (alr), which is defined as:
where x_{j} is an element of {x_{1}, x_{2}, x_{3}…, x_{n}}. Because one value x_{j} is selected as the denominator to build the logratios, the alr has been criticized as being subjective since the outcome depends mostly on the value of x_{j} selected [15,16,17,18].
The second transformation proposed by Aitchison is the centered logratio (clr), which is defined as:
where \( g\left(\boldsymbol{x}\right)={\left({\prod}_{i=1}^n{x}_i\right)}^{\frac{1}{n}} \) is the geometric mean. The use of g(x) avoids the subjectivity of the alr transformation since the method is taking all the information of x [15,16,17,18,19]. The clr transformation has proven to be reliable and has been extensively used in the scientific literature over the years to analyze microbiome data.
In [20] authors proposed a transformation called the isometric logratio (irl) transformation. This approach takes any compositional data x ∈ S^{N}, and computes ilr(x) = z = [z_{1}, z_{2}, …, z_{N}], where z_{i} is calculated as:
However, implementing the ilr transformation poses serious practical difficulties for highdimension data as the computational complexity increases rapidly with dimensionality [21].
Feature selection
After transforming the data, the next step is to separate the data into train, test, and validation sets, although in some cases only the train and test sets are considered. One of the most common problems prior to that step is the limitation of the number of data samples. Indeed, for a normal classifier to be employed using multivariate metrical techniques, the sample size required for optimum training is in order of thousands. This is known as the “curse of dimensionality” problem, and the usual way to overcome this limitation is by using a dimensionality reduction technique to collapse all the attributes (variables) into a lowerdimension space where the most dominant information of the dataset can be retrieved [13, 22].
Feature selection methods are usually separated into three categories: filter, wrapper, and embedded. Table 1 summarizes different approaches for feature selection in gene expression data, the most relevant categories for feature selection, and the current weaknesses when analyzing gene expression data. Filter methods can work with univariate and multivariate data, where univariate methods focus on each feature separately and multivariate methods focus on finding relationships between features [23, 24]. Here we only consider multivariate methods.
The abovementioned filter methods tend to be computationally efficient. Wrapper methods, on the other hand, tend to have a better performance in selecting features since they take a model hypothesis into account, meaning that a training and testing procedure is made in the feature space. However, this approach is computationally inefficient and is more problematic as the feature space grows [23, 26, 29, 30]. Embedded methods make the feature selection based on the classifier (i.e., selected features might not work with any other classifier) and hence tend to have a better computational performance than wrappers. This is the case because the optimal set of descriptors is built when the classifier is constructed and the feature selection is affected by the hypotheses made by the classifier [23, 26, 29,30,31].
In [14], authors presented SParse InversE Covariance Estimation for Ecological ASsociation Inference (SPIECEASI), a novel strategy to infer networks from a high dimensional community compositional data. SPIECEASI estimates the interaction graph from the transformed data using either Recursive Feature selection or Sparse Inverse Covariance selection and seeks to infer an underlying graphical model using conditional independence. In [32] authors proposed a modification of the Support Vector Machine – Recursive Feature Elimination (SVMRFE) algorithm for feature selection. SVMRFE removes one irrelevant feature at each iteration, but this can be troublesome when the number of features is large. Thus, its modification, namely Correlation based Support Vector Machine – Recursive Multiple Feature Elimination (CSVMRMFE), finds the correlated features and removes more than one irrelevant feature per iteration. Rao and S. Lakshminarayanan [13] presented a new significant attribute selection method based on the Partial Correlation Coefficient Matrix (PCCM).
Classification
The final step after finding the most relevant features of the transformed data is to select a classifier. In clinical and bioinformatic research, prediction models are extensively used to derive classification rules useful to accurately predict whether a patient has or would develop a disease, whether the treatment is going to work, or even whether a disease would recur [33,34,35]. Table 2 summarizes the relevant aspects of some widely used classifiers.
Depending on the data, a classifier can belong to one of two groups: supervised or unsupervised [36]. In supervised classification (learning), samples are labeled according to some a prioridefined classes or categories, whereas in unsupervised learning, samples are not labeled, and the classifier clusters the data into different classes or categories after maximizing or minimizing a set of criteria.
Dembélé and Kastner [37] presented a new Fold Change method that can detect differentially expressed genes in microarray data. The traditional fold change method works by calculating the ratio between the averages from the samples (usually two different biological conditions, e.g., control and case samples). Then, cutoff values (e.g., 0.5 for down and 2 for upregulated) are used to select genes under/above such thresholds. This new approach is more accurate and faster than the traditional method and can assign a metric to each differentially expressed gene, which can be used as a selection criterion.
Belciug and F. Gorunescu [43] proposed a novel initialization of a single hidden layer feedforward neural network’s input weights using the knowledge embedded in the connections between variables and class labels. The authors expressed this by the nonparametric GoodmanKruskal Gamma rank correlation instead of the traditional random initialization. The use of this correlation also helped to increase computational speed by eliminating unnecessary features based on the significance of the rank correlation between variables and class labels.
In [42], authors proposed a framework to find information about genes and to classify gene combinations belonging to its relevant subtype using fuzzy logic, which adapts numerical data (input/output pairs) into human linguistic terms, offering good capabilities to deal with noisy and missing data. However, defining the rules and membership functions might require a lot of prior knowledge from a human expert [41]. Dettling and P. Bühlmann [44] proposed a boosting method combining a dimensionality reduction step with the LogitBoost algorithm [45] and compared it to AdaBoost.M1 [46], the nearest neighbor classifier [47], and classification and regression trees (CART) using gene expression data [48]. Dettling and P. Bühlmann showed that, for low dimensional data, LogitBoost can perform slightly better than AdaBoost.M1, and that for real high dimensional data, their approach can outperform the other classifiers in some cases.
In this paper, we present a new method to classify samples into two groups with different characteristics (i.e., phenotypes, health condition, among others) when data of compositional nature is available. Our method relies on a new metric to quantitatively characterize the overall correlation structure deviation when comparing the two datasets and a new dimensionality reduction approach. The proposed method is assessed and compared, based on classification accuracy, to two stateoftheart methods using both synthetic datasets and real datasets from RNA16s sequencing experiments.
Proposed classification method
Here, we explain in detail the proposed classification method. First, in section “Data pretreatment”, we introduce the Data Pretreatment stage, and in section “Assessing correlation structure distortion”, a novel metric to be used as the metric to assess correlation structure distortion is described. Finally, in section “Dimensionality reduction technique”, we present the proposed classification rule, which is based on the previously defined metric and a proposed dimensionalityreduction approach to assess the disruption of a dataset’s correlation structure after a new sample is included.
Data pretreatment
Let \( {X}_c^{\rho}\in {\mathbb{R}}^{n_c\times m} \) and \( {X}_v^{\rho}\in {\mathbb{R}}^{n_v\times m} \) be the OTU count tables where m features are assessed in n_{c} and n_{v} samples from control and case individuals, respectively. In the expressions above, the superindex ρ indicates the datasets are ‘raw’ or without pretreatment. From now on, \( {X}_g^{\rho } \) will represent any of the two groups (g = c for control, or g = v for case).
When analyzing OTU counts tables, a logratio transformation, such as the clr, is to be applied [15, 18, 19] before estimating correlations. However, in order to apply the logratio transformation, it is necessary to consider that compositional count datasets may contain null values resulting from insufficiently large or nonexisting samples. As logratio transformations require data with exclusively positive values, the use of a zeroreplacement method is a must. Here we use the Bayesianmultiplicative (BM) algorithm proposed by MartínFernández [49]. Let \( {\boldsymbol{x}}_{p_i} \) ∈ℝ^{1 × m} be the ith row of the matrix \( {X}_g^{\rho } \) (i = 1, 2, …, n_{g}). The BM algorithm replaces the null counts by
When using the BayesLaplace prior, we set \( n=\sum \limits_{j=1}^m{x}_{p_{i,j}} \), t_{i, j} = m^{−1} and s_{i} = m. Let \( {X}_g^{BM}:= BM\left({X}_g^{\rho}\right) \) be the resulting matrix after the BM algorithm is applied rowwise to \( {X}_g^{\rho } \).
To ensure the data’s compositionality on \( {X}_g^{BM} \), a closure operation [15, 18, 19] is applied to every row of \( {X}_g^{BM} \), as follows:
where k is an arbitrary constant (usually k = 100). Let \( {X}_g^{BM,c}:= c\left( BM\left({X}_g^{\rho}\right)\right) \) be the resulting matrix after the BM algorithm and the closure operation have been applied. Now, the clr transformation is applied to each vector x_{p} ∈ ℝ^{1 × n} \( {X}_g^{BM,c} \), as
where \( g\left({x}_p\right)={\left({\prod}_{i=1}^n{x}_i\right)}^{\frac{1}{n}} \) is the geometric mean. Hence,
Finally, a normalization is applied, resulting in:
where \( {I}_g=\left[1\ 1\dots .1\right]\in {\mathbb{R}}^{n_g\times 1} \) is a column vector of ones, \( {b}_g\in {\mathbb{R}}^{n_g\times 1} \) is a column vector that contains the means of all the variables in X_{g}, and Σ_{g} ∈ ℝ^{m × m} is a diagonal matrix that contains the standard deviation (\( {\sigma}_{g_i} \), for i = 1, …, m) of all variables.
Assessing correlation structure distortion
Here, we introduce φ, a new metric to quantitatively assess the distortion in the correlation structure of a dataset after the incorporation of a new sample. The Pearson correlation matrix for X_{g} is calculated as follows [50]:
Now, consider a new sample, x_{p} ∈ ℝ^{1 × m}. The pretreatment step for this sample yields:
Let \( {\overset{\sim }{X}}_g{\mathbb{R}}^{n_g\times m} \) be the (augmented) dataset X_{g} after incorporating the new sample, and let S_{g} and \( {\overset{\sim }{S}}_g \) be the correlation matrices for X_{g} and \( {\overset{\sim }{X}}_g \), respectively. The spectral decomposition for these matrices is
where
are diagonal matrices containing the eigenvalues for S_{g} and \( {\overset{\sim }{S}}_g \). Let \( {V}_g=\left[{\boldsymbol{v}}_{g_1}\kern0.5em {\boldsymbol{v}}_{g_2}\kern0.5em \begin{array}{cc}\cdots & {\boldsymbol{v}}_{g_m}\end{array}\right]\in {\mathbb{R}}^{m\times m} \) and \( {\overset{\sim }{V}}_g=\left[\begin{array}{cc}{\overset{\sim }{\boldsymbol{v}}}_{g_1}& {\overset{\sim }{\boldsymbol{v}}}_{g_2}\end{array}\kern0.5em \begin{array}{cc}\cdots & {\overset{\sim }{\boldsymbol{v}}}_{g_m}\end{array}\right]\in {\mathbb{R}}^{m\times m} \) be the eigenvector matrices of S_{g} and \( {\overset{\sim }{S}}_g \). Figure 3a illustrates, in a 2dimensional example, the datasets X_{g} and \( {\overset{\sim }{X}}_g \). Figure 3b illustrates the datasets after carrying out the pretreatment, along with their eigenvectors (which are unitary) scaled by their corresponding eigenvalues obtained from the spectral decompositions. Note that scaled eigenvectors mark out the directions of largest variability, capturing high order interactions between the OTUs ruling the overall association structure. Therefore, looking at deviations in both the magnitude and direction of those scaled eigenvectors must give insightful information on overall changes in the association structure of a microbiota population.
Based on the abovementioned remarks, we introduce φ to characterize the distortion produced in the underlying correlation structure when two OTU counts datasets are compared. This metric first requires a dimensional reduction, which will be performed by selecting the principal components for each sample group. This procedure, integrated within the Principal Component Analysis (PCA) algorithm [25], consists of finding the minimum number of eigenvalues a_{g} or \( \tilde{a}_{g} \) (for X_{g} and \( {\overset{\sim }{X}}_g \), respectively) that explain 100(1 − α)% of the total variance, i.e.:
Thus, φ is defined as
where \( \left({\lambda}_{g_j}{\overset{\sim }{\lambda}}_{g_j}\right) \) is the algebraic difference (magnitude deviation) of the jth eigenvalues in Λ_{g} and \( {\overset{\sim }{\Lambda}}_g \), \( {\cos}^{1}\left({\boldsymbol{v}}_{g_j}^T{\overset{\sim }{\boldsymbol{v}}}_{g_j}\right) \) computes angular deviation between the jth eigenvectors in V_{g} and \( {\overset{\sim }{V}}_g \), and \( \max \left\{{\lambda}_{g_j},{\overset{\sim }{\lambda}}_{g_j}\right\} \) provides a weighting factor so that the contribution of the jth deviation to the index φ is proportional to the relative importance among principal components.
Dimensionality reduction technique
Now that we have a metric to measure the distortion caused in the correlation structure of the g group after the incorporation of a new sample, we could then infer to which group the new sample would belong, providing a classification criterion based on how distorted the correlation structure is when incorporating x_{p}. The intuitive way of approaching the evaluation of the distortion would be to integrate x_{p} into X_{g} and (re)calculate the correlation matrix for the further evaluation of its distortion. However, considering that the g group may contain many samples, a single new sample may not be enough to generate a significant distortion in the correlation structure. Furthermore, if the number of samples in the groups is unbalanced, the distortion caused by the inclusion of a new sample may not be comparable.
An approach to overcome this dimensional problem is to randomly subsample a small number of rows in X_{g}, combining them with x_{p}, and then calculating the distortion caused. This approach, however, would not include a considerable amount of information, which is contained in the rows that were left out. To address this issue, we propose a new dimensionality reduction approach that allows a weighted assessment of the distortion in S_{g} caused by the integration of a new sample x_{p}. This approach will use all the information contained in the original data, with the objective of providing a classification algorithm for any upcoming sample.
The first step of the proposed approach is to find an expression for the distorted correlation matrix that reveals the natural weights of the contributions of X_{g} and x_{p} to the makeup of the new correlation structure. Suppose that the data is concatenated as:
where \( \tilde{n}_{g}={n}_g+1 \) is the number of rows of \( {\overset{\sim }{X}}_g \). Combining Eqs. (15) and (8) yields
Normalizing \( {\overset{\sim }{X}}_g \) produces
where \( \tilde{b}_{g} \) is the vector that contains the means of \( {\overset{\sim }{X}}_g \), \( {\overset{\sim }{\Sigma}}_g \) is a diagonal matrix that contains the distorted standard deviations, \( \Delta {b}_g:= \tilde{b}_{g}{b}_g \) is the distortion in the mean vector, and \( {\boldsymbol{x}}_{p_{norm}}=\left({\boldsymbol{x}}_p\tilde{b}_{g}^T\right){\overset{\sim }{\Sigma}}_g^{1} \). Both \( \tilde{b}_{g} \) and \( {\overset{\sim }{\Sigma}}_g \) are unknown. Thus, we need to derive expressions for them. The distorted means vector is calculated as \( \tilde{b}_{g}=\frac{1}{\tilde{n}_{g}}{{\overset{\sim }{X}}_g}^T{I}_{\tilde{n}_{g}} \), which can be converted into:
Equation (18) shows that the natural weights are \( {w}_1=\frac{n_g}{n_g+1} \) and \( {w}_2=\frac{1}{n_g+1} \) for b_{g} and x_{p}, respectively. To find an expression for the diagonal matrix of distorted standard deviations, \( {\overset{\sim }{\Sigma}}_g \), a columnwise subtraction of the mean vector for \( {\overset{\sim }{X}}_g \) is performed:
Adding and subtracting \( {I}_{n_g}{b}_g^T \) to \( {X}_g{I}_{n_g}\tilde{b}_{g}^T \) in Eq. (19) yields:
where
is the ith column of \( {\overset{\sim }{X}}_{g_{mean centered}}\left(:,i\right) \), the corresponding ith variable. Then, the variance of this ith variable will be \( {\overset{\sim }{\sigma}}_{g_i}^2=\frac{1}{\tilde{n}_{g}1}{\left({\overset{\sim }{X}}_{g_{mean centered}}\left(:,i\right)\right)}^T\ {\overset{\sim }{X}}_{g_{mean centered}}\left(:,i\right) \), which can be written as:
Equation (22) can be further expanded as:
Notice that, in this expression, the terms \( \left({X}_g^T\left(:,i\right){b}_g(i){I}_{n_g}^T\right)\left({X}_g\left(:,i\right){b}_g(i){I}_{n_g}\right)=\left({n}_g1\right){\sigma}_{g_i}^2 \), \( {I}_{n_g}^T{I}_{n_g}={n}_g \), and \( \left({X}_g^T\left(:,i\right){b}_g(i){I}_{n_g}^T\right)\Delta {b}_g(i){I}_{n_g}=\Delta {b}_g(i){I}_{n_g}^T\left({X}_g\left(:,i\right){b}_g(i){I}_{n_g}\right) \). Then, Eq. (23) can be reduced to:
Considering that \( \tilde{n}_{g}={n}_g+1 \) and \( {I}_{n_g}^T{X}_g\left(:,i\right)={I}_{n_g}^T\left({b}_g(i){I}_{n_g}\right)={n}_g{b}_g(i) \), it follows that
From Eq. (25), notice that the (distorted) variances of the variables of the group \( {\overset{\sim }{X}}_g \) depend on: (1) the original variances in X_{g}, with natural weight \( \frac{n_g1}{n_g} \); (2) the quadratic (mean centered) values of the new sample, \( {\left({\boldsymbol{x}}_p(i)\tilde{b}_{g}(i)\right)}^2 \), with natural weight \( \frac{1}{n_g} \); and the quadratic values of the distortion in the mean vector, \( {\Delta b}_g^2(i) \). Based on equation [25], the standard deviation matrix for all m variables is
Having expressions for \( \tilde{b}_{g} \) and \( {\overset{\sim }{\Sigma}}_g \), it follows that the distorted correlation matrix is calculated as \( {\overset{\sim }{S}}_g=\frac{1}{\tilde{n}_{g}1}{\overset{\sim }{X}}_{g_{norm}}^T{\overset{\sim }{X}}_{g_{norm}} \) . Combining \( {\overset{\sim }{S}}_g \) with Eq. (17) yields
It follows that,
As \( {X}_{g_{norm}}^T{X}_{g_{norm}}=\left({n}_g1\right){S}_g \), \( {\Sigma}_g{X}_{g_{norm}}^T={X}_g^T{b}_g{I}_{n_g}^T \), \( {X}_{g_{norm}}{\Sigma}_g={X}_g{I}_{n_g}{b}_g^T \), this expression can be expressed as:
Now, as \( {X}_g^T{I}_{n_g}={b}_g{I}_{n_g}^T{I}_{n_g}={I}_{n_g}^T{X}_g={I}_{n_g}^T{I}_{n_g}{b}_g^T={n}_g{b}_g \), the second and third terms of Eq. (29) disappear. Then, the distorted correlation matrix \( {\overset{\sim }{S}}_g \) is given by
Note that, in this expression, \( {\overset{\sim }{S}}_g \) depends on three terms:

1.
\( {\overset{\sim }{\Sigma}}_g^{1}{\Sigma}_g{S}_g{\Sigma}_g{\overset{\sim }{\Sigma}}_g^{1} \), which considers the contributions made from the nondistorted correlation matrix S_{g} after an actualization of the standard deviation, with a natural weight of \( \frac{n_g1}{n_g} \).

2.
\( {\boldsymbol{x}}_{p_{norm}}^T{\boldsymbol{x}}_{p_{norm}} \), which considers the contribution of the new sample to the constitution of the distorted correlation matrix, with a natural weight of \( \frac{1}{n_g} \).

3.
\( {\overset{\sim }{\Sigma}}_g^{1}{\Delta b}_g{\Delta b}_g^T{\overset{\sim }{\Sigma}}_g^{1} \), which considers the effects of the distortion of Σ_{g} and b_{g} in \( {\overset{\sim }{S}}_g \).
Finally, the distortion of the correlation matrix will be measured with the estimation of the deviation between S_{g} and \( {\overset{\sim }{S}}_g \), using the metric \( \varphi \left({S}_g,{\overset{\sim }{S}}_g\right) \) defined in Eq. (14). As previously mentioned, if the number of samples for the group g is large, the integration of x_{p} will barely cause a distortion in the correlation structure, even if it has different features compared to the samples in X_{g}. For example, if X_{g} were composed of 200 samples, the natural relative weight of the mean vector (b_{c}) for the construction of the distorted mean vector would be ~ 0.995, while the natural weight of the sample would (only) be ~ 0.005.
On the other hand, if the weights were calculated assuming that X_{g} is composed of few samples, that is, replacing n_{g} for \( {n}_g^{red} \) (so that \( {n}_g^{red}<{n}_g \)) in the quotients to calculate the relative weights, these weights would be more even and provide a weighting factor for the calculation of the distorted correlation matrix using all the information contained in the original samples of X_{g} (in b_{g}, Σ_{g}, and S_{g}). This is equivalent to finding a generatrix base of a few samples/patients (\( {n}_g^{red} \)) that can represent all the characteristics of X_{g}, incorporate x_{p}, and then evaluate the distortion caused to the correlation structure, providing an artificial dimensional reduction. For example, if the relative weights were calculated assuming that X_{g} is composed only of three samples that exhibit all the attributes of the original dataset (i.e., \( {n}_g^{red}=3 \)), these weights would have the values of 0.75 and 0.25, respectively, for the calculation of the distorted mean vector.
The lower threshold for this artificial dimensional reduction could be found making \( {n}_g^{red}=2 \) in the calculation of the relative weights. If \( {n}_g^{red}=1 \), this would lead to leaving out all the information contained in S_{g} to the estimation of \( {\overset{\sim }{S}}_g \) (see Eq. (30)). A similar result is obtained for the standard deviation (see Eq. (28)).
Proposed classification rule
Now that the artificial dimensional reduction approach has been proposed, it will be used alongside the metric φ for the creation of a tool to classify new samples/patients into either the control or case group. The classifier will work under the assumption that a sample’s likelihood of belonging to either group is inversely proportional to the distortion caused by its incorporation into that group. This classification approach includes the following steps:

1.
Store the new sample in x_{p}.

2.
Define the “maximum artificial dimension” to be evaluated as \( n\le \mathit{\min}\left({n}_c,{n}_v\right)\ \left(n\in {\mathbbm{z}}^{+}\right). \) Choose a dimension “step of change”, \( \Delta n\in {\mathbbm{z}}^{+} \), such as n − 2 is divisible by ∆n. Thus, \( \frac{\left(n2\right)}{\Delta n}+1 \) would define the number of artificial dimensions to be evaluated. Therefore, we set \( {n}_g^{red}=\left(2,2+\Delta n,2+2\Delta n,\dots, n\right) \) for both g = c and g = v.

3.
Evaluate Eqs. (18), (25), (26) and (30) using \( {n}_g^{red} \) instead of n_{g}. Perform this evaluation for both g = c and g = v, and for all values of \( {n}_g^{red} \). Store the resulting distorted correlation matrices as

4.
For each \( {n}_g^{red}=\left(2,2+\Delta n,2+2\Delta n,\dots, n\right) \), calculate
$$ {\left.\left({\psi}_g\right)\right}_{n_g^{red}}:= \frac{1}{\left\varphi \left({S}_g,{\overset{\sim }{S}}_{g_{n_g^{red}}}\right)\right},\kern1.75em g=\left\{c,v\right\} $$(32)where l is the absolute value of l. In consequence, large values of ψ indicate a small distortion in the correlation structure, and therefore, a high degree of affinity between X_{g} and x_{p}. On the other hand, small values of ψ indicate a big distortion and a low degree of affinity between X_{g} and x_{p}.

5.
Calculate the average value for \( {\left.\left({\psi}_g\right)\right}_{n_g^{red}} \) as

6.
Finally, the outcomes of the proposed classification rule, for a single sample, are \( {\overline{\psi}}_c \) and \( {\overline{\psi}}_v \). The method will classify the sample into the group with the greater value of \( {\overline{\psi}}_g \). Figure 4 shows a graphical representation to visualize the outcome of the proposed classification method after classifying a set of new samples onebyone.
Performance assessment with synthetic data
In this section, we assess the performance of the proposed method to correctly classify synthetically generated data.
Synthetic data generation
We conducted in silico experiments to assess the performance of the proposed method under different parameter settings. The following procedure was used to generate synthetic datasets:

1.
Define the quadruplet (n_{i}, m_{j}, ρ_{c}, ρ_{v}). Set n = {20,40,60,80,100,120,140,160}, m = {20,40,60,80,100,120,140}, ρ_{c} = 0.1, ρ_{v} = 0.2.

2.
For every quadruplet in step 1 construct a pair of generatrix correlation matrices, \( {\Sigma}_{c_{j,c}} \) and \( {\Sigma}_{v_{j,v}} \) as \( {\Sigma}_{c_{j,c}}=\left(1{\rho}_c\right){I}_{m_j}+{\rho}_c{1}_{m_j}{1}_{m_j}^T \) and \( {\Sigma}_{v_{j,v}}=\left(1{\rho}_v\right){I}_{m_j}+{\rho}_v{1}_{m_j}{1}_{m_j}^T \), where \( {I}_{m_j}\in {\mathbb{R}}^{m_j\times {m}_j} \) is the identity matrix and \( {1}_{m_j}\in {\mathbb{R}}^{m_j\times 1} \) is column vector of ones.

3.
For every pair \( \left({\Sigma}_{c_{j,c}},{\Sigma}_{v_{j,v}}\right) \), B pairs of Normaldistributed matrices \( {X}_{c_r} \) and \( {X}_{v_r} \) (with r = {1, 2, …, B}) of dimension n_{i} × m_{j} are generated. For this purpose, the NumPy [54] Python package was used. The number of experimental replicates was B = 100.
Performance assessment procedure
We used the correct classification rate (accuracy) as the assessment criterion to measure the performance of our method as follows:

1.
Merge each \( \left({X}_{c_r},{X}_{v_r}\right) \) into a single matrix \( {X}_{Total}=\left[\begin{array}{c}{X}_{c_r}\\ {}{X}_{v_r}\end{array}\right]\in {\mathbb{R}}^{2n\times m} \).

2.
For every pair \( \left({X}_{c_r},{X}_{v_r}\right) \), execute the proposed algorithm with each row sample \( {x}_{p_i}={X}_{Tota{l}_i}\left[i,:\right] \), i = {1, 2, …, 2n}, and classify \( {x}_{p_i} \).

3.
Compute the average classification accuracy as:
$$ \mathrm{Accuracy}=100\times \frac{N}{2n} $$(34)where N is the number of correctly classified samples.
Performance assessment results with synthetic data
Table 3 summarizes the main results. Our method exhibits exceptional accuracy for all the configurations tested. Interestingly, accuracy decreases as the number of features m decreases and the sample size n increases.
Validation with real datasets
In this section, we study the performance of the proposed method using two realworld datasets, which contain OTU count tables obtained through 16S rRNA gene sequencing data from microbiota experiments. We also compare the classification accuracy of our method with those of two stateoftheart methods: SVM [39] and SVMRFE [41].
Datasets
The first dataset is from the American Gut Project (AGP) [51], which is one of the largest crowdfunded microbiome research projects. The second dataset is the Greengenes (GG) database [52], created with the PhyloChip 16s rRNA microarray. For the comparison experiment, only fractions of the datasets were used. In particular, a total of 578 samples and 127 features comprised the AGP data set, while 500 samples and 26 features comprised the GG data set. In both data sets, 50% of the samples correspond to cases.
Validation scenarios results
Datasets were preprocessed as described in section “Data pretreatment”. Further, the proposed method, as well as the SVM and SVMRFE methods, were applied after separating the whole data set into training, testing, and validation sets using 70, 20, and 10% of the data, respectively. For the SVMRFE method, the number of features to select was \( {n}_{features}=\left\{5,10,15,\frac{n_{features}}{2}\right\} \) and the average of the results was calculated. The tuning parameters used for the SVM and SVMRFE methods were C = 1 and γ = 0.05, where C trades off the correct classification of training examples against the maximization of the decision function’s margin, and γ defines how far the influence of a single training example reaches.
Table 4 shows the main results. For the AGP data set, SVM is the least accurate, and SVMRFE has the highest accuracy. This latter result is mostly due to all the strong features of SVM and the ability of the SVMRFE method to eliminate variables that are not highly relevant in the data. Interestingly, our method outperforms SVM and is a close competitor of SVMRFE.
For the GG dataset, although the number of variables is small, the SVMRFE and our method showed accuracy values above 90%, while the accuracy for the SVM method is below this threshold. It is worth highlighting that, for this data set, our method outperforms both the SVM and SVMRFE methods. The latter result is thanks to the artificial dimensional reduction conducted to balance the natural weights when the number of samples is greater than the number of variables. Figure 5 provides a graphical illustration of the proposed method’s classification outcome for both real datasets used for validation, i.e., the AGP and the GG.
Discussion and conclusions
The ability to characterize populations of patients, species, or biological features, usually comprising a large number of variables in order to use the extracted characteristics to classify new samples into one of such populations’ categories is a relevant tool for biological and medical studies. When data describing these populations is compositional, further limitations and challenges arise.
Here, we proposed a new method to classify samples into one of two previously known categories. The method uses a new metric developed to quantify the overall correlation structure deviation between two datasets, and a new dimensionality reduction technique. Although we illustrated the usefulness of our proposal with compositional data, its application is not limited, under any circumstances, to data of this nature. In fact, when data is not compositional, the centered logratio transformation and the zeroreplacement algorithm must not be applied.
Validation with synthetic data showed that the proposed method achieves accuracy values above 98%. Moreover, comparison of the performance of our method with that of SVM and the SVMRFE (i.e., two stateoftheart classification techniques), using two realworld datasets from 16 s RNA sequencing experiments, showed that our method outperforms the SVM method in both data sets, outperforms the SVMRFE method in the GG data set, and is a close competitor of the SVMRFE method in the AGP data set.
Future studies may address the ability of our proposed method to perform accurately for a broader range of dimensions (number of variables and samples) and assess its performance for more scenarios of dissimilar correlation structures other than that for ρ_{c} = 0.1 and ρ_{v} = 0.2. Moreover, our method may be extrapolated for multicategory classification, and a performance assessment may be conducted to test its classification accuracy in nonbinary scenarios.
Availability of data and materials
The source code, implemented in Python 3, is readily available in the following GitHub site: https://github.com/JoaoRacedo/arn_seq_pipeline. This code generates synthetic datasets to demonstrate the use of the pipeline. The American Gut Project’s datasets can be found on the following website: http://americangut.org. Finally, the Greengenes’ datasets can be found on: https://greengenes.lbl.gov/Download/OTUs/.
Abbreviations
 AGP:

American gut project
 alr:

Additive loratio
 ANN:

Artificial Neural Networks
 BM:

Bayesian multiplicative (algorithm)
 clr:

Centered logratio
 CSVMRMFE:

Correlation based support vector machine–recursive multiple feature elimination
 GG:

Greengenes (database)
 GrBoost:

Gradient boosting
 ilr:

Isometric logratio
 NN:

Nearest Neighbor
 OTU:

Operational taxonomic unit
 PCA:

Principal component analysis
 rRNA:

Ribosomal ribonucleic acid
 SPIECEASI:

Sparse inverse covariance estimation for ecological association inference
 SVM:

Support vector machine
 SVMRFE:

Support vector machine recursive feature elimination
References
 1.
Turnbaugh PJ, Ley RE, Hamady M, FraserLiggett CM, Knight R, Gordon JI. The Human Microbiome Project. Nature [Internet]. 2007;449(7164):804–10. Available from: https://doi.org/10.1038/nature06244.
 2.
Kitano H. Looking beyond the details: a rise in systemoriented approaches in genetics and molecular biology. Curr Genet [Internet]. 2002 [cited 2019 Nov 13];41(1):1–10. Available from: https://doi.org/10.1007/s002940020285z.
 3.
Oltvai ZN. Life’s complexity pyramid Zoltán N. Oltvai. 2010;763(2002).
 4.
Kitano H. Systems biology: a brief overview. 2015;(April 2002).
 5.
Voorhies AA, Ott CM, Mehta S, Pierson DL, Crucian BE, Feiveson A, et al. Study of the impact of longduration space missions at the International Space Station on the astronaut microbiome. Sci Rep [Internet]. 2019;1–17. Available from: https://doi.org/10.1038/s41598019463038
 6.
Somerville C, Somerville S. Plant functional genomics. Science. 1999;285(5426):380–3.
 7.
Gill R, Datta S, Datta S. A statistical framework for differential network analysis from microarray data. BMC Bioinformatics. 2010;11(1):95.
 8.
Gill R, Datta S, Datta S. dna: an R package for differential network analysis. Bioinformation. 2014;10(4):233.
 9.
Juric D, Lacayo NJ, Ramsey MC, Racevskis J, Wiernik PH, Rowe JM, et al. Differential gene expression patterns and interaction networks in BCRABL—positive and—negative adult acute lymphoblastic leukemias. J Clin Oncol. 2007;25(11):1341–9.
 10.
Van Treuren W, Ren B, Gevers D, Kugathasan S, Denson LA, Va Y, et al. Resource the treatmentnaive microbiome in newonset Crohn’s disease. Cell Host Microbe. 2014;15:382–92.
 11.
Ruan D, Young A, Montana G. Differential analysis of biological networks. BMC Bioinformatics. 2015;16(1):327.
 12.
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: opensource, platformindependent, communitysupported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75(23):7537–41.
 13.
Rao KR, Lakshminarayanan S. Partial correlation based variable selection approach for multivariate data classification methods. Chemom Intell Lab Syst. 2007;86(1):68–81.
 14.
Kurtz ZD, Müller CL, Miraldi ER, Littman DR, Blaser MJ, Bonneau RA. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol. 2015;11(5):e1004226.
 15.
Aitchison J. The statistical analysis of compositional data. J R Stat Soc Ser B. 1982:139–77.
 16.
Filzmoser P, Hron K, Reimann C. Science of the Total Environment Univariate statistical analysis of environmental (compositional) data: problems and possibilities. Sci Total Environ [Internet]. 2009;407(23):6100–8. Available from: https://doi.org/10.1016/j.scitotenv.2009.08.008.
 17.
Clark C, Kalita J. A comparison of algorithms for the pairwise alignment of biological networks. Bioinformatics [Internet]. 2014;30(16):2351–9. Available from: https://doi.org/10.1093/bioinformatics/btu307.
 18.
Atchison J, Shen SM. Logisticnormal distributions: some properties and uses. Biometrika. 1980;67(2):261–72.
 19.
Aitchison J. A new approach to null correlations of proportions. J Int Assoc Math Geol. 1981;13(2):175–89.
 20.
Egozcue JJ, PawlowskyGlahn V, MateuFigueras G, BarcelóVidal C. Isometric Logratio transformations for compositional data analysis. Math Geol [Internet]. 2003;35(3):279–300. Available from: https://doi.org/10.1023/A:1023818214614.
 21.
Greenacre M, Grunsky E. The isometric logratio transformation in compositional data analysis: a practical evaluation. 2019.
 22.
Pan M, Zhang J. Correlationbased linear discriminant classification for gene expression data. Genet Mol Res. 2017;16(1).
 23.
Hira ZM, Gillies DF. A review of feature selection and feature extraction methods applied on microarray data. Adv Bioinforma 2015;2015.
 24.
Goswami S, Chakrabarti A, Chakraborty B. Analysis of correlation structure of data set for efficient pattern classification. In: 2015 IEEE 2nd International Conference on Cybernetics (CYBCONF); 2015. p. 24–9.
 25.
Russell EL, Chiang LH, Braatz RD. Datadriven methods for fault detection and diagnosis in chemical processes. New York: Springer Science & Business Media; 2012.
 26.
Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17.
 27.
Serban N, CritchleyThorne R, Lee P, Holmes S. Gene expression network analysis and applications to immunology. Bioinformatics. 2007;23(7):850–8.
 28.
Friedman J, Alm EJ. Inferring correlation networks from genomic survey data. PLoS Comput Biol. 2012;8(9):e1002687.
 29.
Radovic M, Ghalwash M, Filipovic N, Obradovic Z. Minimum redundancy maximum relevance feature selection approach for temporal gene expression data. BMC Bioinformatics. 2017;18(1):1–14.
 30.
Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.
 31.
Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial markergene surveys. Nat Methods. 2013;10(12):1200–2.
 32.
Kavitha KR, Rajendran GS, Varsha J. A correlation based SVMrecursive multiple feature elimination classifier for breast cancer disease using microarray. In: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI); 2016. p. 2677–83.
 33.
Collins GS, Mallett S, Omar O, Yu LM. Developing risk prediction models for type 2 diabetes: a systematic review of methodology and reporting. BMC Med. 2011;9(1):103.
 34.
Aarøe J, Lindahl T, Dumeaux V, Sæbø S, Tobin D, Hagen N, et al. Gene expression profiling of peripheral blood cells for early detection of breast cancer. Breast Cancer Res. 2010;12(1):R7.
 35.
Datta S. Classification of breast cancer versus normal samples from mass spectrometry profiles using linear discriminant analysis of important features selected by random forest. Stat Appl Genet Mol Biol. 2008;7(2).
 36.
Šonka M, Hlaváč V, Boyle R. Image processing, analysis, and machine vision. International Student Edition; 2008.
 37.
Dembélé D, Kastner P. Fold change rank ordering statistics: a new method for detecting differentially expressed genes. BMC Bioinformatics. 2014;15(1):14.
 38.
Bevilacqua V, Mastronardi G, Menolascina F, Paradiso A, Tommasi S. Genetic algorithms and artificial neural networks in microarray data analysis: a distributed approach. Eng Lett. 2006;13(4).
 39.
Ca DAV, Mc V. Gene expression data classification using support vector machine and mutual informationbased gene selection. Proc Comput Sci. 2015;47:13–21.
 40.
van Dam S, Vosa U, van der Graaf A, Franke L, de Magalhaes JP. Gene coexpression analysis for functional classification and genedisease predictions. Brief Bioinform. 2018;19(4):575–92.
 41.
Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002;97(457):77–87.
 42.
Bhuvaneswari V, et al. Classification of microarray gene expression data by gene combinations using fuzzy logic (MGCFL). Int J Comput Sci Eng Appl. 2012;2(4):79.
 43.
Belciug S, Gorunescu F. Learning a singlehidden layer feedforward neural network using a rank correlationbased strategy with application to high dimensional gene expression and proteomic spectra datasets in cancer detection. J Biomed Inform. 2018;83:159–66.
 44.
Dettling M, Bühlmann P. Boosting for tumor classification with gene expression data. Bioinformatics. 2003;19(9):1061–9.
 45.
Friedman J, Hastie T, Tibshirani R, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Ann Stat. 2000;28(2):337–407.
 46.
Freund Y, Schapire RE. A decisiontheoretic generalization of online learning and an application to boosting. J Comput Syst Sci. 1997;55(1):119–39.
 47.
Fix E, Hodges Jr JL. Discriminatory analysisnonparametric discrimination: small sample performance; 1952.
 48.
Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. Boca Raton, FL: CRC Press; 1984.
 49.
MartínFernández JA, Hron K, Templ M, Filzmoser P, PalareaAlbaladejo J. Bayesianmultiplicative treatment of count zeros in compositional data sets. Stat Modelling. 2015;15(2):134–58.
 50.
Pearson K. Mathematical contributions to the theory of evolution—on a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc Lond. 1897;60(359–367):489–98.
 51.
McDonald D, Hyde E, Debelius JW, Morton JT, Gonzalez A, Ackermann G, et al. American Gut: an open platform for citizen science microbiome research. Msystems. 2018;3(3):e00031–18.
 52.
DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. Greengenes, a chimerachecked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006;72(7):5069–72.
Acknowledgements and funding
This study was financed by COLCIENCIAS grant No. 121556934635, contract 07702013. By the time this work was developed, IP was a doctoral student at Universidad del Norte, Colombia, whose PhD was funded by COLCIENCIAS and Gobernación del Atlántico (Colombia), grant No. 673 (2014), “Formación de Capital Humano de Alto Nivel para el Departamento del Atlántico”. EZ and HSJV are supported in part by award No. R01AI110385 from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD, USA. JIV was partially supported by research grant FOFICO 32101 PE0031 from Universidad del Norte, Barranquilla, Colombia.
Author information
Affiliations
Contributions
Technique design: SR, IP, EZ. Algorithms implementation: SR, IP. Experimental design: JIV, EZ, HSJV, MS. Writing of the manuscript: SR, IP, EZ, JIV, MS, HSJV.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Racedo, S., Portnoy, I., Vélez, J.I. et al. A new pipeline for structural characterization and classification of RNASeq microbiome data. BioData Mining 14, 31 (2021). https://doi.org/10.1186/s13040021002667
Received:
Accepted:
Published:
Keywords
 Microbial communities
 Compositional nature
 Classification method
 16 rRNA sequencing