A new pipeline for structural characterization and classification of RNA-Seq microbiome data

Background: High-throughput sequencing enables the analysis of the composition of numerous biological systems, such as microbial communities. The identification of dependencies within these systems requires the analysis and assimilation of the underlying interaction patterns between all the variables that make up the system. However, this task poses a challenge when considering the compositional nature of the data coming from DNA-sequencing experiments, because traditional interaction metrics (e.g., correlation) produce unreliable results when analyzing relative fractions instead of absolute abundances. The compositionality-associated challenges extend to the classification task, as it usually involves the characterization of the interactions between the principal descriptive variables of the datasets. The classification of new samples/patients into binary categories corresponding to dissimilar biological settings or phenotypes (e.g., controls and cases) could help researchers in the development of treatments/drugs.

Results: Here, we develop and exemplify a new approach, applicable to compositional data, for the classification of new samples into two groups with different biological settings. We propose a new metric to characterize and quantify the overall correlation structure deviation between these groups, as well as a technique for dimensionality reduction to facilitate graphical representation. We conduct simulation experiments with synthetic data to assess the proposed method's classification accuracy. Moreover, we illustrate the performance of the proposed approach using Operational Taxonomic Unit (OTU) count tables obtained through 16S rRNA gene sequencing data from two microbiota experiments. We also compare our method's performance with that of two state-of-the-art methods.

Conclusions: Simulation experiments show that our method achieves a classification accuracy equal to or greater than 98% when using synthetic data.
Finally, our method outperforms the other classification methods with real datasets from gene sequencing experiments.


Background
Microorganisms living inside and on humans are known as the microbiota. When this community is considered together with its genetic information, it is known as the microbiome. The Human Microbiome Project (HMP) was an endeavor to characterize the human microbiota and further the understanding of its impact on human health and disease [1].
In recent years, the biological sciences have experienced substantial technological advances that have led to the rediscovery of systems biology [2][3][4]. These advances were possible thanks to the technological ability to completely sequence the genome of any organism at a low cost [5,6]. Such advances triggered the development of various analytic approaches and technologies to simultaneously monitor all the components within cells (e.g., genes and proteins). With the genome information and analytic technologies, the mining and exploration of the resulting data opened up the possibility to better understand biological systems, such as microbial populations, and their complexity. The network structure of such biological systems can give insight into the underlying interactions taking place within those systems [7][8][9][10]. Furthermore, the understanding of these interactions can lead to the discovery of new methods that can help physicians, biologists, scientists, and healthcare workers with disease diagnosis, gene identification, classification of new data, and many other tasks [11].
We initially conducted a literature search in different medical, biological, and engineering databases as well as academic sites and prestigious journals, such as BMC Bioinformatics, PLOS ONE, ScienceDirect, and IEEE Xplore, using the queries "correlation structure for gene expression classifications," "classifiers for compositional data," and "classifiers based on correlation structures," in order to identify papers in English using procedures for sample classification based on correlation structures in the 2009-2019 time window. Figure 1 shows the evolution of the number of publications retrieved when the keywords "correlation structure for gene expression classifications" are used. Publications were retrieved from several academic sites, namely BMC Bioinformatics, PLOS ONE, ScienceDirect, and Scopus. Figure 2 summarizes the principal stages of gene expression analysis for sample classification.
Operational Taxonomic Unit (OTU) count tables are the usual output when processing the 16S rRNA sequences of microbiota samples [12]. These tables show the relative abundances of the bacteria that make up a microbiota population (e.g., the human gut microbiota). OTU-based data have a compositional nature, which makes them difficult to work with [13,14]. Thus, data transformation is required prior to any further analysis.
Aitchison [15] proposed two transformations to compensate for the data's compositionality, thus allowing the use of standard metrics in further analysis. The first transformation is the additive log-ratio (alr), which is defined as:

$$\text{alr}(x) = \left[\ln\frac{x_1}{x_j}, \ln\frac{x_2}{x_j}, \ldots, \ln\frac{x_n}{x_j}\right],$$

where $x_j$ is an element of $\{x_1, x_2, x_3, \ldots, x_n\}$. Because one value $x_j$ is selected as the denominator to build the log-ratios, the alr has been criticized as being subjective, since the outcome depends mostly on the $x_j$ selected [15][16][17][18]. The second transformation proposed by Aitchison is the centered log-ratio (clr), which is defined as:

$$\text{clr}(x) = \left[\ln\frac{x_1}{g(x)}, \ln\frac{x_2}{g(x)}, \ldots, \ln\frac{x_n}{g(x)}\right],$$

where $g(x) = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$ is the geometric mean. The use of $g(x)$ avoids the subjectivity of the alr transformation, since the method takes all the information in $x$ [15][16][17][18][19]. The clr transformation has proven to be reliable and has been extensively used in the scientific literature over the years to analyze microbiome data.
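For concreteness, the clr transformation can be sketched in a few lines of Python; the function name `clr` and the toy composition are illustrative, not part of the original pipeline:

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of a strictly positive composition x."""
    x = np.asarray(x, dtype=float)
    g = np.exp(np.mean(np.log(x)))  # geometric mean g(x)
    return np.log(x / g)

# A 4-part composition (relative abundances summing to 1).
x = np.array([0.1, 0.2, 0.3, 0.4])
z = clr(x)
# clr-transformed values always sum to zero (they live on a hyperplane).
print(np.isclose(z.sum(), 0.0))  # True
```

A useful property shown above: clr coordinates always sum to zero, which is why the clr covariance matrix is singular.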
In [20], the authors proposed a transformation called the isometric log-ratio (ilr) transformation. This approach takes any compositional data $x \in S^N$ and computes $\text{ilr}(x) = z = [z_1, z_2, \ldots, z_{N-1}]$, where $z_i$ is calculated as:

$$z_i = \sqrt{\frac{i}{i+1}}\,\ln\!\left(\frac{\left(\prod_{j=1}^{i} x_j\right)^{1/i}}{x_{i+1}}\right), \quad i = 1, 2, \ldots, N-1.$$

However, implementing the ilr transformation poses serious practical difficulties for high-dimensional data, as the computational complexity increases rapidly with dimensionality [21].

Feature selection
After transforming the data, the next step is to separate the data into training, test, and validation sets, although in some cases only the training and test sets are considered. One of the most common problems prior to that step is the limited number of data samples. Indeed, for a typical classifier employing multivariate metrical techniques, the sample size required for optimum training is on the order of thousands. This is known as the "curse of dimensionality" problem, and the usual way to overcome this limitation is to use a dimensionality reduction technique to collapse all the attributes (variables) into a lower-dimension space where the most dominant information of the dataset can be retrieved [13,22].
Feature selection methods are usually separated into three categories: filter, wrapper, and embedded. Table 1 summarizes different approaches for feature selection in gene expression data, the most relevant categories for feature selection, and the current weaknesses when analyzing gene expression data. Filter methods can work with univariate and multivariate data, where univariate methods focus on each feature separately and multivariate methods focus on finding relationships between features [23,24]. Here we only consider multivariate methods.
The abovementioned filter methods tend to be computationally efficient. Wrapper methods, on the other hand, tend to have a better performance in selecting features since they take a model hypothesis into account, meaning that a training and testing procedure is made in the feature space. However, this approach is computationally inefficient and is more problematic as the feature space grows [23,26,29,30]. Embedded methods make the feature selection based on the classifier (i.e., selected features might not work with any other classifier) and hence tend to have a better computational performance than wrappers. This is the case because the optimal set of descriptors is built when the classifier is constructed and the feature selection is affected by the hypotheses made by the classifier [23,26,[29][30][31].
In [14], the authors presented SParse InversE Covariance Estimation for Ecological ASsociation Inference (SPIEC-EASI), a novel strategy to infer networks from high-dimensional community compositional data. SPIEC-EASI estimates the interaction graph from the transformed data using either neighborhood selection or sparse inverse covariance selection and seeks to infer an underlying graphical model using conditional independence. In [32], the authors proposed a modification of the Support Vector Machine-Recursive Feature Elimination (SVM-RFE) algorithm for feature selection. SVM-RFE removes one irrelevant feature at each iteration, but this can be troublesome when the number of features is large. Its modification, Correlation-based Support Vector Machine-Recursive Multiple Feature Elimination (CSVM-RMFE), finds the correlated features and removes more than one irrelevant feature per iteration. Rao and Lakshminarayanan [13] presented a new significant attribute selection method based on the Partial Correlation Coefficient Matrix (PCCM).

Table 1 Summary of feature selection approaches in gene expression analysis

Classification
The final step after finding the most relevant features of the transformed data is to select a classifier. In clinical and bioinformatic research, prediction models are extensively used to derive classification rules useful to accurately predict whether a patient has or would develop a disease, whether the treatment is going to work, or even whether a disease would recur [33][34][35]. Table 2 summarizes the relevant aspects of some widely used classifiers. Depending on the data, a classifier can belong to one of two groups: supervised or unsupervised [36]. In supervised classification (learning), samples are labeled according to some a priori-defined classes or categories, whereas in unsupervised learning, samples are not labeled, and the classifier clusters the data into different classes or categories after maximizing or minimizing a set of criteria.
Dembélé and Kastner [37] presented a new fold change method that can detect differentially expressed genes in microarray data. The traditional fold change method works by calculating the ratio between the averages from the samples (usually two different biological conditions, e.g., control and case samples). Then, cutoff values (e.g., 0.5 for down- and 2 for up-regulated genes) are used to select genes below/above such thresholds. This new approach is more accurate and faster than the traditional method and can assign a metric to each differentially expressed gene, which can be used as a selection criterion.
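As a worked illustration of the traditional fold change selection described above, the sketch below flags genes whose case/control mean-expression ratio crosses the usual cutoffs; `fold_change_select` and the simulated expression matrices are hypothetical:

```python
import numpy as np

def fold_change_select(control, case, low=0.5, high=2.0):
    """Flag genes whose case/control ratio of mean expression falls below
    `low` (down-regulated) or rises above `high` (up-regulated)."""
    fc = case.mean(axis=0) / control.mean(axis=0)
    return fc, (fc <= low) | (fc >= high)

rng = np.random.default_rng(0)
control = rng.uniform(1, 2, size=(10, 5))               # 10 samples x 5 genes
case = control * np.array([1.0, 3.0, 0.3, 1.1, 0.9])    # genes 1 and 2 shifted
fc, selected = fold_change_select(control, case)
print(selected)  # only the strongly shifted genes are flagged
```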
Belciug and Gorunescu [43] proposed a novel initialization of a single-hidden-layer feedforward neural network's input weights using the knowledge embedded in the connections between variables and class labels, expressed by the nonparametric Goodman-Kruskal Gamma rank correlation instead of the traditional random initialization. The use of this correlation also helped increase computational speed by eliminating unnecessary features based on the significance of the rank correlation between variables and class labels. In [42], the authors proposed a framework to find information about genes and to classify gene combinations into their relevant subtypes using fuzzy logic, which maps numerical data (input/output pairs) into human linguistic terms, offering good capabilities for dealing with noisy and missing data. However, defining the rules and membership functions might require substantial prior knowledge from a human expert [41]. Dettling and Bühlmann [44] proposed a boosting method combining a dimensionality reduction step with the LogitBoost algorithm [45] and compared it to AdaBoost.M1 [46], the nearest-neighbor classifier [47], and classification and regression trees (CART) [48] using gene expression data. They showed that, for low-dimensional data, LogitBoost can perform slightly better than AdaBoost.M1 and that, for real high-dimensional data, their approach can outperform the other classifiers in some cases.
In this paper, we present a new method to classify samples into two groups with different characteristics (e.g., phenotypes or health conditions) when data of a compositional nature are available. Our method relies on a new metric to quantitatively characterize the overall correlation structure deviation between two datasets and on a new dimensionality reduction approach. The proposed method is assessed and compared, based on classification accuracy, to two state-of-the-art methods using both synthetic datasets and real datasets from 16S rRNA sequencing experiments.

Proposed classification method
Here, we explain the proposed classification method in detail. First, in section "Data pretreatment", we introduce the data pretreatment stage. In section "Assessing correlation structure distortion", we describe a novel metric to assess correlation structure distortion. Finally, in sections "Dimensionality reduction technique" and "Proposed classification rule", we present a dimensionality-reduction approach to assess the disruption of a dataset's correlation structure after a new sample is included, and a classification rule based on the previously defined metric.

Data pretreatment
Let $X^{\rho}_c \in \mathbb{R}^{n_c \times m}$ and $X^{\rho}_v \in \mathbb{R}^{n_v \times m}$ be the OTU count tables where $m$ features are assessed in $n_c$ and $n_v$ samples from control and case individuals, respectively. In the expressions above, the superindex $\rho$ indicates that the datasets are 'raw', i.e., without pretreatment. From now on, $X^{\rho}_g$ will represent either of the two groups ($g = c$ for control, or $g = v$ for case). When analyzing OTU count tables, a log-ratio transformation, such as the clr, is to be applied [15,18,19] before estimating correlations. However, in order to apply the log-ratio transformation, it is necessary to consider that compositional count datasets may contain null values resulting from insufficiently large or non-existent samples. As log-ratio transformations require data with exclusively positive values, the use of a zero-replacement method is a must. Here we use the Bayesian-multiplicative (BM) algorithm proposed by Martín-Fernández [49]. Let $x^{\rho}_i \in \mathbb{R}^{1 \times m}$ be the $i$-th row of the matrix $X^{\rho}_g$ ($i = 1, 2, \ldots, n_g$). The BM algorithm replaces the null counts as follows:

$$\hat{x}^{\rho}_{i,j} = \begin{cases} t_{i,j}\,\dfrac{s_i}{n+s_i}, & \text{if } x^{\rho}_{i,j} = 0,\\[6pt] x^{\rho}_{i,j}\left(1 - \dfrac{s_i}{n+s_i}\displaystyle\sum_{k:\,x^{\rho}_{i,k}=0} t_{i,k}\right), & \text{if } x^{\rho}_{i,j} > 0. \end{cases}$$

When using the Bayes-Laplace prior, we set $n = \sum_{j=1}^{m} x^{\rho}_{i,j}$, $t_{i,j} = m^{-1}$, and $s_i = m$. Let $X^{BM}_g := BM(X^{\rho}_g)$ be the resulting matrix after the BM algorithm is applied row-wise to $X^{\rho}_g$.
To ensure the data's compositionality, a closure operation [15,18,19] is applied to every row $x^{BM}_i$ of $X^{BM}_g$, as follows:

$$c(x^{BM}_i) = \left[\frac{k\,x^{BM}_{i,1}}{\sum_{j=1}^{m} x^{BM}_{i,j}}, \frac{k\,x^{BM}_{i,2}}{\sum_{j=1}^{m} x^{BM}_{i,j}}, \ldots, \frac{k\,x^{BM}_{i,m}}{\sum_{j=1}^{m} x^{BM}_{i,j}}\right],$$

where $k$ is an arbitrary constant (usually $k = 100$). Let $X^{BM,c}_g := c(BM(X^{\rho}_g))$ be the resulting matrix after the BM algorithm and the closure operation have been applied. Now, the clr transformation is applied to each row vector $x^{BM,c}_i \in \mathbb{R}^{1 \times m}$ of $X^{BM,c}_g$, where the geometric mean is taken over the $m$ parts of the row. Hence,

$$X_g(i,:) = \text{clr}(x^{BM,c}_i) = \left[\ln\frac{x^{BM,c}_{i,1}}{g(x^{BM,c}_i)}, \ldots, \ln\frac{x^{BM,c}_{i,m}}{g(x^{BM,c}_i)}\right].$$

Finally, a normalization is applied, resulting in:

$$X_{g_{norm}} = \left(X_g - I_g b^T_g\right)\Sigma^{-1}_g,$$

where $I_g = [1\ 1 \ldots 1]^T \in \mathbb{R}^{n_g \times 1}$ is a column vector of ones, $b_g \in \mathbb{R}^{m \times 1}$ is a column vector that contains the means of all the variables in $X_g$, and $\Sigma_g \in \mathbb{R}^{m \times m}$ is a diagonal matrix that contains the standard deviations ($\sigma_{g_i}$, for $i = 1, \ldots, m$) of all variables.
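A compact sketch of the whole pretreatment chain (zero replacement with the Bayes-Laplace prior, closure, clr, and column standardization) is given below. The replacement step follows our reading of the BM method with $t_{i,j} = 1/m$ and $s_i = m$; function names are illustrative:

```python
import numpy as np

def bm_replace_row(row):
    """Bayesian-multiplicative zero replacement with the Bayes-Laplace
    prior (t_ij = 1/m, s_i = m); a sketch of the method in [49]."""
    m, n = row.size, row.sum()
    p = row / n                          # observed proportions
    zeros = row == 0
    repl = 1.0 / (n + m)                 # s_i * t_ij / (n + s_i)
    p = p * (1.0 - repl * zeros.sum())   # shrink non-zero parts multiplicatively
    p[zeros] = repl                      # fill zeros with the posterior estimate
    return p

def pretreat(X_raw, k=100.0):
    """Zero replacement -> closure to sum k -> clr -> column z-scoring."""
    X = np.apply_along_axis(bm_replace_row, 1, np.asarray(X_raw, dtype=float))
    X = k * X / X.sum(axis=1, keepdims=True)                 # closure
    X = np.log(X) - np.log(X).mean(axis=1, keepdims=True)    # clr
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)      # normalization

X_raw = np.array([[10, 0, 5, 85], [20, 3, 0, 77], [8, 1, 4, 87]])
X = pretreat(X_raw)
print(X.shape)  # (3, 4)
```

After pretreatment, each column has zero mean and unit standard deviation, which is what the correlation computations in the next section assume.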

Assessing correlation structure distortion
Here, we introduce $\varphi$, a new metric to quantitatively assess the distortion in the correlation structure of a dataset after the incorporation of a new sample. The Pearson correlation matrix for $X_g$ is calculated as follows [50]:

$$S_g = \frac{1}{n_g - 1}\, X^T_{g_{norm}} X_{g_{norm}}.$$

Now, consider a new sample, $x_p \in \mathbb{R}^{1 \times m}$. The pretreatment step (zero replacement, closure, and clr transformation) for this sample yields $x_p = \text{clr}(c(BM(x^{\rho}_p)))$. Let $\tilde{X}_g \in \mathbb{R}^{\tilde{n}_g \times m}$ be the (augmented) dataset $X_g$ after incorporating the new sample, and let $S_g$ and $\tilde{S}_g$ be the correlation matrices for $X_g$ and $\tilde{X}_g$, respectively. The spectral decompositions of these matrices are

$$S_g = V_g \Lambda_g V^T_g, \qquad \tilde{S}_g = \tilde{V}_g \tilde{\Lambda}_g \tilde{V}^T_g,$$

where $\Lambda_g, \tilde{\Lambda}_g \in \mathbb{R}^{m \times m}$ are diagonal matrices containing the eigenvalues of $S_g$ and $\tilde{S}_g$, and $V_g, \tilde{V}_g \in \mathbb{R}^{m \times m}$ are the corresponding eigenvector matrices. Figure 3a illustrates, in a 2-dimensional example, the datasets $X_g$ and $\tilde{X}_g$. Figure 3b illustrates the datasets after carrying out the pretreatment, along with their eigenvectors (which are unitary) scaled by the corresponding eigenvalues obtained from the spectral decompositions. Note that the scaled eigenvectors mark out the directions of largest variability, capturing high-order interactions between the OTUs ruling the overall association structure. Therefore, looking at deviations in both the magnitude and direction of those scaled eigenvectors should give insightful information on overall changes in the association structure of a microbiota population. Based on the abovementioned remarks, we introduce $\varphi$ to characterize the distortion produced in the underlying correlation structure when two OTU count datasets are compared. This metric first requires a dimensional reduction, performed by selecting the principal components for each sample group. This procedure, integrated within the Principal Component Analysis (PCA) algorithm [25], consists of finding the minimum number of eigenvalues, $a_g$ or $\tilde{a}_g$ (for $X_g$ and $\tilde{X}_g$, respectively), that explain $100(1-\alpha)\%$ of the total variance, i.e.:

$$a_g = \min\left\{a \in \{1, \ldots, m\} : \frac{\sum_{j=1}^{a} \lambda_{g_j}}{\sum_{j=1}^{m} \lambda_{g_j}} \geq 1 - \alpha\right\}.$$

Thus, $\varphi$ is defined as

$$\varphi(S_g, \tilde{S}_g) = \sum_{j=1}^{\max\{a_g,\, \tilde{a}_g\}} \max\{\lambda_{g_j}, \tilde{\lambda}_{g_j}\}\left[\left|\lambda_{g_j} - \tilde{\lambda}_{g_j}\right| + \cos^{-1}\!\left(v^T_{g_j}\tilde{v}_{g_j}\right)\right], \qquad (14)$$

where $(\lambda_{g_j} - \tilde{\lambda}_{g_j})$ is the algebraic difference (magnitude deviation) of the $j$-th eigenvalues in $\Lambda_g$ and $\tilde{\Lambda}_g$, $\cos^{-1}(v^T_{g_j}\tilde{v}_{g_j})$ computes the angular deviation between the $j$-th eigenvectors in $V_g$ and $\tilde{V}_g$, and $\max\{\lambda_{g_j}, \tilde{\lambda}_{g_j}\}$ provides a weighting factor so that the contribution of the $j$-th deviation to the index $\varphi$ is proportional to the relative importance among principal components.

Fig. 3 Bidimensional representation of datasets $X_g$ and $\tilde{X}_g$: (a) without pretreatment, and (b) after the pretreatment, along with the eigenvectors scaled by the corresponding eigenvalues
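The metric can be prototyped directly from eigendecompositions of the two correlation matrices. The weighting below (eigenvalue gap plus principal angle, weighted by the larger paired eigenvalue) follows our reading of the description of the metric's terms; `phi` is an illustrative name, and the absolute value on the dot product is our choice to resolve eigenvector sign ambiguity:

```python
import numpy as np

def phi(S, S_tilde, alpha=0.05):
    """Distortion between two correlation matrices, sketched from the
    description of the metric: compare the leading eigenpairs that
    explain 100(1 - alpha)% of the total variance."""
    def leading(M):
        vals, vecs = np.linalg.eigh(M)
        order = np.argsort(vals)[::-1]                   # descending eigenvalues
        vals, vecs = vals[order], vecs[:, order]
        a = np.searchsorted(np.cumsum(vals) / vals.sum(), 1 - alpha) + 1
        return vals, vecs, a

    lam, V, a1 = leading(S)
    lam_t, V_t, a2 = leading(S_tilde)
    total = 0.0
    for j in range(max(a1, a2)):
        # clip for numerical safety before arccos; abs() fixes sign ambiguity
        ang = np.arccos(np.clip(abs(V[:, j] @ V_t[:, j]), -1.0, 1.0))
        total += max(lam[j], lam_t[j]) * (abs(lam[j] - lam_t[j]) + ang)
    return total

S_identical = np.eye(3)
print(phi(S_identical, S_identical))  # 0.0: identical structures, no distortion
```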

Dimensionality reduction technique
Now that we have a metric to measure the distortion caused in the correlation structure of the g group after the incorporation of a new sample, we could then infer to which group the new sample would belong, providing a classification criterion based on how distorted the correlation structure is when incorporating x p . The intuitive way of approaching the evaluation of the distortion would be to integrate x p into X g and (re)calculate the correlation matrix for the further evaluation of its distortion. However, considering that the g group may contain many samples, a single new sample may not be enough to generate a significant distortion in the correlation structure. Furthermore, if the number of samples in the groups is unbalanced, the distortion caused by the inclusion of a new sample may not be comparable.
An approach to overcome this dimensional problem is to randomly subsample a small number of rows in X g , combining them with x p , and then calculating the distortion caused. This approach, however, would not include a considerable amount of information, which is contained in the rows that were left out. To address this issue, we propose a new dimensionality reduction approach that allows a weighted assessment of the distortion in S g caused by the integration of a new sample x p . This approach will use all the information contained in the original data, with the objective of providing a classification algorithm for any upcoming sample.
The first step of the proposed approach is to find an expression for the distorted correlation matrix that reveals the natural weights of the contributions of $X_g$ and $x_p$ to the make-up of the new correlation structure. Suppose that the data are concatenated as:

$$\tilde{X}_g = \begin{bmatrix} X_g \\ x_p \end{bmatrix} \in \mathbb{R}^{\tilde{n}_g \times m}, \qquad (15)$$

where $\tilde{n}_g = n_g + 1$ is the number of rows of $\tilde{X}_g$. Normalizing $\tilde{X}_g$ produces

$$\tilde{X}_{g_{norm}} = \left(\tilde{X}_g - I_{\tilde{n}_g}\tilde{b}^T_g\right)\tilde{\Sigma}^{-1}_g, \qquad (17)$$

where $\tilde{b}_g$ is the vector that contains the means of $\tilde{X}_g$, $\tilde{\Sigma}_g$ is a diagonal matrix that contains the distorted standard deviations, and $\Delta b_g := \tilde{b}_g - b_g$ is the distortion in the mean vector. Both $\tilde{b}_g$ and $\tilde{\Sigma}_g$ are unknown; thus, we need to derive expressions for them. The distorted mean vector is calculated as $\tilde{b}_g = \frac{1}{\tilde{n}_g}\tilde{X}^T_g I_{\tilde{n}_g}$, which can be converted into:

$$\tilde{b}_g = \frac{n_g}{n_g + 1}\, b_g + \frac{1}{n_g + 1}\, x^T_p. \qquad (18)$$

Equation (18) shows that the natural weights are $w_1 = \frac{n_g}{n_g + 1}$ and $w_2 = \frac{1}{n_g + 1}$ for $b_g$ and $x_p$, respectively. To find an expression for the diagonal matrix of distorted standard deviations, $\tilde{\Sigma}_g$, a column-wise subtraction of the mean vector of $\tilde{X}_g$ is performed:

$$\tilde{X}_{g_{mean\text{-}centered}} = \tilde{X}_g - I_{\tilde{n}_g}\tilde{b}^T_g.$$

Let $\tilde{X}_{g_{mean\text{-}centered}}(:,i)$ denote the $i$-th column, i.e., the $i$-th variable. The variance of this variable is $\tilde{\sigma}^2_{g_i} = \frac{1}{\tilde{n}_g - 1}\left(\tilde{X}_{g_{mean\text{-}centered}}(:,i)\right)^T \tilde{X}_{g_{mean\text{-}centered}}(:,i)$. Adding and subtracting $I_{n_g} b^T_g$, and noticing that $\left(X_g(:,i) - b_g(i)I_{n_g}\right)^T\left(X_g(:,i) - b_g(i)I_{n_g}\right) = (n_g - 1)\sigma^2_{g_i}$, $I^T_{n_g} I_{n_g} = n_g$, and $I^T_{n_g}\left(X_g(:,i) - b_g(i)I_{n_g}\right) = 0$, the expansion reduces, with $\tilde{n}_g = n_g + 1$, to:

$$\tilde{\sigma}^2_{g_i} = \frac{1}{n_g}\left[(n_g - 1)\,\sigma^2_{g_i} + n_g\, \Delta b_g(i)^2 + \left(x_p(i) - \tilde{b}_g(i)\right)^2\right]. \qquad (25)$$

Having expressions for $\tilde{b}_g$ and $\tilde{\Sigma}_g$, it follows that the distorted correlation matrix is $\tilde{S}_g = \frac{1}{\tilde{n}_g - 1}\tilde{X}^T_{g_{norm}}\tilde{X}_{g_{norm}}$. Combining $\tilde{S}_g$ with Eq. (17), and using $X^T_{g_{norm}} X_{g_{norm}} = (n_g - 1)S_g$, $X_{g_{norm}}\Sigma_g = X_g - I_{n_g} b^T_g$, and $I^T_{n_g}\left(X_g - I_{n_g} b^T_g\right) = 0$ (so that the cross terms disappear), the distorted correlation matrix $\tilde{S}_g$ is given by

$$\tilde{S}_g = \frac{n_g - 1}{n_g}\,\tilde{\Sigma}^{-1}_g \Sigma_g S_g \Sigma_g \tilde{\Sigma}^{-1}_g + \frac{1}{n_g}\, x^T_{p_{norm}} x_{p_{norm}} + \tilde{\Sigma}^{-1}_g \Delta b_g \Delta b^T_g \tilde{\Sigma}^{-1}_g, \qquad (30)$$

where $x_{p_{norm}} = \left(x_p - \tilde{b}^T_g\right)\tilde{\Sigma}^{-1}_g$. Note that, in this expression, $\tilde{S}_g$ depends on three terms:

1. $\frac{n_g - 1}{n_g}\,\tilde{\Sigma}^{-1}_g \Sigma_g S_g \Sigma_g \tilde{\Sigma}^{-1}_g$, which considers the contribution of the non-distorted correlation matrix $S_g$ after an actualization of the standard deviations, with a natural weight of $\frac{n_g - 1}{n_g}$.
2. $x^T_{p_{norm}} x_{p_{norm}}$, which considers the contribution of the new sample to the constitution of the distorted correlation matrix, with a natural weight of $\frac{1}{n_g}$.
3. $\tilde{\Sigma}^{-1}_g \Delta b_g \Delta b^T_g \tilde{\Sigma}^{-1}_g$, which considers the effects of the distortions of $\Sigma_g$ and $b_g$ on $\tilde{S}_g$.
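The derivation can be checked numerically: the snippet below implements the distorted mean, standard deviation, and correlation matrix as we reconstruct Eqs. (18), (25), and (30), and verifies that the weighted update reproduces a direct recomputation on the augmented dataset (a numerical sanity check, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))    # pretreated group data X_g
x_p = rng.normal(size=4)        # incoming sample
n = X.shape[0]

b = X.mean(axis=0)
sig = X.std(axis=0, ddof=1)
S = np.corrcoef(X, rowvar=False)

b_t = (n * b + x_p) / (n + 1)   # distorted mean: weights n/(n+1) and 1/(n+1)
db = b_t - b
sig_t = np.sqrt(((n - 1) * sig**2 + n * db**2 + (x_p - b_t)**2) / n)
xp_norm = (x_p - b_t) / sig_t

# Three-term update: rescaled S, new-sample term, mean-distortion term.
S_t = ((n - 1) / n) * (S * np.outer(sig, sig)) / np.outer(sig_t, sig_t) \
      + np.outer(xp_norm, xp_norm) / n \
      + np.outer(db, db) / np.outer(sig_t, sig_t)

# Direct recomputation on the augmented dataset agrees with the update.
S_direct = np.corrcoef(np.vstack([X, x_p]), rowvar=False)
print(np.allclose(S_t, S_direct))  # True
```

The check confirms the natural weights: the old structure enters with weight (n-1)/n and the new sample with weight 1/n, so a single sample barely moves the correlation matrix when n is large.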
Finally, the distortion of the correlation matrix will be measured by estimating the deviation between $S_g$ and $\tilde{S}_g$ using the metric $\varphi(S_g, \tilde{S}_g)$ defined in Eq. (14). As previously mentioned, if the number of samples in group $g$ is large, the integration of $x_p$ will barely cause a distortion in the correlation structure, even if its features differ from those of the samples in $X_g$. For example, if $X_g$ were composed of 200 samples, the natural relative weight of the mean vector ($b_g$) in the construction of the distorted mean vector would be ~0.995, while the natural weight of the sample would (only) be ~0.005.
On the other hand, if the weights were calculated assuming that $X_g$ is composed of few samples, that is, replacing $n_g$ with $n^{red}_g$ (so that $n^{red}_g < n_g$) in the quotients used to calculate the relative weights, these weights would be more even and provide a weighting factor for the calculation of the distorted correlation matrix using all the information contained in the original samples of $X_g$ (in $b_g$, $\Sigma_g$, and $S_g$). This is equivalent to finding a generating base of a few samples/patients ($n^{red}_g$) that can represent all the characteristics of $X_g$, incorporating $x_p$, and then evaluating the distortion caused to the correlation structure, providing an artificial dimensional reduction. For example, if the relative weights were calculated assuming that $X_g$ is composed of only three samples that exhibit all the attributes of the original dataset (i.e., $n^{red}_g = 3$), these weights would take the values 0.75 and 0.25 for the calculation of the distorted mean vector.
The lower threshold for this artificial dimensional reduction is found by setting $n^{red}_g = 2$ in the calculation of the relative weights. If $n^{red}_g = 1$, all the information contained in $S_g$ would be left out of the estimation of $\tilde{S}_g$ (see Eq. (30)). A similar result is obtained for the standard deviation (see Eq. (25)).
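The weights discussed above are simple ratios and can be tabulated directly; this small sketch reproduces the numbers quoted in the text:

```python
# Natural weights for the distorted mean vector at artificial dimension n_red:
# w1 = n_red/(n_red + 1) for the group mean, w2 = 1/(n_red + 1) for the sample.
weights = {n_red: (n_red / (n_red + 1), 1 / (n_red + 1)) for n_red in (200, 3, 2)}
for n_red, (w1, w2) in weights.items():
    print(f"n_red={n_red}: w1={w1:.3f}, w2={w2:.3f}")
# n_red=200 gives ~0.995/0.005; n_red=3 gives 0.75/0.25; the lower
# threshold n_red=2 gives roughly 0.667/0.333.
```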

Proposed classification rule
Now that the artificial dimensional reduction approach has been proposed, it will be used alongside the metric $\varphi$ to create a tool to classify new samples/patients into either the control or the case group. The classifier works under the assumption that a sample's likelihood of belonging to either group is inversely proportional to the distortion caused by its incorporation into that group. This classification approach includes the following steps:

1. Store the new sample in $x_p$.
2. Define the "maximum artificial dimension" to be evaluated as $\bar{n} \leq \min(n_c, n_v)$ ($\bar{n} \in \mathbb{Z}^+$), and choose a dimension "step of change", $\Delta n \in \mathbb{Z}^+$, such that $\bar{n} - 2$ is divisible by $\Delta n$.
3. Evaluate Eqs. (18), (25), (26), and (30) using $n^{red}_g$ instead of $n_g$. Perform this evaluation for both $g = c$ and $g = v$, and for all values of $n^{red}_g$. Store the resulting distorted correlation matrices as $\tilde{S}_g\big|_{n^{red}_g}$.
4. For each $n^{red}_g = (2,\ 2 + \Delta n,\ 2 + 2\Delta n,\ \ldots,\ \bar{n})$, calculate

$$\left(\psi_g\right)\big|_{n^{red}_g} = \left|\ln \varphi\!\left(S_g, \tilde{S}_g\big|_{n^{red}_g}\right)\right|,$$

where $|l|$ is the absolute value of $l$. In consequence, large values of $\psi$ indicate a small distortion in the correlation structure and, therefore, a high degree of affinity between $X_g$ and $x_p$. On the other hand, small values of $\psi$ indicate a big distortion and a low degree of affinity between $X_g$ and $x_p$.
5. Calculate the average value of $(\psi_g)\big|_{n^{red}_g}$ as

$$\bar{\psi}_g = \frac{\Delta n}{\bar{n} - 2 + \Delta n} \sum_{n^{red}_g} \left(\psi_g\right)\big|_{n^{red}_g}.$$

6. Finally, the outcomes of the proposed classification rule, for a single sample, are $\bar{\psi}_c$ and $\bar{\psi}_v$. The method classifies the sample into the group with the greater value of $\bar{\psi}_g$. Figure 4 shows a graphical representation to visualize the outcome of the proposed classification method after classifying a set of new samples one by one.
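The rule can be sketched end-to-end once a distortion function is available. Here `distortion(g, x_p, n_red)` is a hypothetical stand-in for evaluating the distortion metric at a given artificial dimension, and the |ln(phi)| affinity is our hedged reading of the step-4 formula:

```python
import numpy as np

def psi(phi_value):
    """Affinity score: large psi corresponds to a small distortion."""
    return abs(np.log(phi_value))

def classify(x_p, distortion, n_bar, dn):
    """Average psi over artificial dimensions 2, 2+dn, ..., n_bar for each
    group, then pick the group with the larger mean affinity."""
    scores = {}
    for g in ("control", "case"):
        dims = range(2, n_bar + 1, dn)
        scores[g] = np.mean([psi(distortion(g, x_p, n_red)) for n_red in dims])
    return max(scores, key=scores.get), scores

# Toy distortion: the sample barely distorts the control group's structure.
toy = lambda g, x_p, n_red: 1e-3 if g == "control" else 0.5
label, scores = classify(None, toy, n_bar=10, dn=2)
print(label)  # control
```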

Performance assessment with synthetic data
In this section, we assess the ability of the proposed method to correctly classify synthetically generated data.

Synthetic data generation
We conducted in silico experiments to assess the performance of the proposed method under different parameter settings. The following procedure was used to generate synthetic datasets:

Performance assessment procedure
We used the correct classification rate (accuracy) as the assessment criterion to measure the performance of our method as follows: 2. For every pair $(X_{c_r}, X_{v_r})$, execute the proposed algorithm with each row sample $x_{p_i} = X_{Total}[i,:]$, $i = \{1, 2, \ldots, 2n\}$, and classify $x_{p_i}$.
3. Compute the average classification accuracy as:

$$\text{accuracy} = \frac{N}{2n} \times 100\%,$$

where $N$ is the number of correctly classified samples.

Performance assessment results with synthetic data

Table 3 summarizes the main results. Our method exhibits exceptional accuracy for all the configurations tested. Interestingly, accuracy decreases as the number of features $m$ decreases and the sample size $n$ increases.

Validation with real datasets
In this section, we study the performance of the proposed method using two real-world datasets, which contain OTU count tables obtained through 16S rRNA gene sequencing data from microbiota experiments. We also compare the classification accuracy of our method with those of two state-of-the-art methods: SVM [39] and SVM-RFE [41].

Datasets
The first dataset is from the American Gut Project (AGP) [51], one of the largest crowd-funded microbiome research projects. The second dataset is from the Greengenes (GG) database [52], created with the PhyloChip 16S rRNA microarray. For the comparison experiment, only fractions of the datasets were used. In particular, a total of 578 samples and 127 features comprised the AGP dataset, while 500 samples and 26 features comprised the GG dataset. In both datasets, 50% of the samples correspond to cases.

Validation scenarios results
Datasets were preprocessed as described in section "Data pretreatment". Then, the proposed method, as well as the SVM and SVM-RFE methods, were applied after separating each whole dataset into training, testing, and validation sets using 70%, 20%, and 10% of the data, respectively. For the SVM-RFE method, the number of features to select was taken from $\{5, 10, 15, m/2\}$, where $m$ is the total number of features, and the average of the results was calculated. The tuning parameters used for the SVM and SVM-RFE methods were C = 1 and γ = 0.05, where C trades off the correct classification of training examples against the maximization of the decision function's margin, and γ defines how far the influence of a single training example reaches. Table 4 shows the main results. For the AGP dataset, SVM is the least accurate, and SVM-RFE has the highest accuracy. This latter result is mostly due to the strengths of SVM combined with the SVM-RFE method's ability to eliminate variables that are not highly relevant in the data. Interestingly, our method outperforms SVM and is a close competitor of SVM-RFE.
For the GG dataset, although the number of variables is small, the SVM-RFE and our proposed methods showed accuracy values above 90%, while the accuracy of the SVM method is below this threshold. It is worth highlighting that, for this dataset, our method outperforms both the SVM and SVM-RFE methods. This result is attributable to the artificial dimensional reduction conducted to balance the natural weights when the number of samples is greater than the number of variables. Figure 5 provides a graphical illustration of the proposed method's classification outcome for both real datasets used for validation, i.e., the AGP and the GG.
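For reference, the two baseline classifiers can be reproduced with scikit-learn; the data below are a synthetic stand-in for a pretreated OTU table (the real AGP/GG tables are not bundled here), and the tuning parameters match those quoted above:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 samples x 26 features, class signal in 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 26))
y = rng.integers(0, 2, size=200)
X[y == 1, :3] += 1.5

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline SVM with the tuning parameters used in the comparison.
svm = SVC(C=1, gamma=0.05).fit(X_tr, y_tr)

# SVM-RFE: recursively drop features ranked by a linear-kernel SVM,
# keeping 10 of the 26 features, then refit the RBF SVM on the survivors.
rfe = RFE(SVC(kernel="linear", C=1), n_features_to_select=10).fit(X_tr, y_tr)
svm_rfe = SVC(C=1, gamma=0.05).fit(X_tr[:, rfe.support_], y_tr)

print(svm.score(X_te, y_te), svm_rfe.score(X_te[:, rfe.support_], y_te))
```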

Discussion and conclusions
The ability to characterize populations of patients, species, or biological features, usually comprising a large number of variables, and to use the extracted characteristics to classify new samples into one of such populations' categories, is a relevant tool for biological and medical studies. When the data describing these populations are compositional, further limitations and challenges arise.
Here, we proposed a new method to classify samples into one of two previously known categories. The method uses a new metric developed to quantify the overall correlation structure deviation between two datasets and a new dimensionality reduction technique. Although we illustrated the usefulness of our proposal with compositional data, its application is by no means limited to data of this nature. In fact, when the data are not compositional, the centered log-ratio transformation and the zero-replacement algorithm are simply not applied.
Validation with synthetic data showed that the proposed method achieves accuracy values above 98%. Moreover, a comparison of the performance of our method with that of SVM and SVM-RFE (i.e., two state-of-the-art classification techniques), using two real-world datasets from 16S rRNA sequencing experiments, showed that our method outperforms the SVM method in both datasets, outperforms the SVM-RFE method in the GG dataset, and is a close competitor of the SVM-RFE method in the AGP dataset.
Future studies may address the ability of our proposed method to perform accurately for a broader range of dimensions (numbers of variables and samples) and assess its performance in more scenarios of dissimilar correlation structures beyond that of $\rho_c = 0.1$ and $\rho_v = 0.2$. Moreover, our method may be extended to multi-category classification, and a performance assessment may be conducted to test its classification accuracy in non-binary scenarios.