A biplot correlation range for group-wise metabolite selection in mass spectrometry

Background Analytic methods are available to acquire extensive metabolic information in a cost-effective manner for personalized medicine, yet disease risk and diagnosis mostly rely upon individual biomarkers based on statistical principles of false discovery rate and correlation. Due to functional redundancies and multiple layers of regulation in complex biologic systems, individual biomarkers, while useful, are inherently limited in disease characterization. Data reduction and discriminant analysis tools such as principal component analysis (PCA), partial least squares (PLS), or orthogonal PLS (O-PLS) provide approaches to separate the metabolic phenotypes, but do not offer a statistical basis for selection of group-wise metabolites as contributors to metabolic phenotypes. Methods We present a dimensionality-reduction based approach termed ‘biplot correlation range (BCR)’ that uses biplot correlation analysis with direct orthogonal signal correction and PLS to provide the group-wise selection of metabolic markers contributing to metabolic phenotypes. Results Using a simulated multiple-layer system that often arises in complex biologic systems, we show the feasibility and superiority of the proposed approach in comparison of existing approaches based on false discovery rate and correlation. To demonstrate the proposed method in a real-life dataset, we used LC-MS based metabolomics to determine spectrum of metabolites present in liver mitochondria from wild-type (WT) mice and thioredoxin-2 transgenic (TG) mice. We select discriminatory variables in terms of increased score in the direction of class identity using BCR. The results show that BCR provides means to identify metabolites contributing to class separation in a manner that a statistical method by false discovery rate or statistical total correlation spectroscopy can hardly find in complex data analysis for predictive health and personalized medicine. Electronic supplementary material The online version of this article (10.1186/s13040-019-0191-2) contains supplementary material, which is available to authorized users.


Introduction
Contemporary analytic methods, such as liquid chromatography-mass spectrometry (LC-MS) [1,2], gas chromatography-mass spectrometry (GC-MS) [3,4], and proton nuclear magnetic resonance (1H NMR) spectroscopy [5,6], provide information-rich data sets that can be of substantial value in biomedical research and, in principle, can be developed with bioinformatics procedures for routine healthcare [7][8][9]. Challenges in clinical use exist at two levels, reliable extraction of metabolic features from spectroscopic data and reliable identification of metabolic features associated with health characteristics. Substantial progress has been made in data extraction, with several high-quality routines available. For instance, recent introduction of adaptive processing by apLCMS [10] provides a systematic approach to reduce noise and extract relative quantification of > 7000 metabolic features in 50 aliquots of human plasma in 20 min (2); current improvements in data processing have demonstrated that > 12,000 metabolic features can be extracted [11]. This high volume of information, which is inherently multivariate, presents challenges to reliable use in health prediction and disease management.
Statistical methods based upon the principles of false discovery rate (FDR) are available to correct for large numbers of comparisons in multiple hypothesis testing of metabolomics data [12]. These methods are useful to identify potential biomarkers associated with disease or disease risk while controlling the expected proportion of incorrectly rejected null hypotheses (type-I error). This approach is effective because it yields single biomarker candidates that can be rigorously tested and directly used in health research and clinical practice.
Individual biomarkers, however, can be of limited value in practical use. For instance, biomarkers with a relatively good specificity (e.g., 0.9) and sensitivity (e.g., 0.9) still result in large numbers of misclassifications, i.e., one diagnosis in ten will be wrong and one in ten will be missed. While several factors can contribute, a central limitation is that statistical procedures examining individual variables do not consider how variables interact and combine. In complex pathobiology, individuals with the same genetic mutation have different disease phenotypes, e.g., some patients with a sickle cell disease mutation have hemolytic crises while others with the same mutation have painful crises with bony infarcts, acute chest syndrome or only mild anemia [13]. At the molecular level, functional redundancies and multiple interacting levels of regulation within network structures result in second-order and higher order interactions that allow the same pathway to respond differently among individuals. Additionally, metabolic responses can be conditional because of genetic and epigenetic differences, as well as differences in diet, environment or health behaviors. For instance, decreased plasma cystine in response to zinc supplementation may not only be due to zinc-dependent effects on cystine uptake and conversion to glutathione by tissues [14,15], but also upon intestinal absorption, renal loss, rates of transcription of relevant regulatory systems and past exposures that alter epigenetic regulation [16,17]. Such complexity means that individual biomarkers can rarely, if ever, be universally useful. Consequently, statistical approaches equivalent to FDR, when conducting multiple comparisons, are needed to identify metabolites important in group-wise (e.g., metabolic pathway and network) behavior, thereby providing rigorous bases to include metabolic interactions within complex metabolic datasets for improved disease classification and health prediction.
In this study, we propose a general dimensionality-reduction based approach for potential biomarker selection in spectroscopic data, which we term 'biplot correlation range' (BCR). The approach uses loading vectors from principal component analysis (PCA), partial least squares (PLS, also called projection to latent structures), and orthogonal-signal-correction PLS (OPLS), to directly link to correlation analysis for group separation. The analysis determines a correlation range from scores for a group label on loading vectors rather than from individual correlations for each variable. The use of correlation range to describe how variables combine to form observable and discriminatory patterns is derived from established data reduction and multivariate techniques, (i.e., PCA, PLS, and OPLS), and methods to discover new variables describing otherwise hidden, lower-dimensional structure. Extracted representations are transformed into new data (score vectors) using a relatively small number of newly selected variables (loading vectors comprised of the original variable contributions). These new variables have improved power to discriminate samples linked to phenotype, such as pathological characteristic (Y as response variable). A specific advantage of this approach is that multiple components, none of which may be significantly associated with Y when evaluated individually by statistical tests such as FDR, can interact to discriminate Y using BCR.
In development of the proposed approach, we relied upon a commonly used graphical technique of matching score plots and loading plots, called as a biplot method [18]. Cloarec et al. used a two-step approach to facilitate biomarker detection in 1H NMR spectroscopy by graphically coupling a loading vector from OPLS and the correlation of each variable with response Y [19]. In the development of BCR, we similarly create a loading vector for each metabolite contributing to separation. A subsequent selection of metabolites with defined correlation interval (e.g., 95%) is used to determine metabolites related with defined classes. This allows individual metabolites contributing to separation to be visualized in respective loading plots, thereby providing a rigorously defined approach to identify metabolites contributing to group behavior. We explore whether BCR would determine a correlation range using scores and loadings in PCA, extending them to PLS and OPLS, and biomarkers for the purpose of discrimination analysis of mass spectral data from mitochondria isolated from wild-type (WT) mice and thioredoxin-2 (Trx2) TG mice. This study showed that BCR provides means to select metabolites contributing to class separation in a manner that can complement FDR in complex data analysis for predictive health and personalized medicine.

Methods
In this section, we introduce the theoretical background of a biplot and its interpretation from a correlation viewpoint with regard to the development of biplot correlation statistics.

Biplot
A biplot is constructed by using a dimensionality-reduction technique to obtain a low-dimensional approximation to a transformed version of a data matrix, X in size n × p, where n and p denote the number of samples (observations) and features (variables), respectively. The most popular dimensionality-reduction technique is singular value decomposition (SVD) which brings forth principal component analysis (PCA). Other techniques such as multidimensional scaling and partial least squares (PLS, also called projection to latent structures) are also available [20][21][22]. They, however, share the same spirit with SVD in a sense that the low-dimensional approximation of X often unravels hidden structures in X by maintaining inter-sample distances as much and capturing as much variation of X as possible.
For n centered p × 1 observations x = [x 1 ⋯x p ] T and its corresponding n × 1 response vector Y, data matrix X consists of x i , i = 1, … , n (all up to n samples): We find loading vectors a j of size p × 1, j = 1, … , p, and their associated score vectors t j of size n × 1. Using SVD, we exactly obtain loading vector a j by an eigenvector of the sample covariance matrix and score vector t j by Score vector t j corresponding to loading vector a j represents new coordinates of n data on the axis of a j . The mth component [a j ] m describes the amount of contribution of the original (before-transformation) mth variable to the construction of new axis a j .
Simply speaking, a larger [a j ] m value is associated with more weight for the mth variable in new axis a j . In practical use, before the application of dimensionality reduction, one could apply unit variance scaling for each column in X to provide all variables an equal weight. This scaling step, optional, depends on the domain characteristic of the data. For example, many biologic processes are determined by high abundance components, and because of this, variance scaling can sometimes result in loss of useful information by decreasing the contribution of more relevant, high-abundance variables and increasing contribution of non-relevant, low abundance variables.
We order loading vectors, a 1 , …, a p , according to their associated eigenvalues and p score vectors, t 1 , …, t p will follow accordingly. Then we rewrite the data matrix X as X ¼ P p j¼1 t j a T j : A biplot is formed by the first two dominant terms from two scatterplots of ([t 1 ] i , [t 2 ] i ) for i = 1, …, n, and ([a 1 ] m , [a 2 ] m ), denoted by a ! m , for m = 1, …, p that share a common set of axes. In essence, excluding the constant term, the sample covariance matrix approximates to X T X≅a 1 t T 1 t 1 a T 1 þ⋯þa k t T k t k a T k . Fig. 1 (a) shows an exemplary biplot that has the simplified combination of a principal component score plot by '+' markers and a principal component loading plot by 'o' markers.

Biplot correlation range
We present an interpretation of the direction and magnitude in a biplot in terms of correlation among variables and relate it to a procedure for biplot based discrimination analysis. Using the approximation of the sample covariance matrix in the construction of a biplot, the sample covariance between the qth and rth variables, d cov ðx q ; x r Þ , is given by where 〈•, •〉 represents an inner product with a weight vector ½t T 1 t 1 t T 2 t 2 :It implies that the inner product between a ! q and a ! r corresponds to a covariance measure between the two variables. Observe that a ! q and a ! r are shown as two loading vectors in a biplot. Then, given loading vector a ! q ¼ð½a 1 q ; ½a 2 q Þ, it is straightforward that the direction of a ! r should be the same as that of a ! q to maximize where θ is the angle between the two vectors and the equality holds true for θ = 0. In specific, the cosine of the angle between a ! q and a ! r is related to a correlation measure, Overall, the direction of a ! j for the jth variable linking to correlation of variables is the direction to which the variable contributes in increasing scores on new axes a 1 and a 2 . Similarly, the magnitude of a ! j is associated with the variable's contribution in a b Fig. 1 a The combination of a principal component score plot and a principal component loading plot, called a biplot, is used to illustrate the concept of biplot correlation range. Based on the "+" score plot values and the associated ellipse (broken line), a 95% confidence interval is calculated for projection onto the corresponding loading plot (small circles). Filled red circles contribute to scores within the 95% correlation range, while the thick-lined ellipse represents the top 5% of features contributing to the "+" class. b The flow of the decomposition of input data X magnitude of the score increasing, linking to covariance of variables. For the sake of convenience, a ! j is used as the contributing direction of the jth variable. When dimensionality-reduction techniques such as PLS and OPLS other than SVD approximate data matrix X in a similar fashion, the BCR approach is applied similarly using their loading vectors. Among numerous dimensionality-reduction techniques implemented and tested, we choose the combination of direct signal correction and PLS to obtain components able to separate response vector Y with interpretability [23]. The flow of the decomposition is shown in Fig. 1 (b). We extract a signal score vector, t (ortho) , orthogonal to response vector Y with maximum variance in data matrix X. In specific, forŶ, the projection of Y on the column space of X, we numerically obtain the eigenvector with the largest eigenvalue, set to be t (ortho) , in a subspace of X orthogonal toŶ , ðI−ŶðŶ TŶ Þ −1Ŷ T ÞX; and then compute its loading vector a ðorthoÞ ¼X T t ðorthoÞ ðt T ðorthoÞ t ðorthoÞ Þ −1 . We consider t ðorthoÞ a T ðorthoÞ as a residual in X in the task of explaining Y. Then we perform PLS with input data as X−t ðorthoÞ a T ðorthoÞ , corrected by the Y-orthogonal signal in X, and output data Y, obtaining its first score vector, t (pred) , and loading vector, a (pred) . We note that the use of direct signal correction in X by the first orthogonal component beforehand helps the first component of PLS to effectively capture Y-separating patterns in X and produces interpretability when using the two components. Finally, we approximate data matrix X into one orthogonal component and another predictive component as follows: Score and loading vectors in the above decomposition retain the same interpretation as in PCA. We notice that the decomposition without an orthogonal component is equivalent to PLS and that numerous orthogonal and PLS components are possible in addition to various feature scaling methods. Typically, some kind of cross validation in combination with the domain characteristic of features is used to optimize the separability of the whole procedure.
Then, using the properties of loading vectors as discussed above, variables contributing significantly to a certain group label can be identified as in Fig. 1. First, we construct a biplot as described before and collect the scores of samples belonging to a certain group label. Often score vectors are scaled appropriately so that they may be placed outside loading vectors for the sake of clarity. By default, we multiplied the scores by 0.001. As the next step shows, since we consider loading vectors among themselves using their directions and magnitudes, such scaling does not matter. Then we use a 95% confidence interval fitting the scores belonging to a certain group with a multivariate normal distribution, forming a correlation range based on the angle interpretation in biplots. In Fig. 1, it is illustrated by the '+' markers representing scores of samples belonging to a group and the ellipsoid with a broken line representing a 95% confidence interval for those scores. Next, variable j corresponding to direction a ! j that contributes to increasing scores within the 95% confidence interval range (edge bordered by diagonal green lines in Fig. 1), are collected. The filled red circles in Fig. 1 illustrate collected variables. The two (green) diagonal solid lines that border on the 95% confidence interval represent a biplot correlation range (BCR). We notice that the correlation range is invariant to the scaling of score vectors in consideration of the diagonal lines from the zero. Finally, the top 5% (τ) of variables in magnitude of contributing direction a ! j are selected. Note that the stringency can be increased by use of top 1% or 0.5% of such variables. These features, illustrated as filled circles in the solid ellipsoid, greatly contribute to the group label in a covariance (magnitude) sense. We perform the above procedures with scores of each group label, generating selected features per group label, as in Figs. 3(a) and 5(a), and finally obtaining the union of them. Since the selected features are treated equally as long as they are within the top 5% criterion, post-analysis such as ordering them by correlation, covariance, p-values, or variable importance projection (VIP) values will be possible [24]. By default, we filter out variables of which individual regression performance measures with response Y are weak. Practically, we test if the p-value of a logistic regression model with response Y and the raw values of each individual variable x i is greater than 0.10, without controlling familywise error rate or false discovery rate, to deem features of weak separability. Depending on the nature of variables and the problem domain, one could adopt other performance measures and tests such as Pearson correlation, Spearman's rank correlation, p values of linear regression, and classification accuracy of logistic regression. This step eliminates unnecessary noisy features that act as contributing features in a collective sense from the previous step. We repeat the above steps for each group label, and the outcomes of these steps are lists of greatly contributing features for each group label.
This BCR approach is based on the graphical use of a biplot in that scores and loadings are used collectively. BCR, however, enables discrimination analysis and a feature selection procedure using the interpretation of loading vectors while the biplot approach provides only a graphical presentation of scores and loadings. We note that BCR relies upon the approximation of data matrix X≅t 1 a T 1 þt 2 a T 2 using orthogonal and predictive components and the statistical properties of loading vector a ! j and score vector t ! j and it provides a structured approach to select features collectively significant.

Results
To test the BCR approach, we performed a simulation study and compared it with some existing methods. Then, we applied it to a real-life example.

Simulation and comparisons
We first generated a data matrix X (200 × 1000) and a response vector Y (200 × 1), comprising 200 samples and 1000 variables. Each element of Y was a Bernoulli random variable with a success probability of 0.4, so 16 elements of Y were set to 1 on average, while the remaining 24 elements were set to 0. To generate 1000 variables, we used a three-layer network structure shown in Fig. 2.
In the first layer, the first 30 variables were generated to have high separation in Y: for p = 1, …, 4, for p = 5, …, 8, where U(a, b) is a random variable from a uniform distribution of range a and b. It means variables from x 1 to x 4 individually had a high positive correlation with Y label 0, whereas variables x 5 to x 8 highly correlated with Y label 1. In specific, the correlation between each of variables from x 1 to x 4 and Y was 0.9576 ± 0.0075 (average ± standard deviation) throughout the simulation, and that between each of variables x 5 to x 8 and Y was 0.9577 ± 0.0075. Figure 2(b) shows plots of realizations of x 1 and x 5 and the aforementioned pattern that clearly and individually separate Y. The first eight variables in the first layer represent strong individual variables that should be used to identify pathological conditions. For p = 9, 12, …, 18, and for p = 19, 22, 25, 28, where A¼ 1 0:4 0:4 0:4  (d) Three variables x 9 , x 10 , and x 11 in Layer 1, jointly separate the two labels of response Y of x 9 and x 10 that collectively separate Y. Figure 2(d) also illustrates that realizations of x 28 , x 29 and x 30 , generated as above, clearly and jointly discriminate Y. The variables x i , i = 9, …, 30, in the first layer represent strong group-wise variables that clearly discriminate pathological conditions. The next 90 variables from x 31 to x 120 in the second layer and the next 270 variables from x 121 to x 390 in the third layer were generated so that the variables contribute to the overall response Y in a composite and aggregate manner. For instance, x 1 in the first layer is clearly separable by the combination of x 31 , x 32 and x 33 , in the second layer. In specific, the generation of the three variables is based on the value of x 1 so that the sum of the three will be close to x 1 as follows: given x 1 , we independently generate u 1 , u 2 and u 3 from U(0,1) and ϵ from, Nð0; x 1 10 Þ. Then, we set which brings x 31 + x 32 + x 33 = x 1 + ϵ. The correlation between the sum of the three and x 1 was 0.9730 ± 0.0085 throughout the simulation, and the generation method was similarly applied to the according matches in Fig. 2 (a). The remaining 610 variables from x 391 to x 1000 , comprising a noise layer, were randomly and independently generated from N(0, 1) to avoid a strong correlation with Y in order to simulate inherent noise. To better simulate the effect of randomness in noise, the p-value of a linearregression fitting of Y with each x i from the 610 variables was controlled so that pvalue ≥δ i , where δ i was set to 0, 0.03, 0.05, or 0.10. The condition, δ i =0, represents that the variables are completely random noise and some of them can be significant, and the condition, δ i =0.10, represents the generated variables are deemed insignificant by significance level 0.10. The simulated three-layer structure is an example of multiple layers of metabolic interactions and regulation in complex biological systems. For instance, a biological system for nutritional metabolomics reflects such a layered structure with linked transports [25]. To further simulate biological systems with fewer biomarkers, we also used the two-layer structure of layers 1 and 2 only, in which the remaining 880 variables were simulated noise, and the one-layer structure of only layer 1, in which the remaining 930 variables were simulated noise. To compare them with random-noise systems as a baseline performance, we also used a structure, denoted by noise layer structure, in which the total 1000 variables are simulated noise. The variables as simulated noise were generated as described above by varying δ i . For comparison, we examined the ability of statistical total correlation spectroscopy coupled with OPLS (STOCSYO) [19] and false discovery rate (FDR) methods to detect the known variables from layers 1, 2, and 3. Statistical total correlation spectroscopy is an analysis method for aiding the identification of potential biomarkers in metabolomics studies by displaying the correlation among the intensities of the various peaks among the whole sample, and its combination with OPLS discriminant analysis, in particular, offers a powerful framework for selecting important variables [26,27]. We used two versions of FDR using p-values from t-tests with two unpaired sets of x i values when Y=0 and those when Y=1; one is a classic one (FDR1) by Benjamini and Hochberg [12] and the other one (FDR2), by Benjamini and Yekutieli, considers multiple testing under dependency [28]. We also mention that another version of FDR using p-values from logistic regression was tested but that no variables are found throughout the experiments. Thus, we dropped logistic regression based FDR in the following results. We chose STOCSYO as a representative method using correlation measures and FDR as using p-values. The BCR approach used the first orthogonal and the first predictive components from OPLS. The selection of important variables in a b Fig. 3 This figure illustrates feature selection using BCS for a simulated data set from the model in Fig. 2. a A biplot for the first orthogonal and predictive components is shown, clearing separating the two labels of response Y with a 95% correlation range for each label. b The corresponding loading plot, a zoomed-in version of the dotted-line rectangle in (a), is shown; the top 5% significantly contributing variables found from Layer 1, 2, and 3 are shown as red squares, blue diamonds, and black triangles, respectively. The eight strong variables (x 1 to x 8 ) in the plot are clearly distinguishable and positioned accordingly to the labels. For example, the position of x 1 is aligned in the direction of label 0, consistent with the behavior of x 1 in Fig. 2(b) STOCSYO was set as the cut-off value of a correlation coefficient corresponding to significance levels varying from 1 to 20% [19,26,29]. The top percentile (τ) in BCR and the q-value for FDR also varied from 1 to 20%. We note that the adopted levels for the methods are not strictly comparable metrics by themselves, yet we compare them in that they are used in practice to adjust the number of selected variables. Figure 3 (a) and (b) show the scores and loadings, respectively, for the BCR method using a multiple-layer data set: in the loading plot, red squares, blue diamonds, and black triangles represent variables that are deemed to be greatly contributing from layers 1, 2, and 3, respectively. We note that the first eight variables (x 1 to x 8 ) were correctly found, and several additional variables from layers 2 and 3 were also detected. We notice that the loadings of the eight strong variables (x 1 to x 8 ) are greatly larger than those of others, for examples, layer 2 variables (x 9 to x 30 ) in the predictive component axis corresponding to loading vector a (pred) as shown in Fig. 3 (b). It is understandable in view of that the loading vector in PLS is obtained as slope coefficients to predict X corrected by the Y-orthogonal signal, X−t ðorthoÞ a T ðorthoÞ , by the score vector. We notice that the PLS score vector is calculated on the direction which maximizes the covariance between X and Y. It is worth mentioning that the first eight variables were positioned accordingly to the labels. The position of x 1 , for example, was aligned to the direction of label 0, in accordance with the behavior of x 1 in Fig. 2 (b). This implies that an increase in x 1 results in an increase in label 0.
We repeated this test 1000 times for each of the three kinds of layer structures. In each repetition for each method, we counted the numbers of distinguishing variables that were correctly found within the three layers. None of the noisy variables was selected by any of the methods for the three layers structures when δ i =0.10 while some noisy variables were selected for the other δ i conditions. The averaged numbers found by the various methods in the three-layer, two-layer, one-layer, and noise-layer structure are presented in Tables 1, 2, 3, and 4, respectively. For the three-layer structure as shown in Table 1, the BCR method consistently found more variables in all the layers than STOCSYO and FDR. In layer 1, BCR outperformed STOCSYO and FDR for all levels and all noise conditions except level 0.01 with regard to the number of variables found. We also notice that the tested methods managed to find significant variables in layer 1, which is reasonable in that the methods are able to identify strong single variables. In layers 2 and 3, we observe the BCR method found more variables than the other methods. This result is understandable because BCR looks for a combination of variables rather than a single variable to separate labels, while STOCSYO emphasizes individual correlation-wise weights and FDR focuses on the effect of an outstanding single variable. In the noise layers, the BCR and STOC methods found more noise variables than the other methods for δ i = 0, and the BCR only found noise variables for δ i = 0.05. No noise variables were found for all the methods when δ i = 0.10. This result implies that the BCR method tends to identify noise variables, possibly leading to false positives, when the randomness in noise increases. To validate the identified noise variables are false positives in discrimination analysis, we performed logistic regression for Y using the detected noise variables only. The p-values and classification accuracy are shown in Table 5 for the three-layer structure. Clearly, the logistic regression models are quite much significant with the p-values close to zero, and the classification Table 1 Average numbers of variables found in the simulation study for the three-layer structure  Tables S3-S5, respectively, show the similar result as in Table 5 for the detected noise variables. The averaged numbers found by the various methods in the two-layer structure are presented in Table 5; For all levels, BCR still found more variables in layer 1 than STOCSYO and FDR. The BCR method also found more variables in layer 2 than STOCSYO and FDR. We also observe STOCSYO also detected more variables in layer 2 than FDR. The averaged numbers found by the various methods in the one-layer structure are presented in Table 1; BCR outperformed STOCSYO and FDR for all levels with regard to the number of variables found. The average number of variables in each layer filtered out by the BCS method in the layer structures along with the averaged Table 2 Average numbers of variables found in the simulation study for the two-layer structure  p-values are presented in Additional files 4 and 5: Tables S6 and S7 according to the noise condition and tested level. Clearly, the number of the filtered variables increases as the noise conditions and levels increase. For example, the number of the filtered variables for levels ≤0.05 is quite much less than those for levels >0.05.It also increases as the structure moves from the three layers to the noise layers, which means increasing randomness. Consistently, most of the filtered variables appeared in the last noise layer, which practically demonstrates the use of the filtering step. For example, in the two-layer structure with noise condition 0 and level 0.10, the average number of the filtered variables in the noise layer is 64.69 while that in layer 1 is 0.52 and that in layer 2 is 10.18. Though small in number, the filtered variables in layer 1 partly explains the BCR never finds all 30 variables in layer 1. Additionally, Fig. 4 and figures in Additional file 6: Figure S1 show the number of the selected variables by the four tested methods during the 1000 iterations in noise  , the BCS method found variables, x 9 to x 30 , in layer 1, as well as variables in layers 2 and 3, more frequently than the others. The tendency, however, weakens as the layer structure moves from the three-layer one to the one-layer one. Overall the FDR methods are strict in capturing variables, STOCSYO remains in between FDR and BCS, and BCS finds variables not only individually strong but also collectively separating.

Application to real-life biological data
To apply the proposed method, we examined a high-resolution metabolomics data set from a recent study of mitochondrial metabolomics of thioredoxin-2-overexpressing Table 4 Average numbers of variables found in the simulation study for the noise-layer structure  [30]. Thioredoxin (Trx2) is a small protein that regulates reduction-oxidation balance. The chosen dataset comprised anion exchange-high-resolution mass spectrometry data of mitochondria from 18 WT and 19 TG mice. Metabolic data were extracted from mass spectral analyses using apLCMS [10] and comprised high-resolution m/z features defined by m/z, retention time, and intensity. Each sample was analyzed in duplicate, and data for duplicates were averaged. Features with ≥30% missing values were excluded, resulting in 677 features for each sample. The included missing values were replaced by zero since no noticeable peaks at the m/z features were found as in [31]. Comparison of WT and TG data using FDR at q = 0.05 or q = 0.2 resulted in no features being detected as different. Similarly, STOCSYO detected no significant features. Application of BCR to identify features contributing to the separation of WT and TG mitochondria by the first orthogonal and the first predictive components from OPLS resulted in the identification of 64 features, as shown in Fig. 5 (See also Additional file 7: Table S1).
As post-analysis we annotated the selected metabolites using the Metlin mass spectrometry database [32]. To determine the associated pathobiology, we applied KEGG (The Kyoto Encyclopedia of Genes and Genomes)-database pathway analysis [33]. Of the 64 features identified by BCR, the 45 had a variable importance projection score ≥ 1 [24]. These 45 discriminatory features were annotated in the Metlin database using   Table S2). Twelve of the 45 features were mapped using KEGG Mus musculus pathway analysis as shown in Fig.6. More than two-thirds of the features did not match metabolites in the KEGG database, being considered as false positives in practice from the KEGG-database viewpoint. Phosphatidylcholine  mice [34]. In addition, Al-Orf found that excess and persistent intake of oxidized phosphatidylcholine caused significant damage to organs in male Wistar albino rats [35]. Thioredoxin overexpression in mice has been shown to attenuate oxidative stress [36]. a b Fig. 5 BCS analysis using orthogonal signal correction of high-resolution metabolomic data from liver mitochondria from thioredoxin-2 transgenic (TG) mice and wild-type (WT) littermate controls. a A two-dimensional score plot of orthogonal signal correction shows the first predictive component as a function of the first orthogonal component. b Corresponding loading plot, a zoomed-in version of the dotted-line rectangle of (a), with the top 5% of features to the separation of TG and WT metabolic profiles within the 95% correlation range contributing. A list of the 64 discriminatory m/z features identified is provided in Additional file 7: Table S1 The discovery of discriminating metabolites related to sphingosine was unanticipated but reasonable in terms of what is known about ceramide metabolism. Ceramide is an endogenous mediator of apoptotic cell death. For example, when the intracellular concentration of ceramide is elevated under oxidative stress, cellular proliferation is inhibited, and cellular apoptosis is induced [37]. Ceramide is synthesized at the endoplasmic reticulum from palmitoyl-CoA and serine, resulting in 3-ketosphinganine. The enzyme 3-ketosphinganine reductase generates sphinganine from 3-ketosphinganine. Sphinganine is acylated to dihydroceramide by sphinganine N-acyl-transferase. Finally, dihydroceramide is converted to ceramide by the activity of the dihydroceramide desaturase [38][39][40]. In this study, we observed a reduced amount of 3-ketosphinganine (300.28 m/z) in Trx2-overexpressing TG mice, suggesting that Trx2 decreases levels of 3-ketosphinganine, thereby conferring protection against apoptosis ( Fig. 7 (b)). Thus, the discrimination of WT and TG mitochondria by 3-ketospingosine is consistent with available data on mitochondria, ceramide metabolism, and Trx2 protection against apoptosis signaling.
The discrimination of WT and TG mitochondria by guanosine monophosphate (GMP) at 364.06 m/z is also reasonable because GMP increases antioxidant function and attenuates oxidant cell death [41,42]. Consistent with the anti-apoptotic effect of GMP, we observed increased GMP in Trx2 TG mice compared to that in WT mice, providing important evidence of overexpression of Trx2 (Fig. 7 (c)). The discrimination of WT and TG mitochondria by GMP is consistent with available data on mitochondria, the anti-apoptotic effect of GMP on oxidative stress, and Trx2 protection against apoptosis signaling.  6 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of the 45 discriminatory features identified using the BCS approach. Forty-five discriminatory features are mapped onto 12 Mus musculus (mouse) KEGG metabolites (black dots). Those are involved in phosphotidylcholine, sphingosine, cysteine, methionine, and purine metabolism. The distributions suggest that multiple factors associated with Trx overexpression can be conceptualized using the group-wise feature selection provided by the BCS approach Several methods currently exist to identify metabolites that are significantly different according to sample classification based upon the principles of FDR. Comparable methods to test for group behavior of metabolites in sample classification, however, do not exist. Data reduction methods are available to reduce a large number of variables into a smaller set of variables; this allows separation of classes according to the group behavior of metabolites. Previous graphical methods have resulted in identification of individual metabolites that contribute to group behavior; however, no criteria for inclusion or exclusion of metabolites were provided. Our newly developed method, BCR, uses statistical criteria for selection of metabolites contributing to group behavior. Evaluation of its performance with both simulated data and real data demonstrated its utility. The BCR method employs statistical principles to select variables that contribute group-wise to class discrimination. The method allows reproducible selection of metabolites that contribute to class separation, thereby facilitating practical developments in metabolomics research. Application of BCR could, in principle, provide a simple means to detect group-wise behavior of metabolites connected to different pathways and metabolic networks.

Conclusions
We developed a dimensionality-reduction based approach termed a biplot correlation range that improves reliability of selection of metabolites contributing to group behavior for use in metabolic profiling applications for personalized medicine. Original variable interactions were used to assign scores according to group identity, and statistical principles were used to select variables in terms of increased score in the direction of a a b c d Fig. 7 The relative concentrations of mitochondrial metabolites in WT and TG (a) phosphocholine, (b) 3ketosphinganin, (c) guanosine monophosphate, (d) Phosphotidyl choline C (18:2) group identity within a correlation range. Testing by simulation and application to real data showed that this method improved selection of variables collectively responsible for group behavior. By providing a statistical basis differently from FDR and OPLS-coupled STOCSY approaches, the proposed method can reveal important metabolites that contribute to group behavior for analysis of complex metabolic data sets. As a future research direction, more rigorous add-on analysis of selected important metabolites such as the calculation of p-values by cross validation, sensitivity analysis of selection of components, and systemic post-analysis are in need of investigation.