 Methodology
 Open access
 Published:
Feature selection using distributions of orthogonal PLS regression vectors in spectral data
BioData Mining volume 14, Article number: 7 (2021)
Abstract
Feature selection, which is important for successful analysis of chemometric data, aims to produce parsimonious and predictive models. Partial least squares (PLS) regression is one of the main methods in chemometrics for analyzing multivariate data with input X and response Y by modeling the covariance structure in the X and Y spaces. Recently, orthogonal projections to latent structures (OPLS) has been widely used in processing multivariate data because OPLS improves the interpretability of PLS models by removing systematic variation in the X space not correlated to Y. The purpose of this paper is to present a feature selection method of multivariate data through orthogonal PLS regression (OPLSR), which combines orthogonal signal correction with PLS. The presented method generates empirical distributions of features effects upon Y in OPLSR vectors via permutation tests and examines the significance of the effects of the input features on Y. We show the performance of the proposed method using a simulation study in which a threelayer network structure exists in compared with the false discovery rate method. To demonstrate this method, we apply it to both reallife NIR spectra data and mass spectrometry data.
Introduction
Feature selection is a technique to select a subset of variables which are useful in predicting target responses. From a machine learning viewpoint, irrelevant features in a prediction model deteriorate its generalization ability, and it is critical to remove such redundant features to keep the model from being misled by inappropriate information. The task of feature selection, one of the central tasks in machine learning, helps to reduce overfitting by eliminating redundant features. The prevalence of highdimensional data becomes a challenge for researchers to perform feature selection [1]. In biological fields, particulary, high dimensionality often arises, and irrelevant and redundant features make up a high proportion of the total data [2].
Indeed, contemporary analytic methods such as nearinfrared (NIR), proton nuclear magnetic resonance (^{1}H NMR) spectroscopy, liquid chromatographymass spectrometry (LCMS), and gas chromatographymass spectrometry (GCMS) provide highdimensional data sets in which the number of features is usually larger than the number of observations. Those spectral data sets, denoted by the input variables (feature) X, are an effective alternative to using classical chemical analyses in screening experiments [3]. They can reveal the underlying patterns associated with health characteristics, the socalled phenotypes. These include pathological characteristics, denoted by the response variables Y, and thus can be of substantial value in biomedical research [4]. To this end, reliable identification of features associated with response characteristics is important.
Uncovering hidden patterns associated with the response variables in spectral data sets is not trivial. One of the major problems in addressing this issue is how to deal with the existence of spectral collinearity. Spectral collinearity, originating from linear dependence among the input variables X, poses illconditioned linear equation systems. Thus, standard regression methods cannot be applied in this context, so different strategies need to be adopted. The methods usually employed for solving such problems are artificial neural networks, knearest neighbors, principle component analysis (PCA), and partial least squares (or projections to latent structures, PLS) among others. A detailed discussion of multivariate statistical methods in spectral data sets can be found in Holmes and Antti [5] and Lindon et al. [6].
Among these, unsupervised PCA and supervised PLS together with regression have been widely applied. These methods are useful for reducing the complexity in the feature space, the main idea is to find a low dimensional representation while retaining as much of the variation as possible. Compared to other methods, they are not only easy to interpret but also effective in explaining the interrelationships among spectral data by examining the variance levels of spectral data while being less computationally demanding. Basically, they produce a few transformed scores (positions for new directions) for the original input variables X to reduce the complexity of such data and retain most of the variational information.
For the task of identifying significant variables, however, produced scores by PCA often undergo lack of discrimination power since it only focuses on new directions or loadings, accounting for the maximum variation of X, and then projects the original input variables onto the new directions [7]. On the other hand, PLS is a type of regression model used to find the relationships between the response variables and input variables based on the assumption that they are generated by a common set of underlying factors [8]. That is to say, it finds the directions in the space of X that explains the maximum of variation of the space Y. By reducing the collinearity between input variables and increasing covariance between input and response variables at the same time, feature selection by PLS can result in a more parsimonious model without losing its predictive ability.
The task of identifying significant features via PLS faces a few challenges. The selection of original spectral variables from those transformed scores is nontrivial because they are represented as linear combinations of a large number of the original input variables. PLS loadings express the weights for these linear combinations and generate PLS regression vectors with estimated regression coefficients. The uncertainty in the estimates of PLS loadings and scores in conjunction with PLS regression vectors complicates this task. Since error in a few original input variables are propagated in all transformed scores by way of loadings, the uncertainty in PLS regression vectors increases accordingly [9, 10]. Additionally, the absence of closedform distributions of PLS loadings also makes the task challenging. To cope with these challenges in PLSbased variable selection, several statistical approaches have been suggested. Heise applied crossvalidation with a strategy of expanding neighboring variables [11]. Høskuldsson employed the stepwise regression approach based on the goodness of fit criterion [12]. Faber proposed a resampling technique [13], focusing on variables interval selection. These approaches, though known to be unbiased [14], inevitably suffer from uncertainty under a substantial number of features and often lead to the selection of unnecessary features. The presence of unnecessary features, causing overfitting, is a critical issue especially when the number of observations is less than that of features. It would be a benefit to apply a resampling method, able to regulate the selection of unnecessary features, to the identification of significant features in PLS.
This paper introduces a useful combined approach of applying orthogonal signal correction (OSC) and permutation tests to PLS for the purpose of feature selection. OSC, introduced by Wold et al. [15], removes from input variables only the part unrelated to the response variable. Extensions of OSC aiming to improve model prediction and interpretability were also presented [16, 17], sharing the same spirit in that X variation unrelated to the response is filtered out as a preprocessing step. Thus in the combined approach, first, OSC as a preprocessing step of PLS corrects X by removing systematic variation in the X space not correlated to Y. Second, we relate the orthogonalsignal corrected X to the PLS regression models, which we call OPLS models, and obtain OPLSinduced regression vectors. The regression vectors in OPLS models describe each variable’s contribution to the response Y while reflecting variables collectively on new directions. Lastly, by employing a permutation testing procedure, in which permutations of the input observations for a collection of randomly chosen features occurs, we test the significance of each coefficient in the regression vectors in a reasonably fast manner. The permutation test procedure yields empirical null distributions of weights for each individual variable’s effect on the response Y. The contribution of this paper is the integration of permutation test into feature selection of OPLS models combining the concepts of OSC and PLS. It then introduces the use and testing of OPLSR vectors for the purpose of feature selection, particularly in the domains of spectral data.
This paper is organized as follows. “Proposed method” section describes the combined method of orthogonal signal correction and PLS, manifesting the adopted permutation test procedure. It also includes a simulation study in which simulated data sets with a threelevel network structure are used and the performance of the proposed approach is demonstrated. The proposed approach is compared with two methods: one is the Lasso, penalized linear model, and the other is the false discovery rate (FDR) method which is based on the variables’ individual effects on the response Y and is widely used in spectral data analysis [18, 19]. “Experiments” section demonstrates the approach with a reallife NIR data set. Finally, “Conclusions” section concludes the paper.
Proposed method
We describe orthogonal signal correction and PLS to obtain OPLS regression vectors. Then we introduce a permutationbased test procedure to test the significance of each variable’s effect in OPLS regression vectors.
Orthogonal PLS regression vector
In the analysis of chemometric data using a multivariate regression model Y=XB+E, it is common that the first component accounting for the highest percentage of the input X variation constitutes only a low percentage of the response Y variation. The input X and response Y are assumed to be meancentered and properly normalized. In PCA, a principal component vector (or score vector) t is calculated by a linear combination of an input observation x_{j},j=1,…,n (row of X) and a loading vector p. Score vector t can be regarded as a transformed variable by the corresponding loading vector. Loading vector p is numerically found as an eigenvector of the sample covariance matrix X^{T}X/(n−1). PLS regression, however, focuses on the covariance structure between X and Y. In PLS regression, score vectors and loading vectors for X and Y are jointly and numerically obtained to maximize the covariance between X score vectors and Y score vectors [20].
The goal of orthogonal signal correction (OSC) is to remove one or more directions in X (n×p) that are unrelated, or more precisely orthogonal, to Y and to account for the largest variation in X as well. OSC often serves as a preprocessing step to improve the multivariate regression model [21]. Particularly, the PLS regression coefficients after OSC have stronger interpretability and clarity. In this paper, we chose direct orthogonal signal correction since it not only bears close relations with other OSC methods and works quite well with empirical data, but also uses only least squares methods without iterative computation [22]; this means it can be analytically connected with subsequent PLS. We summarize the basic steps as follows and link them to PLS to obtain orthogonalsignalcorrected PLS regression (OPLSR) vectors that will be used for feature selection.
The first step is to take the projection of Y onto X, \(\hat {Y} = \mathbf {P}_{X} Y\), where P_{X} represents a projection matrix to the column space of X, denoted by C(X). Furthermore, we write Y as follows:
where A_{X}=I−P_{X} represents a projection matrix to the space orthogonal to the column space of X. Accordingly, A_{X}Y is orthogonal to C(X): for v∈C(X),
For underdetermined systems, p>n, as is common in spectral data, \(\hat {Y}\) equals Y.
The second step is to decompose X into two orthogonal parts, one part that has the same range \(\hat {Y}\) and another that is orthogonal to it:
Thus, the space spanned by the columns of \(\mathbf {A}_{\hat {Y}}X\), which is essentially a subspace in X and orthogonal to \(\hat {Y}\), is a target of removal from X in the OSC procedure. The space \(\mathbf {A}_{\hat {Y}}X\) is also orthogonal to Y: for \(\mathbf {v} \in C(\mathbf {A}_{\hat {Y}}X)\),
since \(\hat {Y}\) is orthogonal to \(\mathbf {A}_{\hat {Y}}X\) by definition and v, also in C(X), is orthogonal to A_{X}Y by (1).
The third step is to find the first principal score vector t_{OSC} and the associated loading vector p_{OSC} from \(\mathbf {A}_{\hat {Y}}X\), so that \(\mathbf {t}_{OSC}\mathbf {p}^{\top }_{OSC}\) approximates to \(\mathbf {A}_{\hat {Y}}X\). Let X^{ORTH} be the approximation of \(\mathbf {A}_{\hat {Y}}X\) by \(\mathbf {t}_{OSC}\mathbf {p}^{\top }_{OSC}\): \(X^{ORTH}=\mathbf {t}_{OSC}\mathbf {p}^{\top }_{OSC}.\) The score vector t_{OSC}, a direction or an OSC component, accounts for the maximum variance of \(\mathbf {A}_{\hat {Y}}X\). Moreover, the score vector \(\mathbf {t}_{OSC} \in \mathbf {A}_{\hat {Y}}X\) is orthogonal to \(\hat {Y}\) by the definition of \(\mathbf {A}_{\hat {Y}}X\) and also orthogonal to Y by (2):
Being in C(X), the score vector t_{OSC} is expressed as a linear combination of the columns of X with a weight vector r_{OSC}:
where X^{†} is the MoorePenrose pseudoinverse of X. Overall, the orthogonalsignal corrected X preserving predictive components, denoted by X^{OSC}, writes:
Here, weight vector r_{OSC} consists of weights of individual variables contributions to the construction of a space orthogonal to Y, which will be used to relate the input matrix X to the OPLSR vectors in the last step. Practically, it is necessary to limit the number of OSC components. To avoid overfitting, which results in a loss of model generality, the number of OSC components should not be too high. Mostly, one or two OSC components are sufficient. The first OSC component often describes a baseline correction, and the second can serve as the correction of multiplicative effects [22]. In this study, we chose one OSC component in consideration of the intuitive interpretation and the sufficient amount of its variability in X in practice.
Now that X^{OSC} is found, we apply PLS to X^{OSC} and Y. The reason PLS is applied to X^{OSC} and not directly to X is that the covariance between X and Y equals that between X^{OSC} and Y:
because t_{OSC} is orthogonal to Y by (3). Thus, the PLS finds a few score vectors of X^{OSC} that maximize the covariance between X^{OSC} score vectors and Y score vectors. The scores of X^{OSC},t=[ t_{1} t_{2} ⋯ t_{A} ] (of size n×A) corresponding to A predictive components, are not only orthogonal, but also obtainable by computing its associated weight vectors r=[ r_{1} r_{2} ⋯ r_{A} ] from the singular value decomposition of the covariance matrix (X^{OSC})^{⊤}Y:
Accordingly, the corresponding loading of X^{OSC}, denoted by P, is given by P=(X^{OSC})^{⊤}t, meaning new weights of X^{OSC} are directly evaluated by t. Similarly, the loading of Y, denoted by Q, is computed by Q=Y^{⊤}t and represents new weights of Y that lead to maximization of the covariance matrix using the A predictive components commonly applied to X^{OSC} and Y [23].
Overall, X^{OSC} and Y are decomposed into the following:
where u is the score matrix of Y given by u=YQ; and E and F are residual matrices.
By regressing Y on the score t, which is a compact representation of X^{OSC} based on the A predictive components, from the regression model Y∼tβ, the leastsquares estimate of the coefficient β becomes \(\hat {\beta } = (\mathbf {t}^{\top }\mathbf {t})^{1}\mathbf {t}^{\top } Y = \mathbf {t}^{\top } Y\) due to the orthogonality of t. Then the prediction model \(\hat {Y}\) is
where \(\hat {\beta }_{OPLSa} = \mathbf {r} Q^{\top }\) is the regression vector using the orthogonal signal corrected X, X^{OSC}. Furthermore, the prediction model \(\hat {Y}\) can be rewritten as follows:
where \(\hat {\beta }_{OPLSb}=\left (I\mathbf {r}_{OSC} \mathbf {p}_{OSC}^{\top }\right) \hat {\beta }_{OPLSa}\) is the regression vector based on X. Both \(\hat {\beta }_{OPLSa}\) and \(\hat {\beta }_{OPLSb}\) consist of weights of individual variables i, i=1,⋯,p, contributing to the change of Y. In other words, a large absolute value of \(\hat {\beta }_{OPLSa,i}\) or \(\hat {\beta }_{OPLSb,i}\) indicates that the ith variable (x_{i}) contributes substantially to the increase or decrease of Y depending on the sign. We notice that the use of \(\hat {\beta }_{OPLSb,i}\) in contrast to \(\hat {\beta }_{OPLSa,i}\) brings comprehensive examination of individual variable effects because it includes the presence of all X directions, both orthogonal and predictive. Thus we propose \(\hat {\beta }_{OPLSb}\) as the final regression coefficient vector, regarding \(\hat {\beta }_{OPLSa}\) as a side source for comparison. This comparison will be demonstrated in a simulation study and a reallife data example that follow. We will obtain the distributions of \(\hat {\beta }_{OPLSa}\) and \(\hat {\beta }_{OPLSb}\) to systemically select variables that significantly contribute to the change of Y.
Permutation tests
Since the OPLS regression vectors, \(\hat {\beta }_{OPLSa}\) and \(\hat {\beta }_{OPLSb}\) are computed, the aim here is to test the significance of individual variables effects on the vectors. Let \(\hat {\beta }_{OPLS}\) denote the two regression vectors generally. Permutation tests, a computerbased resampling method for achieving accuracy measures of statistical estimates from an approximating distribution [24], are widely used in the computation of variable importance and confidence intervals in random forests [25, 26].
The advantage of permutation tests is that it is straightforward to simulate empirical null distributions of complex statistics such as percentiles, odds ratios, or correlation coefficients. The exact distributions of OPLSR vectors are too complex to obtain, and so are even approximations to the distributions. Thus the permutation test is employed to test the significance of individual variables in the OPLSR vector.
To propose a permutation test procedure to test the significance of the \(\hat {\beta }_{OPLS}\) coefficients, the basic procedures are presented as follows:

Step 1
Perform PLS with Y and X^{OSC}.

Step 2
Obtain the observed values of \(\hat {\beta }_{OPLS,i}\) for all i.

Step 3
For each variable i, repeat the following a large number of times (e.g., 999 times):

Step 3.1
Randomize the values within the ith column of X^{OSC}.

Step 3.2
Perform PLS with Y and the permuted X^{OSC}.

Step 3.3
Obtain the realized value of \(\hat {\beta }_{OPLS,i}\)

Step 3.1

Step 4
Estimate the pvalue p_{i} for the ith variable (e.g., (the number of the realized equal to or larger than the observed in terms of absolute values+1)/1000).

Step 5
Choose a significance level α_{pm} so that variables in which p_{i}<α_{pm} are selected.
For the randomization procedure at Step 3.1, one can choose several columns of X randomly, i.e., a portion of the total number of variables in one iteration to speed up the procedure since the total number of variables is usually quite large. We note that the proportion of 0.3 worked well in practice, producing the empirical null distributions of realized \(\hat {\beta }_{OPLS,i}\) that were highly close to normality. In this manner, the realized values of \(\hat {\beta }_{OPLS,i}\) can be regarded as emerging from a collection of the null hypotheses that the variables effects are insignificant. The motivation is that practically we can assume a substantial number of variables are negligible among the large number of variables, which is quite common in spectral data. The coefficients \(\hat {\beta }_{OPLS,i}\) for the variable i become zero in theory if the variable i does not contribute to the construction of response variables Y. Accordingly, the pvalue is computed according to the extreme ‘twotailed’ directions at Step 4. We also note that at Step 4, the observed value is included as one of the possible values of the randomization procedure. To control the increase of Type I error due to multiple testing, a correction can be made at Step 5. For instance, one can adjust the significance level α_{pm} to be α_{pm}/p using the Bonferroni procedure. In this study, we set α_{pm} to 0.05 due to its practicality. As a preprocessing step for the permutation test with significance level α_{f}, we filter out the variables of which the individual correlation measures with Y are weak. This step eliminates unnecessary noisy variables that can act as contributing variables from the whole procedure, so that only filteredin variables are considered. The level of α_{f} is considered as a tolerance level for the OPLSR method and usually set to 0.05. By changing the level of α_{f}, we can control the number of variables included in the permutation test.
Simulation and comparisons
To test the OPLS regressionbased feature selection method, we performed a simulation study. We first generated a data matrix X(40×1000) and a response matrix Y(40×1), comprising 40 samples and 1000 variables per sample. Each element of Y was a Bernoulli random variable with success probability 0.4, so 16 elements of Y were set to 1 on average while the remaining 24 elements were set to 0. To generate 1000 variables, we used a threelayer network structure as shown in Fig. 1a.
In the first layer, the first 30 variables were generated to have high separation in Y: for p=1,…,4,x_{p}∼U(0,1)+0.8−2Y and for p=5,…,9,x_{p}∼U(0,1)−1.2−2Y, where U(a,b) is a random variable from a uniform distribution of range a and b. This means the variables from x_{1} to x_{4} individually had a high positive correlation with Y (labeled 0), whereas x_{5} to x_{9} were highly correlated with Y (labeled 1). For instance, Fig. 1b shows the plots of realizations of x_{1} and x_{5} and the aforementioned pattern that individually separate Y. The first 30 variables in the first layer represent strong individual variables that can identify pathological conditions. For p=10,13,…,28,
where N(μ,Σ) represents a multivariate normal distribution with mean μ and covariance Σ. Figure 1c shows plots of realizations of three variables in the layer, generated as above, jointly discriminate Y. The variables x_{i},i=10,…,30, in the first layer represent strong groupwise variables that strongly discriminate pathological conditions.
The next 90 variables from x_{31} to x_{120} in the second layer and the next 270 variables from x_{121} to x_{390} in the third layer were generated so that those variables contribute to the overall response Y in a composite and aggregate manner. For instance, x_{1} in the first layer is clearly separable by a combination of x_{31},x_{32}, and x_{33} in the second layer. Specifically, the generation of three variables is based on the value x_{1} so that the sum of the three will be close to x_{1} as follows: we set \(x_{31} = \frac {u_{1}}{u_{1} + u_{2} +u_{3}}(x_{1} + \epsilon), x_{32} = \frac {u_{2}}{u_{1} + u_{2} +u_{3}}(x_{1} + \epsilon)\), and \(x_{33} = \frac {u_{3}}{u_{1} + u_{2} +u_{3}}(x_{1} + \epsilon)\) using independently generated u_{1},u_{2}, and u_{3} from U(0,1) and ε from N(0,x_{1}/10). We notice that x_{31}+x_{32}+x_{33}=x_{1}+ε. The remaining 610 variables from x_{391} to x_{1000}, comprising a noise layer, were randomly and independently generated from N(0,1) so as to have a weak correlation with Y. They represent the presence of inherent noise. The adopted threelayers structure is a simulated example of multiple layers of spectral collinearity, interaction, and regulation in complex biological systems. For instance, a biological system for nutritional metabolomics reflects such a layered structure with linked transports [27].
Using the simulated data, we tested the performance of the proposed methods in comparison with the false discovery rate (FDR) method to detect the known variables in the three layers. Focusing on the effects of individual variables and controlling familywise Type I error, FDR serves as a baseline method to compare the OPLSR method with. The two types of OPLS regression vector, \(\hat {\beta }_{OPLSa}\) and \(\hat {\beta }_{OPLSb}\), were considered. The number of predictive components for the OPLS regression model, A as in (6), varied from 1 to 3 so as to determine its effect upon the performance. The filtering level α_{f} for the OPLSR method and the qvalue for FDR varied from 0.01 to 0.05 and 0.10. We repeated this test 1000 times, and in each repetition for each method, the number of variables that were found within the three layers were counted. No variables among the random 610 variables were found for each of the two methods.
To illustrate the performance of the proposed method, Fig. 2 shows illustrations of the empirical null distribution of the realized \(\hat {\beta }_{OPLSb,i}\) and regression vector \(\hat {\beta }_{OPLSb}\). The empirical null distribution of \(\hat {\beta }_{OPLSb,i}\) in Fig. 2a, obtained by the permutation procedure, is closely normally distributed and provides a basis for testing the observed \(\hat {\beta }_{OPLSb,i}\). Figure 2b also shows the empirical null distribution of the realized \(\hat {\beta }_{OPLSb}\) for the variable with the largest confidence interval. This illustration, being the worst case, demonstrates that the distribution of the realized regression coefficients for individual variables can be sufficiently approximated to a normal distribution by the setting. The regression vector \(\hat {\beta }_{OPLSb}\) in Fig. 2c highlights the filteredin variables according to the α_{f}level filtering (green squares), confidence intervals of the permutation procedure (red arrows with dotted lines), and selected variables (black circles) along with the observed regression vector \(\hat {\beta }_{OPLSb,i}\) (blue dots). We notice that observed \(\hat {\beta }_{OPLSb,i}\) for i=1,⋯,390 are substantially larger than those for the rest, which indicates the regression vector reflects the network structure for the data set. The selected variables gather in the first layer as shown in Fig. 2d, which implies the testing procedure of the method suitably separate known important variables. Particularly, we notice that the variables (from x_{1} to x_{8}) are positioned accordingly with their contributions to the labels. For example, the realized value \(\hat {\beta }_{OPLSb,i}\) for x_{1} is negative, implying that the increase of x_{1} results in the increase of label 0 instead of label 1. This is in accordance with the behavior of x_{1} in Fig. 1b. In Fig. 2d we also observe that the filteredin variable 71 (x_{71}), for example, was not selected since the observed \(\hat {\beta }_{OPLSb,i}\) is within the confidence interval of the realized \(\hat {\beta }_{OPLSb,i}\).
Additionally, Table 1 shows the amounts of variation of X^{ORTH},X^{OSC}, and Y in the OPLSR method. The first OSC component and the first PLS component accounted for more than 96% of the total X variation and 99% of the total X^{OSC}, respectively. It is an empirical evidence that using the first OSC component and the first one or two PLS components is sufficient. The first PLS component accounted for more than 99% of the Y variation in the simulation study. In fact, we used the first OSC component when correcting X in (5). Table 2 shows the average numbers of selected variables for OPLSR and FDR according to significance level α. Since the variables in layer 1 are significant, the selected variables in layer 1 mean the recall rate. The OPLSR method is divided into \(\hat {\beta }_{OPLSa}\) (denoted by OPLSR _{a}) and \(\hat {\beta }_{OPLSb}\) (denoted by OPLSR _{b}). We chose significance level α as qvalue for FDR and α_{f} for OPLSR. We note that the levels for the methods are not strictly equivalent by themselves, yet we compare them in that they practically adjust the number of selected variables.
The results for this simulation study show that OPLSR _{a} consistently found more variables than FDR. The performance of OPLSR _{b}, working quite well, was comparable with that of FDR while surpassing FDR for big α.
Experiments
We applied the proposed method in the examination of a highresolution metabolomics data set. We show the use of the proposed method, and furthermore analysis will follow with the next dataset. The code and data are uploaded to the following GitHub url: https://github.com/leegs52/OPLSR. The metabolomics data set consists of 127 samples screened for bile acids by a BioQuant colormetric kit [28]: 64 samples have bile acids present and the others have bile acids absent. The metabolomic profiling was performed using liquid chromatography (LC) coupled to highresolution mass spectrometry by an Orbitrap FTQVelos mass spectrometer. Peak extraction and quantification of ion intensities were performed by an adaptive processing of liquid chromatography mass spectroscopy software package that produced 7068 m/z (mass divided by charge number of ions) values. We aimed to extract significant metabolites, for example the top 1%, that separate bile acid present and bile acid absent. Application of the OPLSR method to the data matrix X of size 127×7068 and response Y of size 127×1 yielded the empirical null distribution of OPLS regression vector \(\hat {\beta }_{OPLSb}\). Figure 3 shows twosided 95% confidence intervals (red rectangles) and significant m/z variables (black circles) using the permutation testing of OPLS regression. The small blue dots represent the OPLS regression coefficients. The proposed method was able to filter out significant variables in that the number of found significant variables (black circles) by the OPLS is 11 while the number of variables (green circles) by individual coefficient testing of regression analysis is 461. The number of significant variables by either Lasso (marked by yellow circles) or FDR (marked by red crosses) are 11, like the OPLS. For information, all three of them found x_{1267},x_{1360}, and x_{2271} the significant variables.
We also applied the proposed method to a nearinfrared (NIR) spectroscopic technique, which is a useful tool for chemical process analysis in research such as pharmaceutical, medical diagnostics, and agrochemical quality. Measured NIR spectroscopic spectra are influenced by external process variables such as temperature, pressure, and viscosity. The difficulty in keeping these variables unchanging and the necessity to change their value during the process (e.g., setting temperature and pressure in batch processes) make it necessary to assess the influence on the NIR spectra. Wülfert et al. took shortwave NIR spectra of ethanol, water, and 2propanol mixtures at different temperatures to assess the influence of those temperatureinduced spectra variations [29]. The proposed method was applied to the 22 spectra of the mixtures at each temperature of 30, 40, 50, 60, and 70 ^{∘}C (n=22×5=110). The used wavelengths are from 580 to 1091 nm, resulting 512 variables (p=512). The overall 110 spectra are shown in Fig. 4a.
The data set consists of the 110 spectra as X (110×512) and temperatures as Y (110×1). For the purpose of testing predictive power, the data set was randomly split into a training set with 70% of the whole for building a prediction model and a test set with the remaining 30% for estimating the predictive quality of that model. Using the selected variables for each method of OPLSR _{a}, OPLSR _{b}, FDR, and Lasso, we carried out regression analysis to predict Y of the test set and calculated the meansquared error (MSE) of prediction and the respective amount of Y variance being described by the model, Q^{2}, as follows:
This was repeated 3,000 times. We also measured the precision, denoted by P, and the number of selected variables, denoted by N, as additional information on the performance of each method. The precision P is the fraction of correctly selected variables compared to all selected variables. For the NIR spectra of temperatures, Fig. 4a shows that the variation between them is highly significant around 980 nm with a peak maximum [30]. By considering each wavelength within the 9301060 nm as the groundtruth features, we verified the selected variables.
The illustrations of the proposed method for the NIR spectra are shown in Fig. 4b and c, which depict the regression vector \(\hat {\beta }_{OPLSb}\) and the empirical null distribution of realized \(\hat {\beta }_{OPLSb,i}\), respectively. The wavelength ranges of the selected variables were from 960 to 970 nm (x_{381} to x_{391}), which corresponds to the free OH 2nd overtone, and from 1049 to 1060 nm (x_{470} to x_{481}), which belongs to the hydrogenbonded OH groups [31]. This finding is consistent with the chemical observation that temperature effects are related to the absorbance of molecule overtones and the cluster size of hydrogenbonded molecules [29]. The empirical null distribution in Fig. 4c, following closely normality, supply the proposed method with the basis of the employed permutation test.
Table 3 shows the comparison results in terms of evaluation metrics and the number of selected variables. We observed that the OPLSR method reliably finds significant variables among the spectra most of the iterations since its precision outperformed FDR and Lasso. It is not surprising because the proposed method examines the effects of variables in a collective manner by considering the covariance structure and the effects of both orthogonal and predictive components at the same time. The regression performance for the variables selected by Lasso achieved relatively high predictive performance. Though Lasso was unsatisfactory in the precision. For all criteria of the precision, MSE, and Q^{2}, OPLSR _{b} outperformed FDR consistently. The approach OPLSR _{a} also outperformed FDR consistently. The method OPLSR _{b} outperformed OPLSR _{a} for α_{f}=0.01,0.05 in terms of the precision, MSE, and Q^{2}, while there was little difference between the two for α_{f}=0.10.
Additionally, in order to demonstrate the robustness of the proposed method, we conducted downsampling and then performed another experiment under the same conditions of Table 3. The downsampled data set was randomly drawn from NIR spectra data (n=110), and it contains 80 samples, i.e., about 70 percent of total. For the downsampled data, Table 4 presents the performance comparison for the four methods, demonstrating the degree of robustness.
Conclusions
We presented a feature selection method based on orthogonalsignal corrected PLS regression vectors to identify significant variables associated with the response characteristics. The proposed OPLSR method integrates PLS with orthogonal signal correction and permutation tests. To remove unnecessary variation in the input variables and improve interpretability of PLS regression vectors, orthogonal signal correction was applied first. The orthogonalsignal corrected PLS procedure reflects the variable interrelationships under complex systems, which are easily represented in a network structure. The two types of regression vectors from the model, carrying information on variables contribution to response characteristics, were derived and investigated in both a simulation study and a reallife spectra study in contrast to FDR and Lasso. To select the significant variables from the regression vectors, we applied a permutation test that generates empirical null distributions of variable effects on the response characteristics. The adopted permutation test was provided with the filtering rate, a predefined tolerance level for the whole selection procedure for eliminating unnecessary noisy variables, and was implemented efficiently by taking advantage of a collection of insignificant variables. Through simulations, we observed that the proposed method well captured the predefined network structures and successfully found the known variables. We demonstrated this method with reallife metabolomics and NIR spectra data, the finding variables that achieve a good level of predictive power and accurately relate to the chemical observations. For future research, we hope to investigate the effect of imbalance classes in the feature selection.
Availability of data and materials
The implemented MATLAB package is available from the following url address: https://github.com/leegs52/OPLSR
References
Cai J, Luo J, Wang S, Yang S. Feature selection in machine learning: A new perspective. Neurocomputing. 2018; 300:70–9.
Zhu Z, Ong YS. Memetic algorithms for feature selection on microarray data. In: International Symposium on Neural Networks. Berlin Heidelberg: SpringerVerlag: 2007. p. 1327–35.
Alsberg BK, Woodward AM, Winson MK, Rowland JJ, Kell DB. Variable selection in wavelet regression models. Anal Chim Acta. 1998; 368(12):29–44.
Joe Qin S. Statistical process monitoring: basics and beyond. J Chemometr. 2003; 17(89):480–502.
Holmes E, Antti H. Chemometric contributions to the evolution of metabonomics: mathematical solutions to characterising and interpreting complex biological NMR spectra. Analyst. 2002; 127(12):1549–57.
Lindon JC, Holmes E, Nicholson JK. Pattern recognition methods and applications in biomedical magnetic resonance. Prog Nucl Magn Reson Spectrosc. 2001; 39(1):1–40.
Park YH, Kong T, Roede JR, Jones DP, Lee K. A biplot correlation range for groupwise metabolite selection in mass spectrometry. BioData Min. 2019; 12(1):4.
Gerlach RW, Kowalski BR, Wold HOA. Partial leastsquares path modelling with latent variables. Anal Chim Acta. 1979; 112(4):417–21.
Brás LP, Lopes M, Ferreira AP, Menezes JC. A bootstrapbased strategy for spectral interval selection in PLS regression. J Chemometr. 2008; 22(1112):695–700.
Jiang JH, Berry RJ, Siesler HW, Ozaki Y. Wavelength interval selection in multicomponent spectral analysis by moving window partial leastsquares regression with applications to midinfrared and nearinfrared spectroscopic data. Anal Chem. 2002; 74(14):3555–65.
Heise HM, Bittner A. Rapid and reliable spectral variable selection for statistical calibrations based on PLSregression vector choices. Fresenius J Anal Chem. 1997; 359(1):93–9.
Høskuldsson A. Variable and subset selection in PLS regression. Chemometr Intell Lab Syst. 2001; 55(12):23–38.
Faber NKM. Uncertainty estimation for multivariate regression coefficients. Chemometr Intell Lab Syst. 2002; 64(2):169–79.
Wehrens R, Van der Linden WE. Bootstrapping principal component regression models. J Chemometr Soc. 1997; 11(2):157–71.
Wold S, Antti H, Lindgren F, Öhman J. Orthogonal signal correction of nearinfrared spectra. Chemometr Intell Lab Syst. 1998; 44(12):175–85.
Fearn T. On orthogonal signal correction. Chemometr Intell Lab Syst. 2000; 50(1):47–52.
Svensson O, Kourti T, MacGregor JF. An investigation of orthogonal signal correction algorithms and their characteristics. J Chemometr. 2002; 16(4):176–88.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995; 57(1):289–300.
Kim SB, Chen VCP, Park Y, Ziegler TR, Jones DP. Controlling the false discovery rate for feature selection in highresolution NMR spectra. Stat Anal Data Min. 2008; 1(2):57–66.
Vinzi VE, Chin WW, Henseler J, Wang H, et al. Handbook of partial least squares, vol 201. Berlin Heidelberg: SpringerVerlag; 2010.
Trygg J, Wold S. Orthogonal projections to latent structures (OPLS). J Chemometr Soc. 2002; 16(3):119–28.
Westerhuis JA, de Jong S, Smilde AK. Direct orthogonal signal correction. Chemometr Intell Lab Syst. 2001; 56(1):13–25.
de Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemometr Intell Lab Syst. 1993; 18(3):251–63.
Efron B, Tibshirani R. An introduction to the bootstrap, vol 57. Boca Raton: CHAPMAN & HALL/CRC CRC Press LLC; 1993.
Ishwaran H, Lu M. Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Stat Med. 2019; 38(4):558–82.
Tsuzuki S, Fujitsuka N, Horiuchi K, Ijichi S, Gu Y, Fujitomo Y, Takahashi R, Ohmagari N. Factors associated with sufficient knowledge of antibiotics and antimicrobial resistance in the Japanese general population. Sci Rep. 2020; 10(1):1–9.
Jones DP, Park Y, Ziegler TR. Nutritional metabolomics: progress in addressing complexity in diet and health. Annu Rev Nutr. 2012; 32:183–202.
Neujahr DC, Uppal K, Force SD, Fernandez F, Lawrence C, Pickens A, Bag R, Lockard C, Kirk AD, Tran V, et al. Bile acid aspiration associated with lung chemical profile linked to other biomarkers of injury after lung transplantation. Am J Transplant. 2014; 14(4):841–8.
Wülfert F, Kok WT, Smilde AK. Influence of temperature on vibrational spectra and consequences for the predictive ability of multivariate models. Anal Chem. 1998; 70(9):1761–7.
Zhang G, Cui Q, Liu G. Efficient nearinfrared quantum cutting and downshift in Ce3+Pr3+ codoped SrLaGa3S6O suitable for solar spectral converter. Opt Mater. 2016; 53:214–7.
Bonanno AS, Olinger JM, Griffiths PR. The origin of band positions and widths in near infrared spectra. Near Infrared Spectroscopy: bridging the gap between data analysis and NIR applications. London: Ellis Horwood. 1992;:19–28.
Acknowledgments
Not applicable.
Funding
None to declare
Author information
Authors and Affiliations
Contributions
GL performed analyses on the simulated data and participated in the writing of the manuscript. KL implemented the algorithms, motivated the research problem, and designed the study. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Lee, G., Lee, K. Feature selection using distributions of orthogonal PLS regression vectors in spectral data. BioData Mining 14, 7 (2021). https://doi.org/10.1186/s13040021002403
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13040021002403