Sparse generalized linear model with L0 approximation for feature selection and prediction with big omics data

Background Feature selection and prediction are the most important tasks for big data mining. The common strategies for feature selection in big data mining are L 1, SCAD and MC+. However, none of the existing algorithms optimizes L 0, which penalizes the number of nonzero features directly. Results In this paper, we develop a novel sparse generalized linear model (GLM) with L 0 approximation for feature selection and prediction with big omics data. The proposed approach approximate the L 0 optimization directly. Even though the original L 0 problem is non-convex, the problem is approximated by sequential convex optimizations with the proposed algorithm. The proposed method is easy to implement with only several lines of code. Novel adaptive ridge algorithms (L 0ADRIDGE) for L 0 penalized GLM with ultra high dimensional big data are developed. The proposed approach outperforms the other cutting edge regularization methods including SCAD and MC+ in simulations. When it is applied to integrated analysis of mRNA, microRNA, and methylation data from TCGA ovarian cancer, multilevel gene signatures associated with suboptimal debulking are identified simultaneously. The biological significance and potential clinical importance of those genes are further explored. Conclusions The developed Software L 0ADRIDGE in MATLAB is available at https://github.com/liuzqx/L0adridge. Electronic supplementary material The online version of this article (doi:10.1186/s13040-017-0159-z) contains supplementary material, which is available to authorized users.


Background
Integrating multilevel molecular and clinical data to design preventive, diagnostic, and therapeutic solutions that are individually tailored to each patient's requirements is the ultimate goal of precision medicine. However, the huge number of features makes it neither practical nor feasible to predict clinical outcomes with all omics features directly. Thus, selecting a small subset of informative features (biomarkers) to conduct association studies and clinical predictions has become an important step toward effective big data mining. Statistical tests or univariate correlation analysis for feature selection ignore the interacting relationship among genes. To evaluate the predictive power of the features, one appealing approach for feature selection is L 0 regularized sparse modeling, which penalizes the number of nonzero features directly. L 0 is known as the most essential sparsity measure and has nice theoretical properties. However, it is computational impossible to perform an exhaustive search when analyzing omics data sets with millions of features. L 0 penalized optimization is known to be NP-hard in general (Lin et al. 2010).
One common strategy for feature selection is to replace the non-convex L 0 with the L 1 norm. L 1 is a convex relaxation and loose approximation of L 0 . Although L 1 penalized sparse models [1] can be solved efficiently, the estimators with L 1 are penalized too much and asymptotically biased. In addition, L 1 inclines to select more spurious features than necessary, and may not always choose the true model consistently [2]. Theoretically, L 1 never outperforms L 0 by a constant [3]. Depending on the location of true optimum, L 1 may perform much worse than L 0 [4,5]. As a result, the convex relaxation techniques have been shown to be suboptimal in many cases [6]. More recent approaches aimed to reduce bias and overcome discontinuity include the non-convex SCAD [7] and MC+ [8]. However, none of the existing algorithms directly approximate the L 0 optimization problem. Either SCAD or MC+ has been rarely used for feature selection in big data analytics because of their computational intensity with multiple tuning parameters. On the other hand, recent research works including ours show that sparse regression models with L 0 penalty (local solution) outperforms L 1 (global solution) by a substantial margin [5,[9][10][11].
Debulking cytoreductive surgery is a standard treatment for ovarian cancer. The goal of debulking is to remove as much visible cancer as possible. However, if tumor nodules have invaded vital organs, surgeons may not be able to remove them without compromising the patient's life. Leaving tumor nodules larger than 1 cm is defined as suboptimal debulking (cytoreduction). It has been shown that suboptimal debulking is associated with reduced chemosensitivity and poor survival in ovarian cancer. Biomarkers derived from multi-omics data may help physicians decide which patients should undergo surgery and which should be treated with chemotherapy first [12][13][14]. Identifying biomarkers from multi-omics data has been an exciting but challenging task. Sparse modeling is one of the important approaches for simultaneous phenotype prediction and biomarker identification. In this paper, we propose a L 0 penalized generalized linear regression (GLM) for feature selection and prediction. Adaptive ridge algorithm (L 0 ADRIDGE) is developed to approximate L 0 penalized GLM with sequential convex optimization and is efficient in handling ultra high-dimensional omics data. The proposed method outperforms other cutting-edge convex and non-convex penalties including L 1 , SCAD and MC+ with simulations. When applied to the important suboptimal debulking prediction problem in ovarian cancer, the proposed approach identifies multilevel molecular signatures through mining methylation, microRNA and mRNA expression data jointly from TCGA. The identified molecular signatures are further evaluated using public databases.

Materials and methods
Given an input X N×P , where N P, and output Y, we have a generalized linear model with canonical link in the following form: where G is a canonical link function. Different link functions lead to different models. For instance, a logit link function leads to logistic regression, while an exponential link function leads to Poisson regression.

L 0 penalized GLM
The distribution of Y in GLM is assumed to be from the exponential families with the following probability (density) function: where φ is a dispersion parameter, and different functions A( * ), B( * ) and C( * ) are for different distributions Y [15]. The corresponding mean and variance are: Dropping the constants A(φ), and C(Y i , φ), we have the simplified log likelihood as follows: Hence, L 0 penalized error function to minimize is where |β| 0 = P j=1 I(β j = 0) is the number of nonzero elements in β, μ i = G(θ i ) and which is equivalent to the following system: Given η and θ i = x t i β, the derivative of E w.r.t. β is where indicates element-wise division. The Hessian matrix is The Newton-Raphson iteration for β is Different link functions will lead to different regression models as shown as in Table 1.
Other GLMs such as negative binomial, gamma, and inverse Gaussian can be implemented accordingly with a different V (μ). When dealing with big data problems with N P,where N is the number of samples and P is the number of parameters, the inverse of a P × P matrix is time-consuming and computational challenging. We proposed an Table 1 Link functions for linear, logistic and Poisson regression models in GLM, where different models have different A( * ), B( * ), and C( * ) Poisson regression exp(θ) exp(θ) log μ efficient algorithm to calculate the inverse of a much smaller N × N matrix as follows (Liu et al. 2015): So that when N P, we have a much efficient estimation: The adaptive ridge algorithm (L 0 ADRIDGEA) is implemented in MATLAB are as follows: The L 0 ADRIDGE Algorithm: Given a λ > 0, ε = 1e − 6, and training data {X, y}, The algorithm is easy to implement and very efficient for either small sample size and large dimension or large sample size and small dimension big data problem. The regularized parameter λ can be determined either by cross-validation or by AIC and BIC with λ = 2 and λ = log(N), respectively. We further discuss that the proposed method is a L 0 approximation and converges to L 0 when the number of iterations m → ∞.
Algorithm justification: Given a high-dimensional big feature matrix X N×P (N P) and a threshold γ for the coefficient estimates, L 0 rejects all the coefficient estimates below γ to 0 and keeps the large coefficients unchanged. This is the same as defining a binary vector s =[ . . . , 1, 0, . . . , 1] t , with the value of 0 or 1 for each feature, where s j = 1 if the coefficient estimate for that feature is above the threshold γ , and 0 otherwise. Let S = diag(s) be a matrix with s on its diagonal, we have the selected feature matrix X S = XS. We can build the standard models with the matrix X S , if we know s in advance. For instance, we can estimate the coefficients of a GLM with L 2 regulation given X S and Y with where and X t S VX S = SX t VXS = SX t VX because of the special structure of matrix S. It is guaranteed that the estimate is 0 for feature j with s j = 0. However, in reality we do not know s. Estimating both s and θ is an NP-hard problem, since we need to solve a mixed-integer optimization problem. Comparing Eq. (7) with Eq. (5), β new = (DX t VX + λI) −1 DX t Z, it is clear that S is replaced by D and a binary s j is approximated by a continuous η 2 j in proposed algorithm. Therefore, the proposed method is a L 0 approximation. Recall the iterative system in Eq. (3), note that each feature is penalized by a different penalty, which is inversely proportional to the squared magnitude of that parameter estimator η j . i.e., , and η j = β j .
Smaller β j will lead to larger λ j . A tiny β j , will become smaller and λ j will be getting larger in each iteration of L 0 ADRIDGE algorithm. β j → 0, and λ j → ∞. On the other hand, a larger β j will lead a finite λ j , and nonzero β j , when the number of iteration goes to ∞. The solution of L 0 ADRIDGE will converge to that of Eq. (7), because the effect of nonzero η j will be canceled out in Eq. (5). Note that our proposed methods will find a sparse solution with a large number of iterations and small ε, even though the solution of L 2 regularized modeling is not sparse. Small parameters (β j s) become smaller at each iteration and will eventually go to zero (below the machine ). We can also set a parameter to 0 if it is below predefined ε = 1e − 6 to speed up the convergence of the algorithm.

Simulations
Poisson Regression: Our first simulation was used to evaluate the performance of our method for high dimensional Poisson regression. The data was generated from Poisson distribution with different sample sizes (N) and dimensions (P). However, only features 1, 5, 10 and the constant term are used to generate the Poisson counts with [ β 0 , β 1 , β 5 , β 10 ] =[ 1, 0.5, 0.5, 0.4]. The count Y is generated with Y = Poisson(μ), where mean μ = exp(βX). The proposed method is compared with the glmnet ( [16] and SparseReg package [17,18]. glmnet and SparseReg implemented the elastic net, SCAD, and MC+ penalties with an efficient path algorithm. We compare the performance of our approach with L 1 (glmnet), SCAD and MC+ using the popular BIC (λ = log(N)) criteria. Our L 0 ADRIDGE is compared to the glmnet for L 1 and SparseReg for both SCAD and MC+. The results of different methods are presented in Table 2. Table 2 shows that our L 0 ADRIDGE consistently achieved the best performance with BIC and different sample sizes and dimensions. With BIC, although MC+ has the lowest square root of mean squared error (rMSE), and fits the data better, L 0 ADRIDGE achieves the least absolute bias |β − β|, highest percentage of identified true model (PTM), and lowest false discovery rate (FDR) under different simulation settings. The average number of selected features (ANSF) with L 0 ADRIDGE is also closest to the true number 4. Particularly, L 0 ADRIDGE found 100% true model with the lowest average absolute bias (0.086) under the dimension of P = 10, 000 and sample size of N = 500, indicating that the proposed approach is efficient under extra-high dimensional setting. Another interesting finding is that the square root of mean squared errors and absolute biases with L 0 ADRIDGE did not vary much across different simulation setting, indicating the robustness of the proposed approach. Moreover, L 0 ADRIDGE with BIC is slightly faster than different routines implemented in glmnet and SparseReg in computational time. Finally, BIC apparently is not a good model selection criteria for L 1 , SCAD and MC+. More features are selected than necessary. A larger λ is needed for selecting the correct model. We reported the results with a larger λ on Additional file 1: Table S1, and demonstrated that both SCAD and MC+ can achieve a much smaller FDR, but a larger absolute bias and rMSE.

Logistic regression:
The logistic regression data was generated with the coefficients of [ β 1 , β 5 , β 10 ] =[ 0.5, 0.5, −0.4], respectively, and the remaining coefficients were set to zero. The score z = Xβ + ε, where ε is the random noise with the signal to noise ratio of 4. Then, the probability y is generated from the logistic function y = 1/(1 + e −z ). Note that y is the true probability instead of binary (1/0) in this simulation. Unlike the previous example, the optimal values of λ in this simulation were selected with the standard 5-fold cross-validation. We divided the λ from λ min = 1e − 4, to λ max into 100 equal intervals in log-scale, then chose the optimal λ with the smallest test error. The simulation was also repeated 100 times. The computational results were reported in Table 3. The values in the parenthesis are the positive/negative standard deviation. Table 3 shows that L 0 ADRIDGE outperforms L 1 , SCAD and MC+ with a substantial margin under the 5-fold cross-validation. Cross-validation is a standard tool for parameter selection in machine learning. L 0 ADRIDGE achieved the smallest test square root of mean squared error, least absolute biases, the lowest FDR, and highest percentages of identified true models The average number of selected features are 3.33 and 3.41 for the dimensions of 100 and 1000, respectively, which are the closest to the true number of features 3. In contrary, L 1 , SCAD and MC+ selected unnecessary features.  once, and MC+ only identified the true model 2 times for the dimension of 100, indicating the super performance of L 0 ADRIDGE under cross-validation. Finally, L 0 ADRIDGE is robust. The test square root of mean squared error and other performance measures did not vary much when the dimension increased from 100 to 1000. It is worth noting that our proposed method performs well with the popular statistical model selection criteria such as BIC and cross-validation. Other popular methods such as L 1 , SCAD, and MC+ select more features than necessary with such criteria. Therefore, many popular packages including the commercial MATLAB usually choose a larger λ one standard deviation above the minimum test error with cross-validation, which is arbitrary and leads to larger bias. To overcome such bias in parameter estimation, some packages re-estimate the parameters with the selected features and standard GLM model. Unlike these methods, our proposed method performed much better without any postprocessing. Finally, the algorithm is very robust with different initialization. With N = 100, P = 1000 and 100 times of different randomized initialization, we achieved the trMSE of 0.437(±.003), average absolute bias of 0.0763(±.07), ANSF of 3.39(±1.154), PTM of 85% and FDR of 6.9%, which is quite similar to the results with a fixed initialization.

TCGA ovarian cancer data
The Cancer Genomic Atlas (TCCA) has generated a large amount of next generation sequencing and other omics data for ovarian adenocarcinoma (OC). In this study, we conducted integrated analysis of RNA-seq, miRNA expression, promoter methylation, and debulking status data from 367 OC patients. There are 342 microRNAs, 13,911 mRNA expression (in FPKM), and 21,985 promoter methylation values available. We first normalized different omics data and screened the debulking associated microRNA, mRNAs, and methylation promoters with the P-values of less than 0.01 with the training data only. Based on the central dogma of biology, suboptimal debulking is associated with microRNA expression, gene expression, and DNA methylation; gene expression is a function of microRNA expression and DNA methylation; and microRNA expression is regulated by DNA methylation. L 0 Logistic regression was used for suboptimal debulking prediction, while L 0 penalized Poisson regression was used for gene expression and microRNA expression prediction with FPKM. FPKM, representing fragments per kilobase of exon per million fragments mapped, measures the normalized read counts for RNA-seq. Three-fold cross validation was used for gene selection and validation. We reported the gene signatures with the best predicted area under the ROC curves (AUCs). Molecular signatures that are directly or indirectly associated with suboptimal debulking are shown in Fig. 1. Figure 1 indicates that there are 16 gene signatures including 7 mRNAs and 9 epigenetic markers directly associated with debulking status. Even though there is no microRNA directly associated with debulking, eight microRNA signatures are indirectly associated with debulking through their association with mRNA signatures. Moreover, there are additional 18 epigenetic markers indirectly associated with debulking. The 7 mRNAs directly associated with debulking are EIF3D, PPP1R7, ADA, HSD17B1, SRBD1, ZNF621, and BARX1, where EIF3D, PPP1R7, BARX1 and ZNF621 have positive correlations and the other 3 genes have negative correlations with suboptimal debulking. Among the 7 mRNAs, ADA (Adenosine Deaminase) is a well-studied gene in ovarian neoplasms. ADA levels were found to be significantly higher in patients with ovarian cancers as compared with benign ovarian tumors [19]. ADA has been regarded as a potential biomarker for diagnosis and an agent for the treatment of ovarian cancer [20]. Other mRNAs such as BARX1, EIF3D, PPP1R7, and HSD17B1 are also known to be associated with different cancers or other diseases. At the microRNA level, there are 8 microRNAs indirectly associated with debulking including mir-183, let-7b, mir-9-1, mir-377, mir-202, mir-758, Fig. 1 Gene signatures associated with suboptimal debulking, where nodes in red: mRNA signatures; nodes in green: microRNA signatures; nodes in pink: methylation signatures, and edges in red: positive partial correlation; edges in blue: negative partial correlation mir-375, and mir-30c-2. While let-7b, mir-30c-2, and mir-377 are positively correlated with suboptimal debulking through mRNAs ADA and BARX1 indirectly, the other 5 microRNAs have indirectly negative correlations with suboptimal debulking. Seven of eight microRNAs except for mir-758 are known to be associated with ovarian cancer. Particularly, let-7b is known to be an unfavorable prognostic biomarker and predict of molecular and clinical subclasses in high-grade serous ovarian carcinoma, and it may also be useful for discriminating between controls and patients with serous ovarian cancer [21,22]. Mir-183 is known to be associated with multiple cancers. It regulates target oncogene (Tiam1), and reduce the migration, invasion and viability of ovarian cancer cells [23]. Finally, at the DNA level, nine epigenetically modified genes directly associated with debulking are SSX1, TBR1, ZNF621, ORC3L, COL22A1, SPEF2, SSU72, EEF1D, and ZNF621, where EEF1D, SSU72, and ORC3L are positively associated with suboptimal debulking, while 6 other epigenetic genes are negatively correlated with suboptimal debulking. In addition, 18 other epigenetic genes indirectly associated with debulking may also have biological implications. Finally, integration of multi-omic data increases the prediction power substantially. Besides analyzing three types of omics data together, we performed the same three-fold cross validation for gene expression, methylation, and microRNA expression separately. The AUC curves are in Fig. 2. Figure 2 shows that the best predicted AUC over 100 simulations for integrated data is 0.88, while the best predictive AUCs for gene expression, methylation, and microRNA over 100 simulations are 0.81, 0.84, and 0.76, respectively. The AUC with integrated data achieved the highest AUC, indicating the importance of multi-omics data mining. Genes selected with mRNA, microRNA, and methylations separately are reported in the supplementary document. In addition, we also compare the selected features and the same number of top genes identified with statistical test. The results are reported on Additional file 1: Table S2, and demonstrate that although individual genes are more statistically significant, combination of a panel of genes with standard logistic regression has less predictive power and test AUC (0.79).