A feature selection method based on multiple kernel learning with expression profiles of different types

Background With the development of high-throughput technology, researchers can acquire large amounts of expression data of different types from several public databases. Because most of these datasets have a small number of samples and hundreds or thousands of features, extracting informative features from expression data effectively and robustly with feature selection techniques is both challenging and crucial. So far, many feature selection approaches have been proposed and applied to analyse expression data of different types. However, most of these methods measure performance on only one type of expression data, using classification accuracy or error rate.
Results In this article, we propose a hybrid feature selection method based on Multiple Kernel Learning (MKL) and evaluate its performance on expression datasets of different types. Firstly, the relevance between features and sample classes is measured using the optimizing function of MKL. In this step, an iterative gradient descent process optimizes both the parameters of the Support Vector Machine (SVM) and the kernel confidence. Then, a set of relevant features is selected by sorting the optimizing function value of each feature. Furthermore, we apply an embedded forward selection scheme to detect compact feature subsets within the relevant feature set.
Conclusions We compare not only the classification accuracy with other methods, but also the stability, similarity and consistency of the different algorithms. The proposed method shows a satisfactory feature selection capability on expression datasets of different types under these different performance measurements.
Electronic supplementary material The online version of this article (doi:10.1186/s13040-017-0124-x) contains supplementary material, which is available to authorized users.

informative features from expression data remains a challenging and crucial problem. Feature selection technology has been widely studied and applied in pattern recognition, statistical analysis, data mining and machine learning [6]. In the last decade, feature selection has become an important tool for expression data analysis in bioinformatics, with applications such as cancer classification, biological network inference, expression correlation analysis and disease biomarker identification [7]. The features (mRNAs or miRNAs) of given expression data can be broadly categorized into three major types: relevant features, redundant features and irrelevant features [8].
In general, most feature selection methods can be divided into three categories: filter methods, wrapper methods, and embedded methods [7]. These categories differ in how the feature selection search is combined with the construction of the classification model. Filter methods, which are independent of the classifier, select relevant features depending only on the intrinsic properties of the expression data. Glaab et al. applied an ensemble filter method which combines several selection schemes into an ensemble feature ranking [9]. Cai et al. proposed a feature weighting algorithm that estimates the feature weights through local approximation rather than global measurement. Experimental results on both synthetic and real microarray datasets showed that the algorithm was effective when combined with classic classifiers [10]. Cao et al. proposed a filter feature selection method for paired microarray expression data analysis [11].
In wrapper approaches, features are scored by a classifier during the selection process, so the feature selection step depends on the classifier. So far, many wrapper feature selection methods have been proposed and used for expression data analysis. Mukhopadhyay et al. combined a multi-objective genetic algorithm and an SVM classifier as a wrapper for evaluating the chromosomes that encode miRNA feature subsets [12]. Maulik et al. presented a fuzzy preference based rough set method for feature selection from microarray gene expression data. Compared with signal-to-noise ratio and consistency based feature selection methods, experimental results showed that the method was effective in extracting gene markers [13].
In embedded approaches, the step of selecting an optimal feature subset is built into the classifier construction, and the selection can be seen as a search in the combined space of feature subsets and hypotheses. With the increase of available expression data sources, several embedded feature selection methods have been presented to analyze expression data. Chen et al. proposed a feature selection approach using the information provided by the separating hyperplane and support vectors [14]. Mao et al. proposed a unified feature selection framework based on a generalized sparse regularizer for multivariate performance measures [15]. Li et al. proposed a new feature selection algorithm called feature weighting as regularized energy-based learning. The experiments using microarray data demonstrated that the ensemble method, when using the L2 regularizer, outperforms other algorithms in stability while providing comparable classification accuracy [16]. Kursa compared four state-of-the-art Random Forest-based feature selection methods in the gene selection context on microarray datasets, and found that the Boruta algorithm was the best one when the number of consistently selected genes was considered [17]. Yousef et al. developed a method for selecting significant genes, which uses K-means to identify correlated gene clusters and applies the scores of those gene clusters for classification [18]. Tang et al. presented a two-stage Recursive Feature Elimination (RFE) algorithm, which can effectively eliminate most of the irrelevant, redundant and noisy genes, and select informative genes in different stages [8]. Niijima et al. suggested a recursive feature elimination model based on Laplacian linear discriminant analysis for feature selection [19]. However, these RFE-based methods may obtain satisfactory performance only when hundreds of features are retained.
Such a large number of features (mRNAs or miRNAs) is difficult to use in fields such as clinical cancer diagnosis or experimental identification of cancer biomarkers.
In recent years, several hybrid feature selection approaches have also been proposed for expression data analysis. Chuang et al. proposed a feature selection method which combines an improved particle swarm optimization with the K-nearest neighbor method and support vector machine classifiers [20]. Mundra et al. developed a hybrid feature selection method by combining the filter method of minimum-redundancy maximum-relevancy (MRMR) and the wrapper method of support vector machine recursive feature elimination (SVM-RFE) [21]. Du et al. proposed a multi-stage feature selection method for microarray expression data analysis [22].
Though most of the above methods can eliminate irrelevant genes and rank informative genes effectively, they are only suitable for expression data from one type of expression profile. Most of them construct the feature selection model directly on one type of expression data and rarely consider the effectiveness and stability on expression data from different types of transcriptome. In this paper, we propose a novel two-stage feature selection method which combines multiple kernel learning (MKL) [23,24] with a forward feature selection procedure to select the relevant feature subset, eliminate redundant features and select compact feature subsets. We refer to the proposed method as Simple MKL-Feature Selection (SMKL-FS). In its first stage it eliminates irrelevant features and selects relevant features by the score of each individual feature, and in its second stage it eliminates redundant features by the forward selection procedure.
One objective of feature selection is to avoid overfitting and improve the performance of the classifier [7]. Overfitting is one of the challenging problems with gene expression data, which are high dimensional and have small sample sizes. We therefore take the following measures to reduce the influence of overfitting on small samples. Firstly, we use the SimpleMKL method, which solves the MKL problem through a primal formulation involving a weighted l2-norm regularization. The regularization term adds a cost to the objective function for bringing in more features. Hence, regularization can shrink the coefficients of many variables to zero and decrease overfitting. Secondly, we use a sequential forward selection (SFS) method, which belongs to the deterministic methods and has a lower overfitting risk than randomized methods [7]. In addition, we use cross validation in the performance measurement to identify methods that perform poorly because of overfitting during training on several datasets.
In the following, we outline the main steps of SMKL-FS. Firstly, we measure the relevance of each feature to the classification of samples by using the optimizing function of MKL. More specifically, we use an iterative gradient descent process to optimize both the parameters of the SVM and the kernel confidence, and obtain the optimizing function value of each feature. Then, we select the relevant feature set by sorting the optimizing function values of the features. Furthermore, we apply an embedded forward selection scheme to detect compact feature subsets within the relevant feature set. Unlike wrapper approaches, which are coupled with a classifier and minimize the classification error of that classifier, we use the optimizing function of MKL instead of the classification error to carry out the embedded process. The idea of this process is similar to the minimum-redundancy process in mRMR [25]. In addition to evaluating the classification accuracy of the method, we measure the performance of different feature selection algorithms through the stability of the feature space on different samples of the same type of data, the similarity with other methods and the consistency between expression data of miRNA and mRNA.
The main characteristics of our proposed algorithm include: (i) a novel feature selection method for identifying gene signatures based on multiple kernel learning, focusing on multiple types of expression data, such as mRNA microarray, mRNA sequencing and miRNA sequencing; (ii) an evaluation of the performance of different methods using classification accuracy, stability of the feature space, similarity with other methods and consistency between expression data of miRNA and mRNA. Experimental results show that the proposed method has a satisfactory feature selection capability on different expression datasets compared to other state-of-the-art feature selection approaches.

Results
To measure the performance of the embedded method, we use three kernel functions. In a practical application, different kernels can be combined. The features are selected and evaluated using 10-fold Cross-Validation (CV) on a variety of datasets through different feature selection methods including SVM-RFE [26], SVM-RCE [18], mRMR [25], IMRelief [10], SlimPLS [27] and SMKL-FS. We measure the performance of the different feature selection algorithms by evaluating the classification accuracy of feature combinations, as well as the stability of the feature space on different samples of the same type of data and the similarity with other methods.

Data sources and pre-processing
In this paper, three types of expression data are used to measure the performance of feature selection methods. We only use the paired samples in expression datasets, which include tumor and adjacent non-tumor tissues. The mRNA microarray datasets are obtained from Gene Expression Omnibus (GEO) [1], and the mRNA sequencing and miRNA sequencing datasets are downloaded from The Cancer Genome Atlas (TCGA) [4]. Eight types of cancer are covered by the microarray datasets used in this article, and each type of cancer contains several datasets (series in GEO). Table 1 gives detailed information on the eight cancer types of the mRNA microarray datasets from GEO, and Table 2 gives detailed information on the eight cancer types from TCGA. To use these expression data for measuring the performance of different feature selection methods, the data downloaded and reorganized from GEO and TCGA are converted into our defined data format and preprocessed as follows. Firstly, the missing values of each expression dataset are estimated. If an mRNA (or miRNA) has missing values in fewer than 20% of all samples, these missing values are estimated using the local least squares imputation (LLSimpute) method [28]. Then, the different probes of the same mRNA (or miRNA) are merged by taking the maximum expression value of these probes for each sample. After these steps, the datasets are normalized by the median absolute deviation (MAD) method so that all samples have a similar background [29]. Normalization across different microarrays is performed using housekeeping genes, as in a previous article [30].
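The probe merging and MAD steps described above can be sketched in a few lines. This is a minimal illustration only: the function names are hypothetical, the MAD normalization is assumed to mean "subtract the sample median and divide by the sample MAD" (the text does not give the formula), and LLSimpute itself is not implemented, only the 20% eligibility check.

```python
from statistics import median


def can_impute(values, threshold=0.2):
    """True if the fraction of missing entries (None) is below the threshold."""
    missing = sum(v is None for v in values)
    return missing / len(values) < threshold


def merge_probes(probe_rows):
    """Merge probes of the same gene by the per-sample maximum.

    probe_rows: iterable of (gene_id, [value per sample]) pairs.
    Returns a dict mapping gene_id to its merged per-sample values.
    """
    merged = {}
    for gene, values in probe_rows:
        if gene in merged:
            merged[gene] = [max(a, b) for a, b in zip(merged[gene], values)]
        else:
            merged[gene] = list(values)
    return merged


def mad_normalize(sample):
    """Center one sample on its median and scale by its MAD (assumed form)."""
    med = median(sample)
    mad = median(abs(v - med) for v in sample)
    if mad == 0:
        raise ValueError("MAD is zero; the sample cannot be scaled")
    return [(v - med) / mad for v in sample]
```

For instance, two probes of a gene with values `[1, 5]` and `[3, 2]` across two samples merge to `[3, 5]`.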

Performance measurement of feature space
The performance measurement of the feature space is important for evaluating different feature selection algorithms. Most state-of-the-art algorithms validate their performance only through the classification accuracy [26] or classification error [31] obtained on the selected feature set by a classifier C. The classification accuracy and classification error are defined respectively as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Error} = \frac{FP + FN}{TP + TN + FP + FN}$$
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. However, computing only the classification ability of the selected features cannot reflect the performance of feature selection algorithms comprehensively.
In this paper, we measure the performance of different feature selection algorithms by evaluating the classification accuracy of single features and feature combinations, as well as the stability of the feature space on different samples of the same type of data, the similarity with other methods and the consistency between expression data of miRNA and mRNA. We select and evaluate features using 10-fold Cross-Validation (CV) on the datasets mentioned above through different feature selection methods: SVM-RFE [26], SVM-RCE [18], mRMR [25], IMRelief [10], SlimPLS [27], OSFS [32], FGM [33] and our method SMKL-FS. Firstly, for each testing dataset, we randomly select 90% of the samples as the training dataset and the other 10% as the test dataset. Repeating the selection process 10 times, we obtain a collection of 10 groups, each containing training and test samples. To ensure fairness, each feature selection method selects its feature subset on the training samples of the same 10 groups. Then, the ten features selected by each method are evaluated according to the above criteria.
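The two measures follow directly from the four confusion-matrix counts named above. The sketch below is illustrative; the function names and the convention that label `1` is the positive class are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for binary labels."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def error_rate(tp, tn, fp, fn):
    """Fraction of samples classified incorrectly (1 - accuracy)."""
    return (fp + fn) / (tp + tn + fp + fn)
```

By construction the two values always sum to one, so reporting either is equivalent.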
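The resampling described above (ten random 90%/10% partitions shared by all methods) can be sketched as follows. The function name, the fixed seed and the rounding of the test-set size are assumptions made for illustration.

```python
import random


def repeated_holdout_splits(n_samples, n_repeats=10, test_fraction=0.1, seed=0):
    """Return n_repeats (train_indices, test_indices) pairs.

    Each repeat draws a fresh random partition; fixing the seed lets every
    feature selection method be run on exactly the same groups.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        test = sorted(indices[:n_test])
        train = sorted(indices[n_test:])
        splits.append((train, test))
    return splits
```

Generating the splits once and passing them to every method is one simple way to realize the fairness requirement stated in the text.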

Classification accuracy of features combination
For two feature sets $S^1_n$ and $S^2_n$ and the above classifier C, we consider the feature space of $S^1_n$ more effective if the classification accuracy on $S^1_n$ is higher than that on $S^2_n$ using classifier C. If the method $M_1$ generates a series of feature subsets in $S^1_n$: $S^1_1 \subset S^1_2 \subset \dots \subset S^1_{n-1} \subset S^1_n$, and the method $M_2$ generates a series of feature subsets in $S^2_n$: $S^2_1 \subset S^2_2 \subset \dots \subset S^2_{n-1} \subset S^2_n$, we compute the classification accuracy on each $S^1_k$ and $S^2_k$ in the same way as [8]. If the average of these classification accuracies on $S^1_n$ is higher than that on $S^2_n$, we consider the method $M_1$ better than $M_2$ in mean effectiveness. If the maximum of these classification accuracies on $S^1_n$ is higher than that on $S^2_n$, we consider the method $M_1$ better than $M_2$ in max effectiveness.
In our verification, we set the n of feature set $S^1_n$ to 10 and compare the effectiveness of the feature spaces from different methods using an SVM classifier. For the feature subsets in $S^1_{10}$: $S^1_1 \subset S^1_2 \subset \dots \subset S^1_9 \subset S^1_{10}$ generated by method $M_1$, we compute the classification accuracy on $S^1_k$ for every k ($1 \le k \le 10$). Then the mean effectiveness and max effectiveness of method $M_1$ are measured by the average and maximum classification accuracies on $S^1_{10}$. The results of mean effectiveness and max effectiveness on the three types of datasets for the different methods are shown in Tables 3, 4 & 5 and Additional file 1: Table S1, respectively.
The mean effectiveness and max effectiveness of SMKL-FS are better than those of the other methods on most miRNA sequencing and mRNA microarray datasets, and slightly lower than those of mRMR on mRNA sequencing data. The good performance of mRMR [25] on gene expression data may be attributed to the method having been designed specifically for this type of data. We also see that FGM [33] is the best general-purpose method, with satisfactory performance on different types of gene expression data. The accuracies on each of $S^1_1, S^1_2, \dots, S^1_9, S^1_{10}$ on the three types of datasets for the different methods are shown in Additional file 2: Figure S1, Additional file 3: Figure S2 and Additional file 4: Figure S3, respectively. In each subgraph, the X-axis represents different feature sets … (see Additional file 5: Table S2). In Additional file 5: Table S2, the choice of individual kernel affects the results only weakly, and the method using multiple kernels has the best results on the majority of the datasets.
In a practical application, the first step can be skipped. However, because irrelevant features are present, the results obtained with only the second step are not always better than those obtained after removing the irrelevant features, and the process has high computational complexity. Considering the computational complexity, we only test the … Table S3. From the table, we can see that the results of using only the second step are not better than those obtained when some features are filtered in the first step, and that applying the second step to all features has high computational complexity.
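The mean and max effectiveness of one method can be computed from its ranked feature list as sketched below. Here `evaluate_accuracy` is a hypothetical callable standing in for training and testing the SVM classifier on a given feature subset; it is not part of the original text.

```python
def effectiveness(ranked_features, evaluate_accuracy, n=10):
    """Mean and max accuracy over the nested subsets S_1, S_2, ..., S_n.

    ranked_features: features ordered by the selection method, best first.
    evaluate_accuracy: callable mapping a list of features to an accuracy.
    """
    if len(ranked_features) < n:
        raise ValueError("need at least n ranked features")
    # S_k holds the first k ranked features, so S_1 is contained in S_2, etc.
    accuracies = [evaluate_accuracy(ranked_features[:k]) for k in range(1, n + 1)]
    mean_eff = sum(accuracies) / len(accuracies)
    max_eff = max(accuracies)
    return mean_eff, max_eff, accuracies
```

Comparing two methods then amounts to comparing the first (or second) returned value obtained on the same training and test groups.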

Stability of feature space
The stability of the feature space generated by a feature selection algorithm reflects the robustness of the method on different samples of the same type of data [34]. For a list of feature sets $S^{11}_n, S^{12}_n, \dots, S^{1k}_n$ generated by method $M_1$ on different samples $\Omega_1, \Omega_2, \dots, \Omega_k$ (each $\Omega$ is a subset of X) of dataset D and another list of feature sets $S^{21}_n, S^{22}_n, \dots, S^{2k}_n$ generated by method $M_2$ on samples $\Omega_1$, … we consider the method $M_1$ better than $M_2$ in union stability of feature space. For every two samples $\Omega_i, \Omega_j \in \{\Omega_1, \Omega_2, \dots, \Omega_k\}$, let $R^1_{ij} = |S^{1i}_n \cap S^{1j}_n| / |S^{1i}_n \cup S^{1j}_n|$ and $R^2_{ij} = |S^{2i}_n \cap S^{2j}_n| / |S^{2i}_n \cup S^{2j}_n|$. If the average of $R^1_{ij}$ is larger than the average of $R^2_{ij}$, the method $M_1$ is better than $M_2$ in independent stability of feature space.
In our verification, we set the n of feature sets $S^{11}_n, S^{12}_n, \dots, S^{1k}_n$ and feature sets $S^{21}_n, S^{22}_n, \dots, S^{2k}_n$ to 100 and use 10-fold cross validation to measure the stability of the feature lists generated by different feature selection methods. Firstly, we randomly choose 90% of the paired samples from each dataset and iterate this process 10 times to obtain 10 different sets for each dataset. Then the different feature selection methods are used to select feature lists on these sets. Finally, we compute the union stability and independent stability according to the process described above.
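The independent stability defined above is the average of the pairwise ratio of intersection size to union size (the Jaccard index) between the feature sets selected on different resamples. A minimal sketch, assuming the average is taken over all unordered pairs of distinct resamples; the function names are illustrative.

```python
from itertools import combinations


def jaccard(a, b):
    """|a ∩ b| / |a ∪ b| for two collections of feature names."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(union)


def independent_stability(feature_sets):
    """Average pairwise Jaccard index over the feature sets of one method."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A value of 1 means the method selected exactly the same features on every resample, and a value near 0 means the selections barely overlap.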
The results of union stability on three types of datasets through different methods are shown (See Additional file 7: Table S4). From Additional file 7: Table S4, the union stability of SMKL-FS is better than those from other methods on most datasets. The results of independent stability on three types of datasets through different methods are shown in Figs. 1, 2 and 3, respectively. In Figs. 1, 2, 3, the X-axis represents different datasets, and the Y-axis represents independent stability. The independent stability results of SMKL-FS are better than those from other methods on most datasets.

Similarity with other methods
The similarity between the feature space generated by one feature selection algorithm and the feature lists generated by other methods can be used to estimate the availability of the algorithm. Consider the feature set $S^1_n$ generated by method $M_1$ on dataset D and the other feature sets $S^2_n, \dots, S^k_n$ generated by methods $M_2, M_3, \dots, M_k$ on the same dataset D. Let … If the $I_{mean}$ of one method is larger than that of the other methods, the method is better than the other methods in similarity.
In our verification, we set n of feature set $S^1_n$ to 100. Firstly, we select the feature sets …
Fig. 2 The results of independent stability on different mRNA Sequencing datasets

Brief review of SVM
Several supervised learning methods, such as Support Vector Machines (SVMs), can be used to analyze data and recognize patterns by classification and regression analysis. The standard SVM algorithm was proposed by Cortes and Vapnik in 1995 [35]. Given a sample set of data points G = … where $y_i$ is the class label of the sample $x_i$ and the summation is taken over all the training samples. The $\alpha_i$ are the Lagrange multipliers involved in maximizing the margin of separation of the classes. $K(x_i, x)$ is a kernel which can map the feature space to a high dimensional space. There are several popular kernels, such as linear kernel … After obtaining the $\alpha$, we can predict the label of a new data point by the following formula [36]: … and the bias b is defined: …
Fig. 3 The results of independent stability on different miRNA Sequencing datasets

Multiple kernel learning (MKL)
In recent years, several multiple kernel learning (MKL) methods have been proposed to enhance the interpretability of the decision function and improve performance [23,24]. A convenient approach to MKL is to construct the kernel $K(x_i, x)$ as a convex combination of basis kernels [23]:
$$K(x_i, x) = \sum_{m=1}^{M} d_m K_m(x_i, x), \qquad d_m \ge 0, \quad \sum_{m=1}^{M} d_m = 1$$
where M is the number of kernels. Each kernel $K_m$ may be one of the popular kernels mentioned above with different parameters. Each single kernel $K_m$ can use either the full set of training samples or subsets of these samples from different data sources. The problem is then reduced to the choice of the weights $d_m$. The standard primal MKL formulation, whose objective is a simple summation over base kernels subject to mixed-norm regularization, is expressed in a functional form as: … where $f_m$ is a function that belongs to the corresponding Hilbert space $H_m$, and each Hilbert space $H_m$, endowed with an inner product $\langle \cdot, \cdot \rangle_m$, has a unique kernel $K_m$. However, $\|f_m\|_{H_m}$ is not differentiable at $f_m = 0$, so the original objective function is not smooth.
In this article, we apply SimpleMKL [23], which uses a weighted $l_2$-norm regularization to obtain an upper bound of the problem through the Cauchy-Schwarz inequality. The primal formulation can be replaced as: … And the corresponding dual problem is given as follows: … where $\alpha$ and C are the Lagrange multipliers of the constraints related to each data point and their tolerable errors, respectively. Note that the new dual objective function is convex and differentiable with respect to $\alpha$. At each iteration, the kernel coefficients are first held fixed and the value of the objective function is optimized. Then the coefficients are recovered and updated with the above dual variables, and this process repeats until convergence.
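For reference, the quantities discussed in this section can be written in their standard textbook forms as below. This is a sketch that assumes the usual soft-margin SVM dual and the mixed-norm MKL primal from the cited literature; the symbols $n$, $\gamma$ and $\xi_i$, the choice of the two example kernels, and the exact layout of the constraints are assumptions and may differ from the article's own notation. The definition of the bias b is not reproduced because several equivalent conventions exist.

```latex
% Soft-margin SVM dual over training pairs (x_i, y_i), i = 1..n (n is an assumed symbol)
\begin{align}
\max_{\alpha}\quad & W(\alpha) = \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\
\text{s.t.}\quad & \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C
\end{align}

% Two common kernels (gamma > 0 is an assumed bandwidth parameter)
\begin{align}
K_{\mathrm{lin}}(x_i, x) &= \langle x_i, x \rangle, &
K_{\mathrm{rbf}}(x_i, x) &= \exp\!\left(-\gamma \lVert x_i - x \rVert^{2}\right)
\end{align}

% Decision function for a new point x
\begin{equation}
f(x) = \operatorname{sign}\!\Big( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \Big)
\end{equation}

% Mixed-norm primal MKL (xi_i are assumed slack variables)
\begin{align}
\min_{\{f_m\},\, b,\, \xi}\quad & \frac{1}{2} \Big( \sum_{m=1}^{M} \lVert f_m \rVert_{H_m} \Big)^{2}
  + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.}\quad & y_i \Big( \sum_{m=1}^{M} f_m(x_i) + b \Big) \ge 1 - \xi_i, \qquad \xi_i \ge 0
\end{align}
```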
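The weighted $l_2$-norm formulation mentioned above is usually stated as follows in the SimpleMKL literature. The block is a hedged sketch in that standard form, with $\xi_i$ and $n$ as assumed symbols; it is not a verbatim copy of the article's equations. The first line shows the Cauchy-Schwarz step that turns the mixed-norm term into a smooth upper bound, and the last line is the gradient that a descent step on the weights $d_m$ can use once the dual variables $\alpha^{*}$ are known.

```latex
% Cauchy-Schwarz bound, valid for d_m > 0 with sum_m d_m = 1
\begin{equation}
\Big( \sum_{m=1}^{M} \lVert f_m \rVert_{H_m} \Big)^{2}
  \le \sum_{m=1}^{M} \frac{1}{d_m} \lVert f_m \rVert_{H_m}^{2}
\end{equation}

% Weighted l2-norm primal
\begin{align}
\min_{\{f_m\},\, b,\, \xi,\, d}\quad &
  \frac{1}{2} \sum_{m=1}^{M} \frac{1}{d_m} \lVert f_m \rVert_{H_m}^{2} + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.}\quad & y_i \Big( \sum_{m=1}^{M} f_m(x_i) + b \Big) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \\
 & \sum_{m=1}^{M} d_m = 1, \qquad d_m \ge 0
\end{align}

% Dual for fixed weights d_m
\begin{align}
\max_{\alpha}\quad & W(\alpha, d) = \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j
    \sum_{m=1}^{M} d_m K_m(x_i, x_j) \\
\text{s.t.}\quad & \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C
\end{align}

% Gradient of J(d) = max_alpha W(alpha, d) with respect to d_m
\begin{equation}
\frac{\partial J}{\partial d_m}
  = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i^{*} \alpha_j^{*} y_i y_j K_m(x_i, x_j)
\end{equation}
```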

Feature selection algorithm
Similar to other methods [18,31], our algorithm also tries to construct an efficient process to select a compact set of features. Combined with the multiple kernel learning (MKL) method described in the above section, we present a two-stage feature selection method. For expression data with a set of features, there are four major feature categories: relevant features, redundant features, irrelevant features and noisy features. For two types of expression data, the relevant features are only a very small part. Most of the features are irrelevant, and many feature selection methods for expression data analysis remove them first. So, in the first stage of our method, the relevant features are identified by measuring the score of each feature using the optimizing process of MKL. If computational complexity is a concern, only a small set of relevant features needs to be selected in the first stage. In the second stage, an embedded selection scheme, i.e. forward selection, is applied to search for a compact feature subset within the candidate feature set obtained in the first stage.

Selecting the relevant feature set
Firstly, we apply MKL to select the relevant feature set. To implement the MKL approach, we use the SimpleMKL method in [23] to obtain the coefficients $d_m$ of the kernel combination. SimpleMKL uses an iterative gradient descent process to optimize both the parameters of the SVM ($\alpha_i$) and the kernel coefficients ($d_m$). Several kernels can be used, such as the linear kernel $K(x_i, x) = \langle x_i, x \rangle$, the radial basis function (RBF) … Then the optimal objective function is defined as follows: … Using SimpleMKL, we can obtain the J value for each feature from the total feature set S in the process of optimizing $W(\alpha, d_m)$ via $\min_{d_m} \max_{\alpha} W(\alpha, d_m)$. To select the relevant feature set, the J list for the feature list is computed to measure the relevance between features and samples. Finally, we sort the J list in ascending order and obtain the ranked feature list $S_r$. Then, the top $n^*$ features are selected and the feature set $S_{n^*}$ is obtained. The process of selecting the relevant feature set is defined in Additional file 8: Table S5.
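The ranking step above reduces to scoring each feature on its own, sorting in ascending order of J and keeping the first $n^*$ entries. In the sketch below, `objective` is a hypothetical callable standing in for the J value that SimpleMKL would return for a single feature; the optimization itself is not implemented here.

```python
def select_relevant_features(features, objective, n_star):
    """Rank features by their individual J value (ascending) and keep n_star.

    features: list of feature identifiers (e.g. gene or miRNA names).
    objective: callable mapping one feature to its J value (assumed given).
    Returns (selected_features, scores), where scores maps feature -> J.
    """
    scores = {f: objective(f) for f in features}
    ranked = sorted(features, key=lambda f: scores[f])  # smallest J first
    return ranked[:n_star], scores
```

Because Python's sort is stable, features with equal J keep their original relative order, which makes the ranking reproducible.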

Selecting compact feature subsets
An embedded scheme of sequential forward selection is used to search for compact feature subsets within the relevant feature set $S_{n^*}$. In general, wrapper approaches are coupled with a classifier (e.g., SVM) and their goal is to minimize the classification error of that classifier. These wrapper approaches can usually obtain low classification error for their dependent classifiers. However, they have high computational complexity and the selected features generalize less well to other classifiers [31]. We use the following formula instead of classification error to carry out the embedded process.
where Z is the set containing the selected features, such as $Z = \{f_1, f_2, \dots, f_n\}$. In this article, $J_Z$ is calculated using the SimpleMKL method [23], which solves the MKL problem through a primal formulation involving a weighted l2-norm regularization [23]. Then, a forward process is used to select the subset with r features from $S_{n^*}$ in an incremental manner. Initially, the score $J_0$ is set to $+\infty$ and the subset Z is set to empty. We evaluate each feature in the feature subset, such as $f_1, f_2, \dots, f_n$, and compute the objective functions $J_{f_1}, J_{f_2}, \dots, J_{f_n}$ using SimpleMKL. The feature $f_i$ which generates the largest reduction $\Delta J = J_0 - J_{f_i}$ is appended to Z. Then, the algorithm adds to Z the feature $f_j$ from the set $\{S_{n^*} - Z\}$ which generates the largest reduction $\Delta J$. The incremental selection repeats until $\Delta J \le 0$ or the given number of iterations is reached. The process of selecting compact feature subsets is defined in Additional file 8: Table S6.
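The incremental procedure above can be sketched as a greedy loop. Here `objective` is a hypothetical callable that returns the J value of a candidate subset (in the text this value comes from SimpleMKL, which is not implemented in this sketch), and `max_features` plays the role of the given number of iterations.

```python
import math


def forward_select(candidates, objective, max_features):
    """Greedy forward selection driven by reductions of the objective J.

    Starts from an empty set with J = +inf, and at each step adds the
    candidate whose inclusion gives the lowest J. Stops when no candidate
    reduces J (delta J <= 0) or when max_features have been selected.
    """
    selected = []
    current_j = math.inf
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        best_feature, best_j = None, None
        for feature in remaining:
            j = objective(selected + [feature])
            if best_j is None or j < best_j:
                best_feature, best_j = feature, j
        if current_j - best_j <= 0:  # no further reduction of J
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        current_j = best_j
    return selected, current_j
```

A redundant feature tends to be skipped by this loop, because adding it to a set that already contains its correlated partner yields little or no reduction of J.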

Discussion and conclusions
With the development of high-throughput microarray chips and RNA sequencing technology, a large amount of expression data of different types has become available. Researchers can acquire these data from several public databases, such as GEO, SMD, ArrayExpress and TCGA. However, because transcriptomics experiments are costly, most of these datasets have a small number of samples and tens of thousands of genes or hundreds of miRNAs. How to extract informative features from expression data effectively and robustly is a challenging and crucial problem for expression data analysis. Feature selection techniques have been widely applied to select a subset of relevant features and eliminate redundant, irrelevant and noisy features. In general, most feature selection methods can be divided into three categories: filter, wrapper and embedded. Filter methods, which are independent of the classifier, select relevant features relying only on the intrinsic properties of the expression data. Filter methods contain two subclasses: univariate and multivariate. Univariate methods evaluate each feature individually, whereas multivariate methods select features by considering combinations of features. Univariate methods are fast, scalable and independent of the classifier, but they ignore feature dependencies and the interaction with the classifier. Multivariate methods model feature dependencies, are independent of the classifier and have better computational complexity than wrapper methods, but they are slower and less scalable than univariate methods. Wrapper approaches, which can be divided into deterministic and randomized types, generate scores for features and select them based on the classifier. Deterministic methods are simple and have lower computational complexity and a lower risk of overfitting than randomized methods, but they are more prone to ending in a local optimum. Embedded approaches, which have lower computational complexity than wrapper methods, select the optimal feature subset as part of classifier construction, in the combined space of feature subsets and hypotheses.
Most of the above methods construct the feature selection model on a single type of expression data and rarely consider the effectiveness and stability on expression data of different types. To overcome these disadvantages, a hybrid feature selection method based on multiple kernel learning is proposed. We evaluate the performance of the method on expression datasets of different types. In addition to comparing the classification accuracy with other methods, we also compare the performance of different algorithms by measuring stability, similarity and consistency. The experimental results show that the proposed method has a satisfactory feature selection capability on different expression datasets.
Kernel methods and other machine learning methods are prone to overfitting, especially with small sample sizes, and gene expression data are characterized by high dimensionality and small sample size. Commonly used methodologies to avoid overfitting in machine learning are regularization, cross-validation, early stopping and pruning. The regularization term adds a cost to the objective function for bringing in more features. Hence, regularization can shrink the coefficients of many variables to zero and thereby reduce overfitting. Cross validation can identify methods that perform poorly because of overfitting during training on several datasets. The methods of early stopping try to prevent overfitting by