Effective hybrid feature selection using different bootstrap enhances cancers classification performance

Abdelwahed, Noura Mohammed; El-Tawel, Gh. S.; Makhlouf, M. A.

doi:10.1186/s13040-022-00304-y

Research
Open access
Published: 30 September 2022

Effective hybrid feature selection using different bootstrap enhances cancers classification performance

Noura Mohammed Abdelwahed ORCID: orcid.org/0000-0001-9656-3286¹,
Gh. S. El-Tawel² &
M. A. Makhlouf¹

BioData Mining volume 15, Article number: 24 (2022) Cite this article

3218 Accesses
1 Citations
1 Altmetric
Metrics details

Abstract

Background

Machine learning can be used to predict the different onset of human cancers. Highly dimensional data have enormous, complicated problems. One of these is an excessive number of genes plus over-fitting, fitting time, and classification accuracy. Recursive Feature Elimination (RFE) is a wrapper method for selecting the best subset of features that cause the best accuracy. Despite the high performance of RFE, time computation and over-fitting are two disadvantages of this algorithm. Random forest for selection (RFS) proves its effectiveness in selecting the effective features and improving the over-fitting problem.

Method

This paper proposed a method, namely, positions first bootstrap step (PFBS) random forest selection recursive feature elimination (RFS-RFE) and its abbreviation is PFBS- RFS-RFE to enhance cancer classification performance. It used a bootstrap with many positions included in the outer first bootstrap step (OFBS), inner first bootstrap step (IFBS), and outer/ inner first bootstrap step (O/IFBS). In the first position, OFBS is applied as a resampling method (bootstrap) with replacement before selection step. The RFS is applied with bootstrap = false i.e., the whole datasets are used to build each tree. The importance features are hybrid with RFE to select the most relevant subset of features. In the second position, IFBS is applied as a resampling method (bootstrap) with replacement during applied RFS. The importance features are hybrid with RFE. In the third position, O/IFBS is applied as a hybrid of first and second positions. RFE used logistic regression (LR) as an estimator. The proposed methods are incorporated with four classifiers to solve the feature selection problems and modify the performance of RFE, in which five datasets with different size are used to assess the performance of the PFBS-RFS-RFE.

Results

The results showed that the O/IFBS-RFS-RFE achieved the best performance compared with previous work and enhanced the accuracy, variance and ROC area for RNA gene and dermatology erythemato-squamous diseases datasets to become 99.994%, 0.0000004, 1.000 and 100.000%, 0.0 and 1.000, respectively.

Conclusion

High dimensional datasets and RFE algorithm face many troubles in cancers classification performance. PFBS-RFS-RFE is proposed to fix these troubles with different positions. The importance features which extracted from RFS are used with RFE to obtain the effective features.

Peer Review reports

Introduction

Artificial intelligence (AI) is a science that plays an important role in all fields, especially in the biomedical field, and it aims to simulate reality [1, 2]. Different AI applications have been applied in this field for 20 years due to many factors, including the availability of different datasets in this field, computer devices with high capabilities and arithmetic algorithms [2]. AI has great importance, as a survey has proven that it has great effectiveness in health, and it will outperform the performance of specialists in this field. In addition, it has proven effective in cancer research [2]. Furthermore, AI has become providing human specialists with many information and accordingly, the decision is taken, as it has become one of the most important elements in the medical team [2]. It also works to improve accuracy, speed up diagnosis and discover features or genes affecting cancer as recommendations for human specialists to take into consideration [2]. AI is considered a second decision that helps the specialist make their decision [2]. AI differs from the manual method because it provides human specialists with more information and details. Its diagnosis is more accurate and efficient and does not require more labor.

The manual method may be stressful for the patient, as it puts him under great pressure and takes more time to know the results of the sample, which makes him tense [3]. Cancer has become very widespread in recent times, as it has become a major cause of disease and death [4]. It can be defined as a group of more than one disease due to abnormal cell growth or changes in genes, and it can occur anywhere in the body [5]. Many factors cause cancer including [6]: - (1) tobacco consumption, (2) poor diet, (3) lack of physical activity, (4) alcohol, (5) radiation, (6) infection, (7) genetic factors, (8) smoking and (9) age [6]. There are many different types of human cancer, but in this paper, we used some types that included Breast Invasive Carcinoma (BR), Bladder urothelial carcinoma (BL), Colon and rectum (CO), Glioblastoma multiform (GB), Head and neck squamous cell (HN), Kidney renal clear-cell (KI), Parkinson’s disease (PD), Prostate adenocarcinoma (PRAD) and Lung adenocarcinoma (LUAD).

There are enormous problems in big datasets involved in the features numbers, fitting time, classification accuracy, and model performance. Feature selection is a process for selecting the most relevant features and discarding insignificant ones. Feature selection plays a vital role in many directions to enhance the model performance [7,8,9]. This process aims to select the most relevant subset r features from the original R features set (r < R) in given datasets [9]. R includes all features in a dataset. It suffers from many problems included in high dimension, noisy, repetitive and over-fitting. The ineffective features are deleted. These features diminish the classification accuracy and waste time. By deleting irrelevant features, all previous problems are solved and improved. Feature selection procedures have three major types: filter, wrapper [9, 10], and embedded [11]. Filter procedure selects the features by evaluating their relevance of features. These features are ranked in decreased order, and low-ranking features are omitted to obtain the most relevant features [12]. The filter approach can use many measures included in gain ratio, mutual information based feature selection (MIFS), information gain based feature selection (IGF), relaxed functional dependencies [9], and chi-square [10]. This procedure does not depend on any machine learning and is faster than the wrapper procedure. Despite its simplicity, it suffers from an over-fitting problem. The best subset of features is selected depending on machine learning to estimate this subset [9, 10]. This procedure suffers from expensive computationally when applied on high dimensions. On the other hand, it guarantees to select the most relevant and effective subset of features. Feature selection is an integral part of the classification model in the embedded procedure. It is embedded in the phase of learning [11]. This procedure has many advantages, including being less computationally expensive, reducing over-fitting problems, and selecting the most accurate features. In this direction, we adopted the integration of wrapper procedure with embedded one to select the relevant features using proposed methods to minimize the previous drawbacks and maximize the classification accuracy.

Selecting influencing features is an effective step in the classification process to obtain accurate results. Many datasets always suffer from high dimensions problems, which negatively affect the model performance’s accuracy. The feature selection step is considered one of the processes that positively impact solving many problems facing different datasets. In this direction, many authors applied different feature selection algorithms to minimize processing time, over-fitting, maximize classification accuracy and find the most relevant features, which still need more researches to improve. Therefore, there are numerous different methods for feature selection to fix the previous drawbacks included in the filter, wrapper, and embedded methods. The filter method is simple, and it selects the features based on their ranking according to a class. Still, it suffers from over-fitting problems in high dimensions datasets and disregards feature dependencies. Elsadek et al. [12] proposed a method using IGF to classify six human cancer types based on DNA copy number variation (CNV) dataset. The proposed method selected 16,381 features as the most relevant features. More than one learning algorithm is applied, such as logistic regression (LR), support vector machine (SVM), random forest (RF), J48, neural network, bagging and dagging. LR learning algorithm achieved the best classification accuracy of about 85% and ROC area 0.965. Rajit et al. [13] proposed selecting best and select percentile filter methods. The proposed method used a breast cancer dataset. There are more than one learning algorithms are used. LR classifier achieved a better result. Furthermore, many filter methods are proposed by Pinar Yildirim [14]. Different filter methods are applied in Cfs Subset eval, principal component analysis (PCA), consistency subset eval, IGF, One-R attribute eval, and relief attribute eval. The proposed method used the Hepatitis datasets and proved that the Consistency Subset, IGF, One-R Attribute Eval, and Relief Attribute Eval filter methods achieved better results. In addition, Alirezanejad et al. [15] proposed a filter method for gene selection using two heuristic methods. These methods, namely, Xvariance and mutual congestion. The Xvariance gave the best results with the standard datasets, while mutual congestion enhanced the accuracy of high-dimensional datasets. Kuswanto et al. [16] proposed a comparison method for feature selection using different filtering methods. Three filtering methods included in MIFS, correlation based feature selection (CFS) and fast correlation based feature selection (FCBF) are applied. The results of these methods are forwarded to K-nearest neighbors (KNN) classifer. The results showed that the FCBF selected a small number of features, while other methods performed well. Furthermore, Ghasemi et al. [17] proposed a method using IGF and gini index to select important features. These features are used to early predict of heart disease. This proposed method aimed to minimize the dimension and maximize the performance of the diagnosis of heart disease with less medical experiments. Mahmood [18] proposed a method to minimize a dimension for facial expression recognition dataset. Two feature selection methods are applied to obtain minimum number of features included in Chi-Square and Relief-F. These methods selected the first highest six features. Four different classifiers are applied to evaluate the performance. In addition, Spencer et al. [19] proposed a method to predict heart disease dataset. Four proposed methods are used for feature selection included in ReliefF, Chi-squared, symmetrical uncertainty and PCA. Different machine learning classifiers are applied to create models for comparison. The best prediction with less subset of features is selected using Chi-Square. Mohamed et al. [20] proposed a method to obtain the most important subset of feature rather than the whole dataset. Chi-square, IG and Bat algorithm are applied for feature selection. Many varieties of classifiers are used to evaluate the model performance. Vikas et al. [21] proposed a method to minimize processing time and maximize classification accuracy using lung cancer detection. To select the most relevant features, Chi-square algorithm is applied. Two different classifiers are used to evaluate the performance included in SVM and RF.

Many authors applied wrapper methods to solve the optimization problems and to get the most important subset features using different datasets. AH et al. [22] proposed an algorithm using the wrapper approach. The proposed algorithm enhanced the basic salp swarm algorithm (SSA) to improve reliability, convergence speed, and classification accuracy. The algorithm was enhanced by adding inertia weight to achieve better results. Hegazy et al. [9] used the hybrid wrapper method by applying chaotic maps to improve the performance of the salp swarm algorithm (SSA) and overcome its drawbacks. To control the exploitation/exploration rates, they used five chaotic maps. The proposed algorithm (CSSA) was applied on twenty-seven datasets and gave the best results. Although it gave the best results using twenty-seven datasets, it did not achieve good results using high-dimensional datasets. Sanaa et al. [8] proposed a wrapper method included in particle swarm optimization (PSO) and genetic algorithm (GA) to classify six human cancers types using DNA CNV dataset. The hybrid proposed method was applied to minimize the features and maximize the classification accuracy. It selected 2051 features from 16,381 features. The selected features achieved 84.6% classification accuracy. However, it suffered from many problems included in over-fitting, fitting time, relevant features, and classification accuracy. RFE is considered a wrapper method for feature selection. It suffers from time-consuming, especially when using big data. Li et al. [23] proposed fixing the support vector machine recursive feature elimination (SVMRFE) problem. They first proposed random value-based oversampling as a resampling method. The proposed variable step size (VSSRFE) to speed up the feature selection process. Another method is proposed called linear SVM (LLSVM). The two proposed methods are used together for feature selection. Jeon et al. [24] proposed a hybrid RFE method using benchmark datasets. This proposed method used SVM-RFE, random forest RFE (RF-RFE), and gradient boosting machines RFE (GBM-RFE) methods which combined the feature-importance-based RFE methods. There were two types of weighting functions used in the proposed methods. The first type sums the weight of three proposed RFE methods, and the second one reflects the classification accuracies and weights of features. Rani et al. [25] proposed a hybrid wrapper method by integrating GA and RFE algorithms. This method is compared with other feature selection methods. The proposed method improved the classification performance after canceling irrelevant features. Zvarevashe et al. [26] proposed a method to select the most relevant subset features using RFE algorithm based on RF. The proposed method was compared with a deep learning algorithm. It proved its powerful for selecting features. Senan et al. [27] proposed a method to select the relevant features using RFE algorithm for a kidney disease dataset. Four classification algorithms are applied for the classification step. The RF algorithm gave the best results.

Many researchers used a hybrid method which combined filter and wrapper methods to select relevant features, but it had many limitations that filter method may cancel important features and wrapper methods take more time. High dimensional is another limitation when applying this hybrid [28]. Ansari et al. [10] used filter and wrapper approaches as a feature selection process. They proposed two different hybrid methods. F-score feature ranker and Chi-square feature ranker are applied in the first method and took the intersection between them. The intersection between these features is applied to obtain the most important features. The results of the intersection process are applied on binary particle swarm optimization (BPSO) as a feature optimization approach. In the second one, after the intersection between features, RFE approach is applied. Zhang et al. [7] proposed a method to classify six human cancer types using CNV level values. Zhang selected the features using the methods of mRMR (minimum Redundancy Maximum Relevance Feature selection) and IFS (Incremental Feature Selection). The first method selected features by ranking the importance of these features. This method selected 200 features. The second method used IFS to select the optimal set of features. IFS selected 19 features with an accuracy value 0.75. However, this proposed method gave insufficient classification accuracy. Pirgazi et al. [29] proposed a hybrid method using filter and wrapper for feature selection in high dimensional datasets. In the first stage, they applied a filter method using the Relief method to weight the features. In the second stage, they applied a wrapper method using shuffled frog leaping algorithm (SFLA) and IWSSr algorithms. Mandal et al. [30] proposed a hybrid method for feature selection using the filter and wrapper method. They applied MIFS, ReliefF, Chi-Square, and Xvariance for the filter method. The union for four filter methods is applied to obtain the most important features. The wrapper method is applied using Whale Optimization Algorithm to overcome any limitation in the filter method. Venkatesh et al. [31] proposed a hybrid method using MIFS as a filter method and RFE as a wrapper method. The hybrid method gave better results than the individual algorithms. Gakii et al. [32] proposed comparison methods using three algorithms for feature selection included in the PCA, RFE and graph-based feature selection. The results proved that the graph-based feature selection enhanced the performance of sequential minimal optimization and multilayer perceptron classifiers. In addition, researchers applied a hybrid method using the advantages of both wrapper and embedded methods to obtain the most effective features to solve the drawbacks in the previous studies. Liu et al. [28] proposed a hybrid method using GA as a global search with an embedded regularization approach as a local search. They proposed this method to solve the over-fitting problems and select relevant features. It is compared with individual algorithms, proving its effectiveness for feature selection. Aruna et al. [33] proposed a hybrid method using LR and RFE algorithms for the diabetes dataset. The RFE is based on LR as an estimator. The RF is applied for a classification step. Venkatachalam et al. [34] proposed a hybrid method that combined the ridge regression and RFE algorithms. It solved the problem of over-fitting for feature selection. The proposed method is compared with other models. RF is applied for the classification step.

Due to the previous research gaps, this paper presents the proposed method PFBS-RFS-RFE with three positions to fix feature selection problems and improve the classification model over different datasets. It tries to enhance many issues included in time consuming using RFE algorithm, classification accuracy, over-fitting problems, fitting time and select the most effective features to know the chromosome that is considered the most developing human cancers in the datasets. Furthermore, we applied a resampling method to enhance the classification accuracy and improve the over-fitting problem [35]. The bootstrap is a resampling method that reduces the variance and bias between features; therefore, the over-fitting problem is minimized, and classification accuracy is maximized. We utilize PFBS as a resampling step with the hybrid RFS-RFE to reduce the over-fitting problem and improve the classification accuracy. We compared the proposed methods with RFE, RFS, and with previous work over five datasets. Four efficient supervised machine learning were used to evaluate the model performance of the proposed hybrid feature selection methods. The main contributions are summarized as follows: -

1.
We propose hybrid methods, namely, positions first bootstrap step random forest selection recursive feature elimination (PFBS-RFS-RFE) based on feature selection that combines the advantages of the wrapper and embedded methods to solve many feature selection problems, including over-fitting, time consuming, relevant features, classification accuracy and solving the problem in RFE algorithm, which suffers from time-consuming with high-dimensional datasets.
2.
The motivation behind the proposed methods is to know the genes or features associated with cancers; therefore, we can know the chromosome that is considered the most developing human cancers by taking the average number of runs and the intersection between features.

The structure of the article is as follows. The “Introduction” section presents the feature selection troubles and how previous work tried to solve them. The “Results” section presents the results of hybrid algorithm and the comparison with other studies using the same datasets. The “Discussion” section summarizes and discusses the application of the hybrid algorithm. The “Conclusions” section presents the main idea and the importance of the proposed methods. The “Method” section presents the hybrid algorithm to enhance and solve these troubles.

Results

The hybrid proposed methods applied two important stages included in feature selection and model performance. They are applied using proposed datasets to select the effective cancer genes and improve the drawbacks included in over-fitting and classification accuracy. The selected features are utilized to feed more than one classifier using 10 cross-validations. The proposed classifiers are LR, support vector machine (SVM), RF and bagging (Bagg). The proposed method is compared with the individual algorithm such as RFE and RF and with the previous work. The proposed methods confirmed the results.

Performance metrics

Performance evaluation is a very important step in machine learning. Selecting the most relevant features increases the classification accuracy and decreases the classification error. We proposed a hybrid method to obtain the accurate classification value, therefore; we fixed any previous drawbacks. The proposed methods are compared with individual algorithms included in RFE and RFS using the following metrics: -

The size of feature selection: - is the number of selected features.
Processing time: - is the time of the fitting process in second.
Performance accuracy is the percentage of the samples that are correctly evaluated by a classifier.
Performance evaluation included: - Precision, F1-score, Recall, variance, Receiver operating characteristic (ROC) area, and Area under curve (AUC) [8, 12] is used to measure the classification performance by plotting the relationship between True Positive (TP) and False Positive (FP) rates.
The calculation formula is applied to evaluate the model performance using ensemble and regularization classifiers with 10 cross-validation. Table 1 presents the meanings of the symbols that used in the proposed methods. The calculation formula is as follows: -

Table 1 The meanings of the symbol

Effective hybrid feature selection using different bootstrap enhances cancers classification performance

Abstract

Background

Method

Results

Conclusion

Introduction

Results

Performance metrics

Parameter setting

Numerical results and discussion

Comparison with other studies

Discussion

Conclusions

Method

Dataset description

The proposed hybrid feature selection methods

First bootstrap step as a resampling method

Feature selection using random Forest (RFS)

Recursive feature elimination (RFE)

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BioData Mining

Contact us