
Effective hybrid feature selection using different bootstrap enhances cancers classification performance

Abstract

Background

Machine learning can be used to predict the onset of different human cancers. High-dimensional data pose complicated problems, including an excessive number of genes, over-fitting, long fitting time, and reduced classification accuracy. Recursive Feature Elimination (RFE) is a wrapper method that selects the subset of features yielding the best accuracy. Despite its high performance, RFE has two disadvantages: computation time and over-fitting. Random forest for selection (RFS) has proved effective at selecting relevant features and mitigating the over-fitting problem.

Method

This paper proposes a method, positions first bootstrap step (PFBS) random forest selection recursive feature elimination (RFS-RFE), abbreviated PFBS-RFS-RFE, to enhance cancer classification performance. It applies a bootstrap at one of three positions: the outer first bootstrap step (OFBS), the inner first bootstrap step (IFBS), and the outer/inner first bootstrap step (O/IFBS). In the first position, OFBS is applied as a resampling step (bootstrap with replacement) before the selection step. RFS is then applied with bootstrap = false, i.e., the whole dataset is used to build each tree. The important features are then passed to RFE, which selects the most relevant subset of features. In the second position, IFBS is applied as a resampling step (bootstrap with replacement) within RFS, and the important features are again passed to RFE. In the third position, O/IFBS combines the first and second positions. RFE uses logistic regression (LR) as its estimator. The proposed methods are combined with four classifiers to address feature selection problems and improve the performance of RFE, and five datasets of different sizes are used to assess the performance of PFBS-RFS-RFE.

Results

The results showed that O/IFBS-RFS-RFE achieved the best performance compared with previous work. For the RNA gene dataset, it reached an accuracy of 99.994%, a variance of 0.0000004, and an ROC area of 1.000; for the dermatology erythemato-squamous diseases dataset, it reached 100.000%, 0.0, and 1.000, respectively.

Conclusion

High-dimensional datasets and the RFE algorithm pose many difficulties for cancer classification performance. PFBS-RFS-RFE is proposed to address these difficulties using different bootstrap positions. The important features extracted by RFS are used with RFE to obtain the effective features.


Introduction

Artificial intelligence (AI) is a science that plays an important role in all fields, especially the biomedical field, and it aims to simulate reality [1, 2]. Different AI applications have been applied in this field for 20 years owing to many factors, including the availability of different datasets, high-capability computing devices, and arithmetic algorithms [2]. A survey has shown that AI is highly effective in health care and that it will outperform specialists in this field; it has also proven effective in cancer research [2]. Furthermore, AI now provides human specialists with a large amount of information on which decisions are based, and it has become one of the most important elements of the medical team [2]. It also helps to improve accuracy, speed up diagnosis, and discover features or genes affecting cancer, which serve as recommendations for human specialists to take into consideration [2]. AI is considered a second opinion that helps specialists make their decisions [2]. AI differs from the manual method in that it provides human specialists with more information and detail; its diagnosis is more accurate and efficient and does not require additional labor.

The manual method may be stressful for patients, because waiting longer for the sample results places them under considerable pressure [3]. Cancer has become widespread in recent times and is now a major cause of disease and death [4]. It can be defined as a group of diseases caused by abnormal cell growth or changes in genes, and it can occur anywhere in the body [5]. Many factors cause cancer, including (1) tobacco consumption, (2) poor diet, (3) lack of physical activity, (4) alcohol, (5) radiation, (6) infection, (7) genetic factors, (8) smoking and (9) age [6]. There are many different types of human cancer; in this paper, we used the following: Breast invasive carcinoma (BR), Bladder urothelial carcinoma (BL), Colon and rectum (CO), Glioblastoma multiforme (GB), Head and neck squamous cell (HN), Kidney renal clear-cell (KI), Parkinson’s disease (PD), Prostate adenocarcinoma (PRAD) and Lung adenocarcinoma (LUAD).

Large datasets raise considerable problems related to the number of features, fitting time, classification accuracy, and model performance. Feature selection is the process of selecting the most relevant features and discarding insignificant ones, and it plays a vital role in enhancing model performance [7,8,9]. It aims to select the most relevant subset of r features from the original set of R features (r < R) in a given dataset [9], where R includes all features in the dataset. The full feature set is typically high-dimensional, noisy, and redundant, and it promotes over-fitting. Ineffective features diminish classification accuracy and waste time, so deleting them mitigates these problems. Feature selection procedures fall into three major types: filter, wrapper [9, 10], and embedded [11]. The filter procedure selects features by evaluating their relevance. The features are ranked in decreasing order, and low-ranking features are omitted to obtain the most relevant ones [12]. The filter approach can use many measures, including gain ratio, mutual information based feature selection (MIFS), information gain based feature selection (IGF), relaxed functional dependencies [9], and chi-square [10]. This procedure does not depend on any machine learning algorithm and is faster than the wrapper procedure. Despite its simplicity, it suffers from over-fitting. In the wrapper procedure, the best subset of features is selected by using a machine learning algorithm to evaluate candidate subsets [9, 10]. This procedure is computationally expensive when applied to high-dimensional data. On the other hand, it guarantees selection of the most relevant and effective subset of features. In the embedded procedure, feature selection is an integral part of the classification model and is embedded in the learning phase [11]. This procedure has many advantages, including lower computational cost, reduced over-fitting, and selection of the most accurate features. Accordingly, we integrated the wrapper procedure with the embedded one to select the relevant features using the proposed methods, in order to minimize the previous drawbacks and maximize the classification accuracy.
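As an illustration of the filter procedure described above, the following minimal sketch ranks discrete features by information gain and orders them from most to least relevant. The feature names, the toy labels, and the helper functions are hypothetical and are not taken from the paper; real expression data would first need to be discretized.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the entropy left after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_features(columns, labels):
    """Return feature names sorted by decreasing information gain."""
    scores = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical): "gene_a" separates the classes, "gene_b" carries no signal.
labels = ["tumor", "tumor", "normal", "normal"]
columns = {
    "gene_b": ["low", "high", "low", "high"],
    "gene_a": ["high", "high", "low", "low"],
}
ranking = rank_features(columns, labels)
```

A filter of this kind never fits a classifier, which is why it is fast; the low-ranking tail of `ranking` is what would be discarded.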

Selecting influential features is an effective step in the classification process for obtaining accurate results. Many datasets suffer from high dimensionality, which negatively affects model accuracy. Feature selection is one of the processes that helps solve many problems facing different datasets. Accordingly, many authors have applied different feature selection algorithms to minimize processing time and over-fitting, maximize classification accuracy, and find the most relevant features, and these aims still need more research. Numerous filter, wrapper, and embedded methods have therefore been proposed to address the previous drawbacks. The filter method is simple and selects features based on their ranking with respect to the class. Still, it suffers from over-fitting in high-dimensional datasets and disregards feature dependencies. Elsadek et al. [12] proposed a method using IGF to classify six human cancer types based on a DNA copy number variation (CNV) dataset. The proposed method selected 16,381 features as the most relevant features. Several learning algorithms were applied, including logistic regression (LR), support vector machine (SVM), random forest (RF), J48, neural network, bagging and dagging. The LR learning algorithm achieved the best classification accuracy of about 85% and an ROC area of 0.965. Rajit et al. [13] proposed the select-best and select-percentile filter methods using a breast cancer dataset. Several learning algorithms were used, and the LR classifier achieved a better result. Furthermore, many filter methods were proposed by Pinar Yildirim [14], including Cfs Subset eval, principal component analysis (PCA), consistency subset eval, IGF, One-R attribute eval, and relief attribute eval.
The proposed method used the Hepatitis dataset and showed that the Consistency Subset, IGF, One-R Attribute Eval, and Relief Attribute Eval filter methods achieved better results. In addition, Alirezanejad et al. [15] proposed a filter method for gene selection using two heuristic methods, namely Xvariance and mutual congestion. Xvariance gave the best results with the standard datasets, while mutual congestion enhanced the accuracy on high-dimensional datasets. Kuswanto et al. [16] compared different filtering methods for feature selection. Three filtering methods were applied: MIFS, correlation based feature selection (CFS) and fast correlation based feature selection (FCBF). The results of these methods were forwarded to a K-nearest neighbors (KNN) classifier. The results showed that FCBF selected a small number of features, while the other methods performed well. Furthermore, Ghasemi et al. [17] proposed a method using IGF and the Gini index to select important features, which were used for early prediction of heart disease. This method aimed to minimize the dimension and maximize the performance of heart disease diagnosis with fewer medical experiments. Mahmood [18] proposed a method to reduce the dimension of a facial expression recognition dataset. Two feature selection methods, Chi-Square and Relief-F, were applied to obtain a minimal number of features. These methods selected the six highest-ranked features. Four different classifiers were applied to evaluate the performance. In addition, Spencer et al. [19] proposed a method to predict heart disease. Four feature selection methods were used: ReliefF, Chi-squared, symmetrical uncertainty and PCA. Different machine learning classifiers were applied to create models for comparison. The best prediction with the smallest subset of features was obtained using Chi-Square. Mohamed et al. [20] proposed a method to obtain the most important subset of features rather than using the whole dataset. Chi-square, IG and the Bat algorithm were applied for feature selection, and a variety of classifiers were used to evaluate the model performance. Vikas et al. [21] proposed a method to minimize processing time and maximize classification accuracy for lung cancer detection. The Chi-square algorithm was applied to select the most relevant features, and two classifiers, SVM and RF, were used to evaluate the performance.

Many authors have applied wrapper methods to solve optimization problems and obtain the most important feature subset using different datasets. AH et al. [22] proposed an algorithm using the wrapper approach. The algorithm enhanced the basic salp swarm algorithm (SSA) by adding an inertia weight to improve reliability, convergence speed, and classification accuracy. Hegazy et al. [9] used a hybrid wrapper method that applies chaotic maps to improve the performance of SSA and overcome its drawbacks. They used five chaotic maps to control the exploitation/exploration rates. The proposed algorithm (CSSA) was applied to twenty-seven datasets and gave the best results, although it did not achieve good results on high-dimensional datasets. Sanaa et al. [8] proposed a wrapper method combining particle swarm optimization (PSO) and genetic algorithm (GA) to classify six human cancer types using a DNA CNV dataset. The hybrid method was applied to minimize the number of features and maximize the classification accuracy. It selected 2051 of 16,381 features, and the selected features achieved 84.6% classification accuracy. However, it suffered from problems with over-fitting, fitting time, feature relevance, and classification accuracy. RFE is a wrapper method for feature selection that is time-consuming, especially on big data. Li et al. [23] proposed fixes for problems of support vector machine recursive feature elimination (SVMRFE). They first proposed random value-based oversampling as a resampling method. They then proposed a variable step size RFE (VSSRFE) to speed up the feature selection process, together with another method called linear SVM (LLSVM). The two proposed methods were used together for feature selection. Jeon et al. [24] proposed a hybrid RFE method using benchmark datasets.
This method combined the feature-importance-based RFE methods SVM-RFE, random forest RFE (RF-RFE), and gradient boosting machines RFE (GBM-RFE). Two types of weighting functions were used. The first sums the weights of the three RFE methods, and the second reflects the classification accuracies and the feature weights. Rani et al. [25] proposed a hybrid wrapper method integrating the GA and RFE algorithms. This method was compared with other feature selection methods and improved the classification performance after removing irrelevant features. Zvarevashe et al. [26] proposed a method to select the most relevant feature subset using the RFE algorithm based on RF. The method was compared with a deep learning algorithm and proved powerful for selecting features. Senan et al. [27] proposed a method to select the relevant features using the RFE algorithm for a kidney disease dataset. Four classification algorithms were applied in the classification step, and the RF algorithm gave the best results.
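Because RFE recurs throughout the works surveyed above, a generic sketch of its elimination loop may help. It is not the implementation of any cited paper: `importance_fn` is a hypothetical callback standing in for "refit the estimator on the current features and read off their weights", and the fixed scores below are toy values. The `step` parameter controls how many features are dropped per round.

```python
def recursive_feature_elimination(features, importance_fn, n_keep, step=1):
    """Drop the `step` least important features per round until `n_keep` remain.

    `importance_fn` is a hypothetical callback: it receives the current feature
    list, refits whatever estimator the caller wraps, and returns {feature: score}.
    """
    remaining = list(features)
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)
        n_drop = min(step, len(remaining) - n_keep)
        # Sort ascending by score so the weakest features come first.
        weakest = sorted(remaining, key=lambda f: scores[f])[:n_drop]
        remaining = [f for f in remaining if f not in weakest]
    return remaining

# Toy importance function with fixed scores; a real wrapper would refit a model
# (for example logistic regression) on `current` and use its coefficients.
FIXED_SCORES = {"g1": 0.9, "g2": 0.1, "g3": 0.5, "g4": 0.3, "g5": 0.7}

def toy_importance(current):
    return {f: FIXED_SCORES[f] for f in current}

selected = recursive_feature_elimination(list(FIXED_SCORES), toy_importance, n_keep=2)
selected_fast = recursive_feature_elimination(list(FIXED_SCORES), toy_importance, n_keep=2, step=2)
```

Each pass through the loop costs one model fit, so with `step=1` the number of fits grows with the number of features to remove. This is the time cost on high-dimensional data noted above, and a larger `step` trades ranking granularity for fewer fits.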

Many researchers have used hybrid methods that combine filter and wrapper methods to select relevant features, but such hybrids have limitations: the filter method may discard important features, and wrapper methods take more time. High dimensionality is another limitation when applying this hybrid [28]. Ansari et al. [10] used filter and wrapper approaches for feature selection and proposed two hybrid methods. In the first method, an F-score feature ranker and a Chi-square feature ranker are applied, and the intersection of their outputs is taken to obtain the most important features. The result of the intersection is then passed to binary particle swarm optimization (BPSO) as a feature optimization approach. In the second method, the RFE approach is applied after the intersection. Zhang et al. [7] proposed a method to classify six human cancer types using CNV level values. Zhang selected the features using mRMR (minimum Redundancy Maximum Relevance feature selection) and IFS (Incremental Feature Selection). The first method ranked the features by importance and selected 200 features. The second method, IFS, selected the optimal set of 19 features with an accuracy of 0.75. However, this method gave insufficient classification accuracy. Pirgazi et al. [29] proposed a hybrid filter and wrapper method for feature selection in high-dimensional datasets. In the first stage, they applied the Relief filter method to weight the features. In the second stage, they applied a wrapper method using the shuffled frog leaping algorithm (SFLA) and the IWSSr algorithm. Mandal et al. [30] proposed a hybrid filter and wrapper method for feature selection. They applied MIFS, ReliefF, Chi-Square, and Xvariance as filter methods.
The union of the four filter methods is taken to obtain the most important features. The wrapper method is then applied using the Whale Optimization Algorithm to overcome limitations of the filter methods. Venkatesh et al. [31] proposed a hybrid method using MIFS as a filter method and RFE as a wrapper method. The hybrid method gave better results than the individual algorithms. Gakii et al. [32] compared three feature selection algorithms: PCA, RFE and graph-based feature selection. The results showed that graph-based feature selection enhanced the performance of the sequential minimal optimization and multilayer perceptron classifiers. In addition, researchers have applied hybrid methods that exploit the advantages of both wrapper and embedded methods to obtain the most effective features and address the drawbacks of previous studies. Liu et al. [28] proposed a hybrid method using GA as a global search with an embedded regularization approach as a local search, in order to solve over-fitting and select relevant features. It was compared with individual algorithms and proved effective for feature selection. Aruna et al. [33] proposed a hybrid method using the LR and RFE algorithms for a diabetes dataset. RFE uses LR as its estimator, and RF is applied in the classification step. Venkatachalam et al. [34] proposed a hybrid method that combined the ridge regression and RFE algorithms and solved the over-fitting problem in feature selection. The method was compared with other models, and RF was applied in the classification step.

To address the previous research gaps, this paper presents the proposed method PFBS-RFS-RFE with three positions to fix feature selection problems and improve the classification model over different datasets. It aims to improve the time consumption of the RFE algorithm, classification accuracy, over-fitting, and fitting time, and to select the most effective features in order to identify the chromosome most associated with developing human cancers in the datasets. Furthermore, we applied a resampling method to enhance the classification accuracy and reduce over-fitting [35]. The bootstrap is a resampling method that reduces the variance and bias between features; therefore, over-fitting is minimized and classification accuracy is maximized. We use PFBS as a resampling step with the hybrid RFS-RFE to reduce over-fitting and improve the classification accuracy. We compared the proposed methods with RFE, with RFS, and with previous work over five datasets. Four efficient supervised machine learning classifiers were used to evaluate the performance of the proposed hybrid feature selection methods. The main contributions are summarized as follows:

  1. We propose hybrid feature selection methods, namely positions first bootstrap step random forest selection recursive feature elimination (PFBS-RFS-RFE), that combine the advantages of the wrapper and embedded methods to address several feature selection problems, including over-fitting, time consumption, feature relevance, and classification accuracy, as well as the long running time of the RFE algorithm on high-dimensional datasets.

  2. The motivation behind the proposed methods is to identify the genes or features associated with cancers; we can therefore identify the chromosome most associated with developing human cancers by averaging over the runs and taking the intersection between the selected features.
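The intersection step mentioned in the second contribution can be sketched as follows. This is a minimal illustration, assuming each run yields a list of selected feature names; the gene identifiers are hypothetical placeholders, and the selection-frequency helper is an addition for illustration rather than something described in the paper.

```python
from collections import Counter

def stable_features(runs):
    """Features selected in every run (set intersection across runs)."""
    return set.intersection(*(set(r) for r in runs))

def selection_frequency(runs):
    """Fraction of runs in which each feature was selected."""
    counts = Counter(f for r in runs for f in set(r))
    return {f: c / len(runs) for f, c in counts.items()}

# Hypothetical gene identifiers; the paper's actual gene lists are not reproduced here.
runs = [
    ["gene_17", "gene_3", "gene_42"],
    ["gene_3", "gene_42", "gene_8"],
    ["gene_42", "gene_3", "gene_99"],
]
core = stable_features(runs)
freq = selection_frequency(runs)
```

Features that survive the intersection are those selected regardless of the random resampling in a given run, which is what makes them candidates for biological follow-up.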

The structure of the article is as follows. The “Introduction” section presents the feature selection problems and how previous work tried to solve them. The “Results” section presents the results of the hybrid algorithm and the comparison with other studies using the same datasets. The “Discussion” section summarizes and discusses the application of the hybrid algorithm. The “Conclusions” section presents the main idea and the importance of the proposed methods. The “Method” section presents the hybrid algorithm proposed to solve these problems.

Results

The proposed hybrid methods comprise two stages: feature selection and model performance evaluation. They are applied to the proposed datasets to select the effective cancer genes, reduce over-fitting, and improve classification accuracy. The selected features are fed to several classifiers using 10-fold cross-validation. The classifiers are LR, support vector machine (SVM), RF and bagging (Bagg). The proposed method is compared with individual algorithms such as RFE and RF and with previous work.

Performance metrics

Performance evaluation is a very important step in machine learning. Selecting the most relevant features increases the classification accuracy and decreases the classification error. We proposed a hybrid method to obtain accurate classification and thereby address the previous drawbacks. The proposed methods are compared with the individual algorithms RFE and RFS using the following metrics:

  • Size of the selected feature set: the number of selected features.

  • Processing time: the time of the fitting process in seconds.

  • Performance accuracy: the percentage of samples that are correctly classified by a classifier.

  • Performance evaluation: precision, F1-score, recall, variance, receiver operating characteristic (ROC) area, and area under the curve (AUC) [8, 12]. The ROC curve measures classification performance by plotting the relationship between the true positive (TP) and false positive (FP) rates.

  • The following formulas are applied to evaluate the model performance using ensemble and regularization classifiers with 10-fold cross-validation. Table 1 presents the meanings of the symbols used in the proposed methods. The formulas are as follows:

Table 1 The meanings of the symbol
$$\mathrm{Precision}\ \left(\mathrm{PPV}\right)=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(1)
$$\mathrm{Recall}\ \left(\mathrm{Sensitivity}\right)=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(2)
$$\mathrm{F}1\text-\mathrm{Score}=\frac{2\ast \mathrm{Precision}\ast \mathrm{Recall}\ }{\mathrm{Precision}+\mathrm{Recall}\ }$$
(3)
$$\mathrm{ACC}\ \left(\mathrm{Accuracy}\right)=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FN}+\mathrm{FP}}$$
(4)
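Equations (1) to (4) can be checked with a short worked example. The function below is a direct transcription of the four formulas; the confusion-matrix counts are made-up numbers for illustration, not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                           # Eq. (1)
    recall = tp / (tp + fn)                              # Eq. (2)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (3)
    accuracy = (tp + tn) / (tp + tn + fn + fp)           # Eq. (4)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```

With these counts, precision is 40/50 = 0.8, recall is 40/45 ≈ 0.889, F1 is 80/95 ≈ 0.842, and accuracy is 85/100 = 0.85. The sketch does not guard against zero denominators, which a production version would need.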

Parameter setting

The experiments were run in Python on a PC running Windows 10 with a 1.80 GHz CPU and 8 GB of memory. All parameter values were determined based on domain-specific knowledge or trial and error. The parameter settings for all proposed methods are given in Table 2, with a brief description of each parameter.

Table 2 The meaning of parameter setting

Numerical results and discussion

The fundamental goal of the proposed methods is to enhance the performance of RFE so as to reach the optimal feature subset, i.e., the features (genes) most associated with cancers. Another goal is to reduce over-fitting between training and testing data. The proposed method was compared with the original algorithms RFE and RFS. Table 3 presents the performance of the individual algorithms RFE and RF using the LR classifier with 10-fold stratified cross-validation before applying the proposed feature selection methods. Stratified cross-validation splits the data into folds such that the ratio between label classes in each fold is the same as in the full data.
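The stratification idea can be sketched in a few lines of plain Python. This is an illustrative simplification, not the routine used in the experiments: it deals the samples of each class round-robin across the folds and does no shuffling. The labels are toy values.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold so class ratios stay roughly equal."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the samples of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds

# Toy labels with a 2:1 class ratio (hypothetical).
labels = ["tumor"] * 6 + ["normal"] * 3
folds = stratified_folds(labels, n_folds=3)
fold_class_counts = [Counter(labels[i] for i in fold) for fold in folds]
```

Here every fold ends up with two "tumor" samples and one "normal" sample, matching the 2:1 ratio of the full data. Without stratification, a fold could by chance contain no samples of a minority class, which would distort per-fold accuracy.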

In Table 3, the RFE algorithm spent more time on feature selection with high-dimensional datasets. Therefore, it did not achieve good classification accuracy. For the Parkinson’s disease dataset, the classification accuracy was low before applying the proposed methods. For the BreastEw dataset, both RFE and RFS achieved the best results before applying the proposed methods. Still, we need to reach optimal classification accuracy with the smallest feature subset. The terms Algo., over-fitting Diff., Pre, Rec, NO.F, F-Time, C-Time, and var. refer to the algorithm, the percentage difference between training and testing accuracy, precision, recall, number of selected features, fitting time of feature selection, classification fitting time, and variance, respectively.

Table 3 Performance of original algorithms before applying the proposed methods

From the previous results, we noticed that the single algorithms suffered from problems in the fitting time of feature selection (F-Time), classification fitting time (C-Time), number of selected features, over-fitting, and classification accuracy. Therefore, we proposed methods to fix these problems of the original algorithms when run individually and to obtain the most effective cancer genes. In addition, because the single algorithms did not give the best results, we applied a hybrid method using the wrapper and embedded procedures.

Table 4 presents the average results of the proposed method OFBS-RFS-RFE using stratified cross-validation with the classifiers LR, SVM, RF and Bagg. The proposed methods are run 20 times to obtain the best results. PFBS has three positions of the first bootstrap step: OFBS, IFBS and O/IFBS. The following table presents OFBS-RFS-RFE after 20 runs.

Table 4 Average results after applying OFBS-RFS-RFE after 20 runs

As Table 4 shows, OFBS-RFS-RFE enhanced the performance of the RFE algorithm. For the RNA gene dataset, over-fitting was reduced with all the classifiers, so the accuracy difference between the training and testing datasets was smaller than with the single algorithm. The LR classifier achieved the best classification accuracy, 99.981%, while the SVM classifier gave the best variance, 0.0000002. For the DNA CNV dataset, the difference between training and testing became 2.442% and 2.763% using the LR and Bagg classifiers, respectively, and the accuracy increased to 91.020% and 92.762%, respectively, using the same classifiers. In addition, the variance was reduced to 0.00028 and 0.00023, respectively, using the same classifiers. OFBS-RFS-RFE reduces over-fitting and variance and minimizes the feature-selection fitting time and the number of features. For the Parkinson’s disease dataset, the classification accuracy, precision, recall, F1-score, AUC and variance improved to 95.000%, 0.945, 0.906, 0.922, 0.985 and 0.00062, respectively, using the RF classifier. On average, only 113.85 features were sufficient for the classification step, with a computational time of 1.134 s. In addition, for the dermatology erythemato-squamous diseases dataset, the RF classifier gave the best classification accuracy, precision, recall, F1-score, AUC and variance: 100.000%, 1.000, 1.000, 1.000, 1.000 and 0.0, respectively. On the other hand, OFBS-RFS-RFE on the BreastEw dataset achieved the best computational time with LR and SVM compared with the other classifiers. RF gave the best over-fitting percentage, precision, recall, F1-score, AUC, variance, and accuracy: 2.00%, 0.983, 0.979, 0.982, 0.997, 0.000302 and 98%, respectively.
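Fractional feature counts such as 113.85 arise because the tables report averages over 20 runs. The sketch below shows one way such per-run results might be aggregated. The run values are invented, the dictionary keys are hypothetical names, and the use of population variance is an assumption, since this section does not state how the reported variance is computed.

```python
from statistics import mean, pvariance

def summarize_runs(runs):
    """Average per-run results; each run is a dict with the keys used below."""
    test_acc = [r["test_acc"] for r in runs]
    return {
        "mean_features": mean(r["n_features"] for r in runs),
        "mean_test_acc": mean(test_acc),
        # Assumption: population variance of the test accuracy across runs.
        "variance": pvariance(test_acc),
        # Over-fitting difference: mean gap between training and testing accuracy.
        "overfit_diff": mean(r["train_acc"] - r["test_acc"] for r in runs),
    }

# Four hypothetical runs (the paper uses 20); accuracies are percentages.
runs = [
    {"n_features": 112, "train_acc": 98.0, "test_acc": 95.0},
    {"n_features": 115, "train_acc": 97.0, "test_acc": 94.0},
    {"n_features": 114, "train_acc": 99.0, "test_acc": 96.0},
    {"n_features": 113, "train_acc": 98.0, "test_acc": 95.0},
]
summary = summarize_runs(runs)
```

In this toy example the mean feature count is 113.5 even though every individual run selects a whole number of features.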

Table 5 presents the average results of the proposed method PFBS-RFS-RFE using IFBS after 20 runs. Different bootstrap positions lead to different results. IFBS applies the bootstrap step inside the RFS algorithm for feature selection.

Table 5 Average results after applying IFBS-RFS-RFE after 20 runs

In Table 5, the SVM classifier achieved the best classification accuracy and variance, with 99.988% and 0.0000002, respectively. Although the inner position gave the best results using the RNA gene dataset, it did not give the best results for the other datasets.

Table 6 presents the average results of PFBS-RFS-RFE using O/IFBS after 20 runs. In this position, the FBS is applied both before feature selection and within the feature selection algorithm.
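The three bootstrap positions can be sketched with scikit-learn as follows. This is an interpretation of the description in the text, not the authors' code. The function name `rfs_rfe`, the parameters `n_rf_keep` and `n_final`, the top-k cut on forest importances, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def rfs_rfe(X, y, position, n_rf_keep=20, n_final=5, seed=0):
    """Hybrid RFS-RFE sketch. `position` is "outer" (OFBS), "inner" (IFBS) or "both" (O/IFBS)."""
    if position not in {"outer", "inner", "both"}:
        raise ValueError("position must be 'outer', 'inner' or 'both'")
    if position in {"outer", "both"}:
        # Outer bootstrap: resample the rows with replacement before selection.
        X, y = resample(X, y, replace=True, random_state=seed)
    # Inner bootstrap: each tree draws its own bootstrap sample; with
    # bootstrap=False every tree is built on the whole (possibly resampled) data.
    forest = RandomForestClassifier(
        n_estimators=100, bootstrap=position in {"inner", "both"}, random_state=seed
    )
    forest.fit(X, y)
    # Assumption: keep the n_rf_keep features with the highest forest importance.
    top = np.argsort(forest.feature_importances_)[::-1][:n_rf_keep]
    # RFE with logistic regression as the estimator, run on the reduced feature set.
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n_final)
    rfe.fit(X[:, top], y)
    return np.sort(top[rfe.support_])

# Synthetic stand-in for a gene-expression matrix (not one of the paper's datasets).
X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)
selected = {p: rfs_rfe(X, y, p) for p in ("outer", "inner", "both")}
```

Running RFE only on the forest's shortlist rather than on all features is one plausible reading of how the hybrid reduces RFE's fitting time; the actual settings used in the paper are those listed in Table 2.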

Table 6 Average results after applying O/IFBS-RFS-RFE after 20 runs

In Table 6, for the RNA gene dataset, the accuracy and variance reached 99.994% and 0.0000004, respectively, using the LR classifier. The Bagg classifier gave the best accuracy and variance for the DNA CNV dataset, 92.834% and 0.00027, respectively. In addition, the RF classifier gave the best accuracy and variance for the dermatology erythemato-squamous diseases dataset, 100% and 0.0, respectively. At the same time, O/IFBS-RFS-RFE did not give good results for the other datasets.

Fig. 1 illustrates the classification accuracy of the proposed methods on all datasets. The RNA gene dataset achieved the best results with O/IFBS using the LR classifier, while the DNA CNV dataset achieved the best results with O/IFBS using the Bagg classifier. In addition, the Parkinson’s disease dataset achieved the best results with OFBS using the LR classifier. The dermatology erythemato-squamous diseases and breast datasets achieved the best results using the RF classifier with both OFBS and O/IFBS.

Fig. 1 Comparison between proposed methods on all datasets using classification accuracy

Fig. 2 shows the number of features selected by the proposed methods on all datasets. The smallest number of features was obtained by O/IFBS for the RNA gene, Parkinson’s disease, dermatology erythemato-squamous diseases and breast datasets. On the other hand, IFBS achieved the smallest number of features on the DNA CNV dataset.

Fig. 2 Number of the selected features using all datasets

Fig. 3 illustrates the variance of the proposed methods. The RNA gene dataset gave the best variance results with the LR and SVM classifiers at all bootstrap positions. On the other hand, the DNA CNV dataset achieved the best variance using the Bagg classifier with OFBS. In addition, the Parkinson’s disease dataset achieved the best variance using the SVM classifier with OFBS. OFBS and O/IFBS achieved the best variance using the RF and Bagg classifiers for the dermatology erythemato-squamous diseases dataset. For the Breast dataset, the RF classifier gave the best results with OFBS.

Fig. 3 Variance of the proposed methods using all bootstrap positions

Comparison with other studies

The results before and after PFBS-RFS-RFE are compared. In addition, these results are compared with previous work using the same datasets. Table 7 shows the comparison before and after applying PFBS-RFS-RFE after 20 runs. The proposed methods improved the results and addressed feature selection problems in high dimensions. Table 8 presents the results of previous studies using the same datasets.

Table 7 The comparison between results before and after PFBS-RFS-RFE
Table 8 Achievement of accuracy in different research for cancer classification using the same datasets [7,8,9, 12, 36, 37]

The proposed methods were compared with the filter methods MIFS, IGF and mRMR. Tables 9, 10 and 11 show the results of MIFS, IGF and mRMR for all datasets. For the MIFS method, the LR classifier gave the best accuracy for the RNA gene and DNA CNV datasets, while the RF classifier gave the best accuracy for the Parkinson’s disease and BreastEW datasets. In addition, the SVM classifier gave the best results for the dermatology erythemato-squamous diseases dataset. For the IGF method, the LR classifier gave the best accuracy for the RNA gene dataset. The SVM classifier gave the best results for the DNA CNV and dermatology erythemato-squamous diseases datasets, while the RF classifier gave the best accuracy for the Parkinson’s disease and BreastEW datasets. Furthermore, mRMR achieved the best results for the RNA gene dataset using the LR classifier, while the SVM classifier gave the best results for the DNA CNV dataset. In addition, the RF classifier achieved the best results for the dermatology erythemato-squamous diseases, Parkinson’s disease and BreastEW datasets. Although the filter methods improved the results, they did not give better results than PFBS-RFS-RFE.

Table 9 The proposed methods compared with the MIFS method
Table 10 The proposed methods compared with the IGF method
Table 11 The proposed methods compared with the mRMR method

The proposed methods were also compared with the other filter methods cited in the introduction: CfsSubsetEval, ReliefAttributeEval, OneRAttributeEval, ConsistencySubsetEval and PCA. Tables 12, 13, 14, 15 and 16 show the results of these filter methods. The ReliefAttributeEval method achieved the best results for the RNA gene and BreastEW datasets, while the ConsistencySubsetEval method gave the best results for the DNA CNV dataset. In addition, the CfsSubsetEval method gave the best results for the Parkinson’s disease dataset, while the PCA method gave the best results for the dermatology erythemato-squamous diseases dataset. Although these filter methods improved the results, none of them outperformed PFBS-RFS-RFE.

Table 12 The proposed methods compared with the CfsSubsetEval method
Table 13 The proposed methods compared with the ReliefAttributeEval method
Table 14 The proposed methods compared with the OneRAttributeEval method
Table 15 The proposed methods compared with the ConsistencySubsetEval method
Table 16 The proposed methods compared with the PCA method

Table 17 shows the comparison between the proposed methods and the MIFS, CBF and FCBF methods cited in the introduction. The CBF method gave the best results for the RNA gene dataset, while the FCBF method gave the best results for the DNA CNV, Parkinson’s disease and BreastEW datasets. In addition, MIFS gave the best results for the dermatology erythemato-squamous diseases dataset. None of these methods outperformed PFBS-RFS-RFE.

Table 17 The proposed methods compared with the MIFS, CBF and FCBF methods

Table 18 shows the proposed methods compared with the Chi-square method cited in the introduction, using the SVM and RF classifiers. The SVM classifier gave the best results for the RNA gene and DNA CNV datasets, while the RF classifier gave the best results for the Parkinson’s disease, BreastEW and dermatology erythemato-squamous diseases datasets. The Chi-square method did not outperform PFBS-RFS-RFE.

Table 18 The proposed methods compared with the Chi-square method

Table 19 shows the proposed methods compared with the IGF, Chi-square and Bat algorithm methods cited in the introduction. The Bat algorithm gave the best results for the RNA gene, DNA CNV and BreastEW datasets, while the Chi-square method gave the best results for the Parkinson’s disease dataset. In addition, the IGF method gave the best results for the dermatology erythemato-squamous diseases dataset. None of these methods outperformed PFBS-RFS-RFE.

Table 19 The proposed methods compared with the IGF, Chi-square and Bat algorithm methods

Table 20 shows the comparison between PFBS-RFS-RFE and the other filter methods. PFBS-RFS-RFE gave the best results among them.

Table 20 The comparison between the PFBS-RFS-RFE and other filter methods

The proposed methods were compared with some hybrid recursive feature elimination methods cited in the introduction. Table 21 shows the results of the hybrid recursive feature elimination method for all datasets using RFE and LR. This hybrid method gave its best results for the RNA gene, dermatology erythemato-squamous diseases and BreastEW datasets, but it did not outperform PFBS-RFS-RFE.

Table 21 The proposed methods compared with the hybrid of MIFS and RFE

The proposed method was also compared with a hybrid method that combines GA and RFE. Table 22 shows the results of this hybrid method. It gave its best results for the RNA gene and BreastEW datasets, but it did not outperform PFBS-RFS-RFE.

Table 22 The proposed methods compared with the hybrid of GA and RFE

In addition, the proposed method was compared with another hybrid method that combines ridge regression and RFE. Table 23 shows the results of this hybrid method. It gave its best results for the RNA gene, dermatology erythemato-squamous diseases and BreastEW datasets, but it did not outperform PFBS-RFS-RFE.

Table 23 The proposed methods compared with the hybrid of Ridge regression and RFE

Table 24 shows the comparison between PFBS-RFS-RFE and the other RFE hybrid methods. PFBS-RFS-RFE gave the best results among them.

Table 24 The comparison between the PFBS-RFS-RFE and other RFE hybrid methods

After all runs, the selected feature subsets are intersected to identify the genes (features) most consistently associated with the development of human cancers. Table 25 presents the features that remain after the intersection, which play an important role in identifying the genes and features involved in the development of human cancers.
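The intersection step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `intersect_selected` is an assumption introduced here, and the feature names in the example are placeholders invented only to show the mechanics, not results from the study.

```python
from functools import reduce


def intersect_selected(runs):
    """Return the features selected in every run, sorted by name."""
    if not runs:
        return []
    # Intersect the per-run selections pairwise, starting from the first run.
    common = reduce(set.intersection, (set(r) for r in runs))
    return sorted(common)


# Hypothetical selections from three runs (placeholder names only).
runs = [
    ["gene_a", "gene_b", "gene_c"],
    ["gene_b", "gene_c", "gene_d"],
    ["gene_c", "gene_b", "gene_e"],
]
stable = intersect_selected(runs)
print(stable)  # ['gene_b', 'gene_c']
```

Only features that survive every run are kept, so the result can only shrink as more runs are added; a looser rule, such as keeping features seen in most runs, would be a different design choice.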

Table 25 The selected features after intersection [38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58]

For the DNA CNV dataset, PHACTR4 was associated with prostate, breast and colon cancer [59], while RPA2 was associated with breast cancer [41]. The proposed method thus achieved the best results and identified genes implicated in the development of human cancer. For the dermatology erythemato-squamous diseases dataset, the age, itching and spongiosis features were associated with psoriasis disease [56, 58].

Discussion

The proposed PFBS-RFS-RFE was applied to classify different human cancers using large, medium and small datasets together with other medical datasets, five datasets in total. PFBS-RFS-RFE was proposed to address drawbacks that include over-fitting, long computation time, high dimensionality, variance and low classification accuracy. PFBS was applied at three positions (outer, inner and outer/inner) to obtain different results. After PFBS, the RFS algorithm was applied to select the most relevant features and to reduce the time consumed by the RFE algorithm. The RFE algorithm was then used to obtain the final subset of relevant features with higher classification accuracy.

The OFBS-RFS-RFE method achieved the best results across all datasets. On the dermatology erythemato-squamous diseases dataset, the RF classifier achieved the best classification accuracy, 100%, with a variance of 0.0. The number of features and the time were reduced to 16.000 and 0.500 s, respectively. On the RNA gene dataset, the LR classifier achieved the best classification accuracy, 99.981%, while the SVM classifier gave the best variance, 0.0000002. The number of features and the time were reduced to 142.500 and 0.192 s, respectively. On the DNA CNV dataset, the difference between training and testing was reduced with the LR and Bagg classifiers, and accuracy increased to 91.020% and 92.762%, respectively. In addition, OFBS-RFS-RFE reduced the variance between features to 0.00028 and 0.00023, respectively, with the same classifiers. The number of features and the time were reduced to 675 and 2.147 s, respectively.

On the Parkinson’s disease dataset, the classification accuracy and variance improved to 95.000% and 0.00062, respectively, with the RF classifier. The features were reduced to 113.85, which is sufficient for the classification step, with a computational time of 1.134 s. On the BreastEW dataset, the LR and SVM classifiers gave the best computational time compared with the other classifiers. The RF classifier gave the best variance and accuracy, 0.000302 and 98%, respectively. The features and time were reduced to 0.070 and 0.070 s, respectively.

The IFBS-RFS-RFE method did not achieve the best results on all datasets. On the RNA gene dataset, the SVM classifier achieved the best classification accuracy and variance, 99.988% and 0.0000002, respectively. The features and time were reduced to 125.25 features and 0.153 s, respectively. On the other datasets it did not give good results.

The O/IFBS-RFS-RFE method achieved the best results for the dermatology erythemato-squamous diseases dataset. The RF and Bagg classifiers gave the best results with 10 features. The classification accuracy, variance and time improved to 100%, 0.0 and 0.500 s, respectively. In addition, O/IFBS-RFS-RFE achieved the best results on the high-dimensional RNA gene dataset, where the LR classifier improved the accuracy and variance to 99.994% and 0.0000004, respectively. On the DNA CNV dataset, the Bagg classifier gave the best accuracy and variance, 92.834% and 0.00027, respectively. However, the outer/inner position did not provide good results for the other datasets.

In future work, incremental feature selection (IFS) will be applied with PFBS to different datasets. IFS will select the most relevant subset of features, reducing the time required when all features are used and overcoming this feature selection drawback.

Conclusions

In this study, new hybrid methods are proposed to enhance cancer classification performance on datasets of different sizes. PFBS, based on the EDF equation, enhanced the performance of RFS and RFE. Several bootstrap positions were applied to reduce over-fitting and to address feature selection problems. The proposed methods achieved high results on datasets of different sizes and, when compared with previous work, also gave high results.

Method

Dataset description

We used five healthcare datasets in the experiments. The DNA CNV dataset is used in [7, 8, 12] and downloaded from the cBioPortal for Cancer Genomics [59,60,61] to classify different types of human cancers. The other four datasets are downloaded from the UCI machine learning repository [62] and used in [9, 23]. A brief description of each adopted dataset is presented in Table 26.

Table 26 Datasets Description

The proposed hybrid feature selection methods

The main motivation of the proposed methods is to select the most important and relevant features from the original feature set. This step is vital and plays a significant role in obtaining good classification results. Non-influencing features waste time and lead to problems that include poor classification accuracy, over-fitting and excessive feature size. Wrapper methods select features by using a machine learning model to search for an optimal subset, but they take more time and are prone to over-fitting. Embedded methods, on the other hand, select features during the model-building process itself, which reduces over-fitting and the variance between features. Based on the advantages of these two approaches, we propose hybrid methods for feature selection to obtain the most relevant feature subset. The proposed methods are shown in Fig. 4. A resampling method is applied at different positions to minimize over-fitting and maximize classification accuracy. After the resampling step, the most important features are selected using the RFS algorithm. The hybrid of resampling and the RF algorithm is applied to address several problems: (1) the time consumed by the RFE algorithm, (2) over-fitting, (3) selection of the most relevant features, and (4) classification accuracy. The wrapper method is then applied to select the most important features, thereby reducing the dimensionality of the datasets and maximizing classification accuracy. RFE, with LR as the estimator, is applied to the features retained by RFS to achieve these goals.

Fig. 4
figure 4

Hybrid proposed methods for feature selection

First bootstrap step as a resampling method

Many high-dimensional datasets suffer from over-fitting and low classification accuracy. We apply the FBS step as a resampling method to avoid these problems. The bootstrap samples are drawn with replacement and have the same size as the original data. The original dataset X = (X1, X2, X3, ..., XO) has O observations and a distribution function called the empirical distribution function (EDF). The bootstrap sample is denoted X* = (X*1, X*2, X*3, ..., X*O). The EDF is defined as follows [63]:

$$\hat{F}_O(t)=\frac{1}{O}\sum_{i=1}^{O}I\left(X_i\le t\right)$$
(5)

where I(·) denotes the indicator function. The bootstrap resampling method is applied at several positions to achieve the desired task. In the first position, called OFBS, the bootstrap is applied before the essential features are selected: the EDF is used as a resampling method before feature selection. Because a single position may not give the best results, other positions are also applied. IFBS is applied during feature selection, and O/IFBS is applied both before and during feature selection. All bootstrap positions are intended to reduce over-fitting and improve classification accuracy, so that the most relevant features are selected.
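The EDF and the bootstrap draw can be illustrated with a short Python sketch that uses only the standard library. It is an illustration under stated assumptions, not the authors' code: the names `edf`, `bootstrap_sample` and `POSITIONS` are introduced here, and the `POSITIONS` mapping is simply one way to record where each variant resamples (before selection, during selection, or both).

```python
import random


def edf(sample, t):
    """Empirical distribution function: share of observations <= t."""
    return sum(1 for x in sample if x <= t) / len(sample)


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) observations with replacement, keeping rows and labels paired."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Where each variant applies the bootstrap (assumed encoding of the three positions).
POSITIONS = {
    "OFBS": {"outer": True, "inner": False},    # resample before feature selection
    "IFBS": {"outer": False, "inner": True},    # resample while the forest is built
    "O/IFBS": {"outer": True, "inner": True},   # both
}

x = [2.0, 3.5, 1.0, 4.0]
print(edf(x, 3.0))  # 0.5, since 2 of the 4 values are <= 3.0

rng = random.Random(0)  # fixed seed so the draw is repeatable
rows = [[0.1], [0.2], [0.3]]
labels = ["a", "b", "c"]
boot_rows, boot_labels = bootstrap_sample(rows, labels, rng)
print(len(boot_rows))  # 3, the same size as the original data
```

Drawing uniformly with replacement from the observed rows is the same as sampling from the EDF, which places mass 1/O on each observation.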

Feature selection using random forest (RFS)

A random forest algorithm is applied for feature selection to improve the performance of the classifiers and to reduce the over-fitting and computation time that are disadvantages of the RFE algorithm. It is an embedded feature selection method that interacts directly with the classifier and reduces the time complexity found in the wrapper method. The RFS algorithm can identify the importance of each feature. With IFBS, the training samples for the trees are created by bootstrap; with OFBS, the whole dataset is used to create the samples, in order to reduce over-fitting and improve classification accuracy. The trees are constructed with a specific size: M trees are built from the dataset, and the construction is repeated B times. At each node, a small subset of features F is considered, and the best feature in F is chosen by its Gini importance score. The features are then sorted by score from smallest to largest, and those below the threshold are eliminated.
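The importance ranking can be sketched in plain Python as follows. To keep the example short, each "tree" is a single-split stump rather than a full decision tree, so this is a simplified stand-in for a random forest and not the authors' implementation. The names `rfs_importances` and `select_features`, the parameters `n_trees`, `subset_size` and `threshold`, and the toy data are all assumptions introduced here. The `inner_bootstrap` flag mirrors the distinction between building each tree on a bootstrap sample and building it on the whole dataset.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_gini_decrease(rows, labels, feature):
    """Largest impurity decrease from one threshold split on `feature`."""
    parent, n = gini(labels), len(labels)
    values = sorted({r[feature] for r in rows})
    best = 0.0
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [y for r, y in zip(rows, labels) if r[feature] <= cut]
        right = [y for r, y in zip(rows, labels) if r[feature] > cut]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def rfs_importances(rows, labels, n_trees=50, subset_size=2,
                    inner_bootstrap=False, seed=0):
    """Accumulate Gini decreases over stumps and normalise them to sum to 1."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    scores = [0.0] * n_features
    for _ in range(n_trees):
        if inner_bootstrap:  # resample rows with replacement for this tree
            idx = [rng.randrange(len(rows)) for _ in rows]
        else:                # use the whole dataset for this tree
            idx = range(len(rows))
        xs = [rows[i] for i in idx]
        ys = [labels[i] for i in idx]
        # Random feature subset F considered at the (single) node.
        candidates = rng.sample(range(n_features), subset_size)
        gains = {f: best_gini_decrease(xs, ys, f) for f in candidates}
        winner = max(gains, key=gains.get)
        scores[winner] += gains[winner]
    total = sum(scores) or 1.0
    return [s / total for s in scores]


def select_features(importances, threshold):
    """Keep the indices of features whose importance reaches the threshold."""
    return [f for f, s in enumerate(importances) if s >= threshold]


# Toy data: feature 0 separates the classes; features 1 and 2 are noise.
rows = [[0.1, 5, 1], [0.2, 3, 0], [0.3, 4, 1],
        [0.9, 5, 0], [1.0, 3, 1], [1.1, 4, 0]]
labels = [0, 0, 0, 1, 1, 1]
importances = rfs_importances(rows, labels)
print(select_features(importances, threshold=0.5))
```

A full forest would grow deeper trees and sum the impurity decreases over every node, but the ranking and thresholding logic would stay the same.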

Recursive feature elimination (RFE)

Selecting the most significant features is the main goal before the classification step. To this end, we applied the RFE algorithm to select the most important features and thereby identify the chromosomes considered most involved in the development of human cancers. RFE is an instance of backward feature elimination. The estimator is trained on the initial set of features, and the features are sorted according to their weights. The features with the smallest weights are removed because they contribute least to the classification process. These steps are repeated until the most relevant features remain. RFE is applied with LR as the estimator. The step size is set within the RFE variant called recursive feature elimination with cross-validation (RFECV) to achieve the best results. At each step, the features are sorted according to their importance and the lowest-ranked feature is deleted. The classification accuracy improved after applying the proposed method. The proposed methods are presented in Tables 27, 28 and 29 as follows:
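The elimination loop can be sketched in plain Python with a small logistic regression fitted by gradient descent. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the hyperparameters `epochs` and `lr`, and the toy data are introduced here, the features are assumed to be on comparable scales so that weight magnitudes can be compared, and the cross-validated choice of the number of features (as in RFECV) is omitted in favour of a fixed `n_keep`.

```python
import math


def fit_logistic(rows, labels, epochs=300, lr=0.5):
    """Fit binary logistic regression by batch gradient descent; return the weights."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


def rfe(rows, labels, n_keep, step=1):
    """Refit, drop the `step` features with the smallest |weight|, and repeat."""
    active = list(range(len(rows[0])))
    while len(active) > n_keep:
        sub = [[r[j] for j in active] for r in rows]
        w = fit_logistic(sub, labels)
        # Positions within `active`, ordered from least to most important.
        order = sorted(range(len(active)), key=lambda k: abs(w[k]))
        n_drop = min(step, len(active) - n_keep)
        dropped = set(order[:n_drop])
        active = [f for k, f in enumerate(active) if k not in dropped]
    return active


# Toy data: feature 0 carries the class signal; features 1 and 2 are small noise.
rows = [[-1.0, 0.2, 0.1], [-0.8, -0.1, -0.2], [-1.2, 0.1, 0.0],
        [1.0, 0.1, -0.1], [0.9, -0.2, 0.2], [1.1, 0.0, 0.1]]
labels = [0, 0, 0, 1, 1, 1]
print(rfe(rows, labels, n_keep=1))
```

Because the estimator is refitted after every removal, the weights of the remaining features are re-estimated at each step, which is what distinguishes RFE from a single ranking pass.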

Table 27 Algorithm 1 of the first hybrid proposed method using OFBS-RFS-RFE
Table 28 Algorithm 2 of the second hybrid proposed method using IFBS-RFS-RFE
Table 29 Algorithm 3 of the third hybrid proposed method using O/IFBS-RFS-RFE

Availability of data and materials

All datasets and details are available at request from the corresponding author and as a supplement to this article.

Abbreviations

RFE:

Recursive feature elimination

RFS:

Random forest for selection

PFBS:

Positions first bootstrap step

PFBS-RFS-RFE:

Positions first bootstrap step random forest selection recursive feature elimination

OFBS:

Outer first bootstrap step

IFBS:

Inner first bootstrap step

O/IFBS:

Outer/Inner first bootstrap step

MIFS:

Mutual information based feature selection

IGF:

Information gain based feature selection

CNV:

Copy Number Variation

LR:

Logistic regression

SVM:

Support vector machine

PCA:

Principal component analysis

CBF:

Correlation based feature selection

FCBF:

Fast correlation based feature selection

KNN:

K-nearest neighbors

SSA:

Salp swarm algorithm

CSSA:

Constant salp swarm algorithm

PSO:

Particle swarm optimization

GA:

Genetic algorithm

LLSVM:

Linear support vector machine

GBM-RFE:

Gradient boosting machines RFE

BPSO:

Binary particle swarm optimization

mRMR:

Minimum redundancy maximum relevance

IFS:

Incremental feature selection

SFLA:

Shuffled frog leaping algorithm

EDF:

Empirical distribution function

RFECV:

Recursive feature elimination with cross-validation

ROC:

Receiver operating characteristic

PPV:

Positive predictive value

TP:

True positive

TN:

True negative

FN:

False-negative

FP:

False-positive

References

  1. Tran KA, Kondrashova O, Bradley A, Williams ED, Pearson JV, Waddell N. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Med. 2021;13(1):152. https://doi.org/10.1186/s13073-021-00968-x.

  2. Bi WL, Hosny A, Schabath MB, Giger ML, Birkbak NJ, Mehrtash A, et al. Artificial intelligence in cancer imaging: clinical challenges and applications. CA Cancer J Clin. 2019;69(2):127–57. https://doi.org/10.3322/caac.21552.

  3. Fang H, Shi K, Wang X, Zuo C, Lan X. Artificial intelligence in positron emission tomography. Front Med (Lausanne). 2022;9:848336. https://doi.org/10.3389/fmed.2022.848336 PMID: 35174194; PMCID: PMC8841845.

  4. Alfayez AA, Kunz H, Lai AG. Predicting the risk of cancer in adults using supervised machine learning: a scoping review. BMJ Open. 2021;11(9). https://doi.org/10.1136/bmjopen-2020-047755 .

  5. Liew XY, Hameed N, Clos J. A review of computer-aided expert systems for breast cancer diagnosis. Cancers (Basel). 2021;13(11):2764. https://doi.org/10.3390/cancers13112764 PMID: 34199444; PMCID: PMC8199592.

  6. Saini A, Kumar M, Bhatt S, Saini V, Malik A. Cancer causes and treatments. Int J Pharm Sci Res. 2020;11(7):3121–34. https://doi.org/10.13040/IJPSR.0975-8232.11(7).3121-34.

  7. Zhang N, Wang M, Zhang P, Huang T. Classification of cancers based on copy number variation landscapes. Biochimica et BiophysicaActa (BBA)-General Subjects. 2016;1860(11):2750–5. https://doi.org/10.1016/j.bbagen.2016.06.003.

  8. Elsadek SFA, Makhlouf MAA, El-Sayed BBST, Mohamed HNE. Hybrid feature selection using swarm and genetic optimization for DNA copy number variation. Int J Eng Res Technol. 2019;12(7):1110–6 http://www.irphouse.com.

  9. Hegazy AhE, Makhlouf MA, El-Tawel GhS. Feature selection using chaotic salp swarm algorithm for data classification. Arab J Sci Eng. 2019;44(4):3801–16. https://doi.org/10.1007/s13369-018-3680-6.

  10. Ansari G, Ahmad T, Doja MN. Hybrid filter–wrapper feature selection method for sentiment classification. Arab J Sci Eng. 2019;44:9191–920. https://doi.org/10.1007/s13369-019-04064-6.

  11. Huljanah M, Rustam Z, Utama S, Siswantining T. Feature selection using random forest classifier for predicting prostate cancer. In: IOP Conference Series Materials Science and Engineering; 2019. p. 052031. https://doi.org/10.1088/1757-899X/546/5/052031.

  12. Elsadek SFA, Makhlouf MAA, Aldeen MA. Supervised classification of cancers based on copy number variation. In: Hassanien A, Tolba M, Shaalan K, Azar A, editors. Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018.AISI 2018. Advances in Intelligent Systems and Computing. Cham: Springer; 2019. p. 198–207. https://doi.org/10.1007/978-3-319-99010-118.

  13. Nair R, Bhagat A. Feature selection method to improve the accuracy of classification algorithm. Int J Innov Technol Explor Eng (IJITEE). 2019;8:124–7. https://doi.org/10.1016/j.csda.2018.05.015.

  14. Yildirim P. Filter based feature selection methods for prediction of risks in hepatitis disease. Int J Machine Learn Comput. 2015;5:258–63. https://doi.org/10.7763/IJMLC.2015.V5.517.

  15. Alirezanejad M, Enayatifar R, Motameni H, Nematzadeh H. Heuristic filter feature selection methods for medical datasets. Genomics. 2020;112(2):1173–81. https://doi.org/10.1016/j.ygeno.2019.07.002.

  16. Kuswanto NRYH, Ohwada H. Comparison of feature selection methods to classify inhibitors in dud-e database. In: 3rd International Neural Network Society Conference on Big Data and Deep Learning, INNS BDDL 2018 - Sanur, Bali, Indonesia, vol. 144; 2018. p. 194–202. https://doi.org/10.1016/j.procs.2018.10.519.

  17. Ghasemi F, Neysiani BS, Nematbakhsh N. Feature selection in pre-diagnosis heart coronary artery disease detection: A heuristic approach for feature selection based on information gain ratio and gini index. In: 2020 6th International Conference on Web Research (ICWR); 2020. p. 27–32. https://doi.org/10.1109/ICWR49608.2020.9122285.

  18. Mahmood MR. Two feature selection methods comparison chi-square and relief-f for facial expression recognition. J Phys Conf Ser. 2021;1804(1):012056. https://doi.org/10.1088/1742-6596/1804/1/012056.

  19. Spencer R, Thabtah F, Abdelhamid N, Thompson M. Exploring feature selection and classification methods for predicting heart disease. Digital Health. 2020;6:2055207620914777. https://doi.org/10.1177/2055207620914777.

  20. Mohamed R, Yusof MM, Wahidi N. A comparative study of feature selection techniques for bat algorithm in various applications. MATEC Web of Conferences. 2018;150:06006. https://doi.org/10.1051/matecconf/201815006006.

  21. Vikas K, P. Lung cancer detection using chi-square feature selection and support vector machine algorithm. Int J Adv Trends Comput Sci Eng (IJATCSE). 2021;10(3):2050–60. https://doi.org/10.30534/ijatcse/2021/80103202.

  22. Hegazy AhE, Makhlouf MA, El-Tawel GhS. Improved salp swarm algorithm for feature selection. J King Saud Univ Comput Inform Sci. 2020;10:1217. https://doi.org/10.1016/j.jksuci.2018.06.003.

  23. Li Z, Xie W, Liu T. Efficient feature selection and classification for microarray data. PLoS One. 2018;13(8):e0202167. https://doi.org/10.1371/journal.pone.0202167.

  24. Jeon H, Oh S. Hybrid-recursive feature elimination for efficient feature selection. Appl Sci. 2020;10(9). https://doi.org/10.3390/app10093211.

  25. Rani P, Chawla SK, Gujral RK. A hybrid approach for feature selection based on genetic algorithm and recursive feature elimination. Int J Inform Syst Model Design. 2021;12(2). https://doi.org/10.4018/IJISMD.2021040102.

  26. Zvarevashe K, Kadebu P, Mukwazvure A, Mukora F, Gotora TT. Majority voting ensemble learning for intrusion detection using recursive feature elimination. In: Proceedings of the 2nd African International Conference on Industrial Engineering and Operations Management Harare, Zimbabwe; 2020.

  27. Senan EM, Al-Adhaileh MH, Alsaade FW, Aldhyani THH, Alqarni AA, Alsharif N, et al. Diagnosis of chronic kidney disease using effective classification algorithms and recursive feature elimination techniques. J Healthcare Eng. 2021;2021. https://doi.org/10.1155/2021/1004767.

  28. Liu XY, Liang Y, Wang S, Yang ZY, Ye HS. A hybrid genetic algorithm with wrapper-embedded approaches for feature selection. IEEE Access. 2018;6. https://doi.org/10.1109/ACCESS.2018.2818682.

  29. Pirgazi J, Alimoradi M, Abharian TE, Olyaee MH. An efficient hybrid filter-wrapper metaheuristic-based gene selection method for high dimensional datasets. Sci Rep. 2019;9(1). https://doi.org/10.1038/s41598-019-54987-1.

  30. Mandal M, Singh PK, Ijaz MF, Shafi J, Sarkar R. A tri-stage wrapper-filter feature selection framework for disease classification. Sensors. 2021;21(16). https://doi.org/10.3390/s21165571.

  31. Venkatesh B, Anuradha J. A hybrid feature selection approach for handling a high-dimensional data. In: Innovations in Computer Science and Engineering Lecture Notes in Networks and Systems, vol. 74; 2019. p. 365–73. https://doi.org/10.1007/978-981-13-7082-342.

  32. Gakii C, Mireji PO, Rimiru R. Graph based feature selection for reduction of dimensionality in next-generation rna sequencing datasets. Algorithms. 2022;15(1):21. https://doi.org/10.3390/a15010021.

  33. Aruna KGL, Padmaja P, Jaya SG. Logistic regression and random forest-based hybrid classifier with recursive feature elimination technique for diabetes classification. Int J Adv Trends Comput Sci Eng. 2020;9(4):6796–804. https://doi.org/10.30534/ijatcse/2020/379942020.

  34. Venkatachalam K, Prabhu P, Balaji BS, Abouhawwash M, Rajadevi R. Recursive feature elimination with ridge regression (l2) machine learning hybrid feature selection algorithm for diabetic prediction using random forest classifer. Res Square. 2021;1. https://doi.org/10.21203/rs.3.rs-742641/v1.

  35. Andrews LJ. Addressing over-fitting and under-fitting in gaussian model-based clustering. Comput Stat Data Analysis. 2018;127:160–71. https://doi.org/10.1016/j.csda.2018.05.015.

  36. Garcia-Diaz P, Sanchez-Berriel I, Martinez-Rojas JA, Diez-Pascual MA. Unsupervised feature selection algorithm for multi-class cancer classification of gene expression rna-seq data. Genomics. 2020;112(2):1916–25. https://doi.org/10.1016/j.ygeno.2019.11.004.

  37. Sakar CO, Serbes G, Gunduz A, Tunc CH, Nizam H, Sakar BE, et al. A comparative analysis of speech signal processing algorithms for parkinson’s disease classification and the use of the tunable q-factor wavelet transform. Appl Soft Comput J. 2019;74:255–63. https://doi.org/10.1016/j.asoc.2018.10.022.

  38. https://www.ncbi.nlm.nih.gov/gene/4146, Accessed 10 Oct 2021.

  39. Takakura S, Kohno T, Manda R, Okamoto A, Tanaka T, Yokota J. Genetic alterations and expression of the protein phosphatase 1 genes in human cancers. Int J Oncol. 2001;18(4):817–24. https://doi.org/10.3892/ijo.18.4.817 PMID: 11251179.

  40. Beneventi G, Munita R, Ngoc PCT, Madej M, Ciesla M, Muthukumar S, et al. The small cajal body-specific rna 15 (scarna15) directs p53 and redox homeostasis via selective splicing in cancer cells. NAR Cancer. 2021;3(3):817–24. https://doi.org/10.1093/narcan/zcab026.

  41. Chen C, Juan C, Chen K, Chang Y, Lee J, Chang M. Upregulation of rpa2 promotes nf-b activation in breast cancer by relieving the antagonistic function of menin on nf-b-regulated transcription. Carcinogenesis. 2017;38(2):196–206. https://doi.org/10.1093/carcin/bgw123 PMID: 28007956.

  42. Waldbillig F, Nitschke K, Abdelhadi A, von Hardenberg J, Nuhn P, Nientiedt M, et al. Phosphodiesterase smpdl3b gene expression as independent outcome prediction marker in localized prostate cancer. Int J Mol Sci. 2020;21(12):4373. https://doi.org/10.3390/ijms21124373.

  43. https://www.proteinatlas.org/ENSG00000158156-XKR8 Accessed 10 Oct 2021.

  44. Havrysh KV, Bogdanov M, Nurgalieva AK, Kiyamova R. 381p - xkr8 is a promising potential prognostic marker in glioblastoma multiforme patients. Ann Oncol. 2019;30:128–30. https://doi.org/10.1093/annonc/mdz431.018.

  45. Cao F, Liu M, Zhang Q, Hao R. Phactr4 regulates proliferation, migration and invasion of human hepatocellular carcinoma by inhibiting il-6/stat3 pathway. Eur Rev Med Pharmacol Sci. 2016;20(16):3392–9.

  46. Qiao L, Zheng J, Tian Y, Zhang Q, Wang X, Chen JJ, et al. Regulator of chromatin condensation 1 abrogates the g1 cell cycle checkpoint via cdk1 in human papillomavirus e7-expressing epithelium and cervical cancer cells. Cell Death Dis. 2018;9(6):583. https://doi.org/10.1038/s41419-018-0584-z.

  47. Chang L, Hu Z, Zhoua Z, Zhang H. Retracted article: Snhg3 promotes proliferation and invasion by regulating the mir-101/zeb1 axis in breast cancer. RSC Adv Royal Soc Chem. 2018;8:15229–40. https://doi.org/10.1039/C8RA02090F.

  48. Mourksi N, Morin C, Fenouil T, Diaz JJ, Marcel V. Snornas offer novel insight and promising perspectives for lung cancer understanding and management. Cells. 2020;9(3):541. https://doi.org/10.3390/cells9030541.

  49. Zimta AA, Tigu AB, Braicu C, Stefan C, Ionescu C, Berindan-Neagoe I. An emerging class of long non-coding rna with oncogenic role arises from the snorna host genes. Front Oncol. 2020;10:389. https://doi.org/10.3389/fonc.2020.00389.

  50. Xu Y, Milazzo JP, Somerville TDD, Tarumoto Y, Huang YH, Ostrander EL, et al. A tfiid-saga perturbation that targets myb and suppresses acute myeloid leukemia. Cancer Cell. 2018;33(1):13–28. https://doi.org/10.1016/j.ccell.2017.12.002.

  51. Aalaei S, Shahraki H, Rowhanimanesh A, Eslami S. Feature selection using genetic algorithm for breast cancer diagnosis: experiment on three different datasets. Iran J Basic Med Sci. 2016;19(5):476–82.

  52. Celebi ME, Kingravi HA, Iyatomi H, Aslandogan YA, Stoecker WV, Moss RH. Border detection in dermoscopy images using statistical region merging. Skin Res Technol. 2008;14(3):347–53. https://doi.org/10.1111/j.1600-0846.2008.00301.x PMID: 19159382; PMCID: PMC3160669.

  53. Shrivastava KV, Londhe ND, Sonawane RS, Suri JS. Reliable and accurate psoriasis disease classification in dermatology images using comprehensive feature space in machine learning paradigm. Expert Syst Appl. 2015;42(15):6184–95. https://doi.org/10.1016/j.eswa.2015.03.014.

  54. Song J, Shea C. Benign versus malignant parakeratosis: a nuclear morphometry study. Mod Pathol. 2010;23:799–803. https://doi.org/10.1038/modpathol.2010.52.

  55. Morais KL, Miyamoto D, Maruta CW, Aoki V. Diagnostic approach of eosinophilic spongiosis. An Bras Dermatol. 2019;94(6):724–8. https://doi.org/10.1016/j.abd.2019.02.002.

  56. Sutarjono B, Lebovitch H. Psoriasiform spongiotic dermatitis. BMJ Case Reports CPl. 2019;12(3):228690. https://doi.org/10.1136/bcr-2018-228690.

  57. Song J, Xian D, Yang L, Xiong X, Lai R, Zhong J. Pruritus: Progress toward pathogenesis and treatment. BioMed Res Int. 2018;2018:9625936. https://doi.org/10.1155/2018/9625936.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  58. Queiro R, Tejon P, Alonso S, Coto P. Age at disease onset: a key factor for understanding psoriatic disease. Rheumatology. 2014;53(7):1178–85. https://doi.org/10.1093/rheumatology/ket33.

    Article  CAS  PubMed  Google Scholar 

  59. Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cbio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2012;2:401–4. https://doi.org/10.1158/2159-8290.CD-12-0095.

    Article  PubMed  Google Scholar 

  60. Ciriello G, Miller ML, Aksoy BA, Senbabaoglu Y, Schultz N, Sander C. Emerging landscape of oncogenic signatures across human cancers. Nat Genet. 2013;45:1127–33. https://doi.org/10.1038/ng.2762.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  61. Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cbioportal. Sci Signal. 2013;6(269):1. https://doi.org/10.1126/scisignal.2004088.

    Article  CAS  Google Scholar 

  62. UCI Machine Learning Repository: Data Sets. http://archive.ics.uci.edu/ml/index.php. Accessed 30 Apr 2021.

  63. Karlsson S, Lothgren M. Computationally efficient double bootstrap variance estimation. Comput Stat Data Anal. 2000;33(3):237–47. https://doi.org/10.1016/S0167-9473(99)00066-3.

    Article  Google Scholar 

Download references

Acknowledgements

Thanks to Dr. Mohamed for his help and support, and to Dr. Ghada for her support and guidance.

Funding

Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB). The research itself received no funding.

Author information

Contributions

PFBS-RFS-RFE is proposed to address problems in the feature selection and classification steps. Several bootstrap positions are applied to improve the results and to enhance RFE performance. The features selected in each of a number of runs are intersected to identify the genes associated with cancer. The author(s) read and approved the final manuscript.
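The intersection step mentioned above can be illustrated with a minimal sketch. It assumes each run yields a list of selected feature names; the function name `intersect_selected_features` and the gene labels are hypothetical and are not taken from the paper.

```python
def intersect_selected_features(runs):
    """Return the features selected in every run, sorted for a stable order.

    `runs` is a list of iterables, one per run, each holding the names of
    the features selected in that run (an assumed data layout).
    """
    if not runs:
        return []
    common = set(runs[0])
    for selected in runs[1:]:
        # Keep only features that also appear in this run.
        common &= set(selected)
    return sorted(common)


# Hypothetical gene labels, for illustration only.
runs = [
    ["GENE_A", "GENE_B", "GENE_C"],
    ["GENE_B", "GENE_C", "GENE_D"],
    ["GENE_C", "GENE_B", "GENE_E"],
]
stable_features = intersect_selected_features(runs)
print(stable_features)
```

Features that survive the intersection are those chosen consistently across runs, which is one plausible reading of how recurring genes would be identified.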

Corresponding author

Correspondence to Noura Mohammed Abdelwahed.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Abdelwahed, N.M., El-Tawel, G.S. & Makhlouf, M.A. Effective hybrid feature selection using different bootstrap enhances cancers classification performance. BioData Mining 15, 24 (2022). https://doi.org/10.1186/s13040-022-00304-y

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s13040-022-00304-y

Keywords