Adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique algorithm for tackling binary imbalanced datasets in biomedical data classification
- Jinyan Li†1,
- Simon Fong†1,
- Yunsick Sung2,
- Kyungeun Cho3,
- Raymond Wong4 and
- Kelvin K. L. Wong5,6
© The Author(s). 2016
Received: 13 May 2016
Accepted: 21 November 2016
Published: 1 December 2016
An imbalanced dataset is defined as a training dataset in which the class of interest and the uninteresting class appear in highly unequal proportions. In biomedical applications, samples from the class of interest are often rare in a population, such as medical anomalies, positive clinical tests, and particular diseases. Because the target samples in the original dataset are few in number, inducing a classification model over such training data leads to poor prediction performance due to insufficient training on the minority class.
In this paper, we use a novel class-balancing method named adaptive swarm cluster-based dynamic multi-objective synthetic minority oversampling technique (ASCB_DmSMOTE) to solve this imbalanced dataset problem, which is common in biomedical applications. The proposed method combines under-sampling and over-sampling into a swarm optimisation algorithm. It adaptively selects suitable parameters for the rebalancing algorithm to find the best solution. Compared with the other versions of the SMOTE algorithm, significant improvements, which include higher accuracy and credibility, are observed with ASCB_DmSMOTE.
Our proposed method combines two rebalancing techniques. It re-allocates the majority class into clusters and dynamically optimises the two parameters of SMOTE to synthesise a reasonable number of minority class samples for each clustered sub-imbalanced dataset. The proposed method ultimately outperforms conventional methods, attaining higher credibility together with greater accuracy of the classification model.
Keywords: Imbalanced dataset; Swarm optimisation; Under-sampling; SMOTE; Dynamic multi-objective; Classification; Biomedical data
Machine learning plays an important role in knowledge discovery and automatic recognition in biomedical applications. Specifically, classification is a machine learning technique that captures the complex relationships between the input variables and the target classes of biomedical data. Automatic pattern recognition and prediction then become possible when the learnt model is tested on unseen data. Machine learning from biomedical data encounters several difficulties, mainly because these datasets are characterised by incompleteness (missing values), incorrectness (collection error or noise in the data), inexactness (data retrieved from incorrect sources) and sparseness (too few records available). Another subtle problem that transcends the integrity of the data is an imbalanced class distribution; that is, there are too few target data that the users are interested in amongst too much ordinary data collected. For instance, some decision support systems in health care applications deal with patient data that include very few positive records in a large population, especially for new diseases. Other examples are cancer genes in microarrays, abnormal sub-sequences in biosignal patterns, tiny cysts in mammograms in the biological imaging field, the colony distribution and mutation of E. coli or yeast [4, 5], and classification in the biomedical engineering field.
The imbalanced dataset problem is known to cause pseudo-accuracy: a spuriously good prediction rate with low credibility. A classification model that is learnt from a majority of mediocre data becomes biased towards the majority class and less sensitive to recognition of the minority class samples. Testing this classifier on the same training dataset shows a superficially high prediction accuracy. However, when the model is tested with new unseen samples of the minority class, the accuracy rate plummets, which indicates that the falsely high accuracy of the training model is futile and unreliable.
The common approach to rebalancing an imbalanced dataset is simply to inflate the population of the minority class by randomly copying its data, or to shrink the majority class until the two populations match. This approach matches the class populations merely in quantity and ignores the subtle underlying mappings between the input variables and the target classes, which can be highly nonlinear. In a nutshell, adjusting the quantity of data from each class to the same level does not guarantee generation of the most effective classifier. Methods that attain a balanced dataset by changing the numbers of samples in the two classes, such as the aforementioned over-sampling and under-sampling, operate at the data level. There are also techniques at the algorithm level for overcoming the imbalance problem in classification. Cost-sensitive learning is a commonly used method in which distinct weights are assigned to the two classes to bias the classifier towards the minority class. Boosting methods [11, 12] combine many weak classifiers into a strong classifier to mitigate the imbalance problem. In other work, all combinations of class distributions were attempted with a support vector machine (SVM), using its performance as the measure.
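SMOTE, cited above, goes beyond simple copying: it synthesises new minority samples by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal sketch of that interpolation idea, not the authors' implementation; the function name and parameters are illustrative.

```python
import math
import random

def smote(minority, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between a random seed point and one of its k nearest neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        seed = rng.choice(minority)
        # k nearest neighbours of the seed (excluding itself), by Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not seed),
            key=lambda p: math.dist(seed, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(s + gap * (n - s) for s, n in zip(seed, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because each synthetic point lies on a segment between two existing minority points, the new samples stay inside the minority class region rather than duplicating it.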
In Equation (2), ω is the inertia weight; d = 1, 2, …, D; i = 1, 2, …, n; t is the current iteration; c1 and c2 are non-negative acceleration constants; r1 and r2 are random values between 0 and 1; and V_id is the velocity of particle i in dimension d.
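As a concrete illustration, the standard PSO velocity and position update that these symbols describe can be sketched as follows; the coefficient values w = 0.7 and c1 = c2 = 1.5 are illustrative assumptions, not the paper's settings.

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One update for a single particle:
    v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x), then x <- x + v."""
    rng = rng or random.Random(1)
    new_x, new_v = [], []
    for d in range(len(x)):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        vd = w * v[d] + c1 * r1 * (pbest[d] - x[d]) + c2 * r2 * (gbest[d] - x[d])
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v

x, v = [0.0, 0.0], [0.1, -0.1]
x2, v2 = pso_step(x, v, pbest=[1.0, 1.0], gbest=[2.0, 2.0])
```

The two attraction terms pull the particle towards its personal best and the global best, while the inertia term preserves part of its previous velocity.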
Our proposed approach introduces under-sampling and ensemble techniques to cluster the majority class samples, in a controlled way, into several sub-majority class datasets; each is then combined with the original minority class dataset to generate a corresponding sub-dataset. Each imbalanced sub-dataset then uses PSO to determine suitable SMOTE parameters for the over-sampling operation, and the results are finally averaged. In addition to the accuracy of the classification model, the Kappa value is a second objective used to assure robustness and credibility in our experiment; the search performed by our approach is therefore a dynamic multi-objective optimisation problem. Compared with other methods, the proposed method combines several earlier techniques and achieves a marked rise in classification credibility while maintaining high accuracy.
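The flow just described, clustering the majority class, pairing each cluster with the whole minority set, over-sampling where needed, and averaging the scores, can be sketched at a high level. The helper functions below are hypothetical stand-ins, not the paper's implementation.

```python
def rebalance_and_score(majority, minority, cluster_fn, oversample_fn, score_fn):
    """Cluster the majority class, combine each cluster with the full
    minority set, over-sample the minority side when it is outnumbered,
    and average the per-sub-dataset scores."""
    scores = []
    for sub_majority in cluster_fn(majority):
        if len(minority) >= len(sub_majority):
            balanced = sub_majority + minority          # minority already dominates
        else:
            balanced = sub_majority + oversample_fn(minority, len(sub_majority))
        scores.append(score_fn(balanced))
    return sum(scores) / len(scores)

# Toy usage with stand-in functions:
split_in_two = lambda m: [m[:len(m) // 2], m[len(m) // 2:]]
repeat_to_n = lambda mino, n: (mino * (n // len(mino) + 1))[:n]
avg = rebalance_and_score(list(range(6)), [100, 101],
                          split_in_two, repeat_to_n, score_fn=len)
```

In the actual method, `cluster_fn` would be the PSO-optimised k-means step, `oversample_fn` the PSO-tuned SMOTE, and `score_fn` the neural network's evaluation.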
In classification, especially classification of a flawed dataset, accuracy alone is not a persuasive indicator; even when it is high, it can lead to misleading judgements and tests. Supplementary metrics used to measure and compare classification models on imbalanced datasets are the area under the receiver operating characteristic curve, the F-measure (abbreviated F-1) and the G-mean. In this paper, we use the F-measure and G-mean as reference metrics. The Kappa statistic is another favourable assessment index that effectively estimates the credibility of a classification model. In imbalanced dataset classification, a low Kappa value often accompanies a high accuracy, because most classification algorithms neglect the minority class samples and misclassify them into the majority class. The target class commonly accounts for a very small percentage of the data, so the misclassified minority class samples produce only a low error rate. As a result, the trained model faces a serious crisis of confidence when it meets many target class samples in a testing dataset; the low Kappa statistic, by contrast, directly exposes the classification's lack of credibility.
Note that TP, TN, FP and FN respectively represent true positives, true negatives, false positives and false negatives; P stands for positive and N for negative. Po and Pc are the percentage of observed agreement and the chance agreement, respectively. A neural network is used to estimate and verify the fitness at each iteration of the PSO. Figure 3 presents a snapshot of the fluctuation patterns of accuracy and Kappa as the confusion matrix of an imbalanced classification model is gradually transformed (from TP = 0, TN = 0, FP = 100, FN = 5 to TP = 100, TN = 5, FP = 0, FN = 0), with G-mean and F-measure as auxiliary metrics. In this example, there are 100 majority class samples and 5 minority class samples. At the 606th iteration, accuracy and Kappa have both reached very high values of approximately 1. Since the two objectives do not oppose each other, a special type of optimisation called the non-inferior set tactic is adopted here and customised for this specific rebalancing task. The figure also shows that Kappa is more sensitive than the commonly used G-mean and F-measure in judging the bias of an imbalanced classification model from its confusion matrix.
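Using the Po and Pc definitions above, Cohen's Kappa can be computed directly from a binary confusion matrix. The sketch below reuses the 100-majority/5-minority example; the "biased" case assumes a model that predicts only the majority class, which is an illustrative scenario, not a figure from the paper.

```python
def kappa_from_confusion(tp, tn, fp, fn):
    """Cohen's Kappa from a binary confusion matrix:
    Po is the observed agreement, Pc the chance agreement."""
    n = tp + tn + fp + fn
    po = (tp + tn) / n
    pc = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (po - pc) / (1 - pc)

# A model that predicts everything as the majority class:
# accuracy is 100/105 ~ 0.95, yet Kappa collapses to 0.
biased = kappa_from_confusion(tp=0, tn=100, fp=0, fn=5)

# A model that recovers both classes perfectly: Kappa reaches 1.
good = kappa_from_confusion(tp=5, tn=100, fp=0, fn=0)
```

This contrast is exactly the pseudo-accuracy effect described in the text: high accuracy with near-zero credibility.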
The classification results are evaluated on separate training and testing partitions. We perform a tenfold cross-validation [25, 26] to test the performance of each dataset's classification model: the dataset is randomly divided into ten equal parts, and each part takes a turn as the testing dataset with the other nine parts as the training dataset, over ten repeated classifications. The Kappa, accuracy, G-mean and other performance figures of this cross-validation process are averaged over these ten classifications. Moreover, to keep the experiment fair, each dataset was tested with Random-SMOTE, SRA and the proposed method separately ten times, and the final results are the mean values of those runs.
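The tenfold split described above can be sketched as follows; this is a generic implementation, not the paper's code.

```python
import random

def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle sample indices and split them into n_folds roughly equal
    parts; each part serves exactly once as the test fold."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        # training set = all samples outside the current test fold
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits

splits = ten_fold_indices(105)  # e.g. the 105-sample example above
```

Averaging a metric over the ten `(train, test)` pairs gives the cross-validated estimate used in the experiments.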
In the first part, the dataset is divided into majority class data and minority class data, which are processed separately. A PSO-optimised k-means algorithm clusters the majority class into several categories, and each sub-majority class dataset is combined with the minority class dataset to establish a corresponding sub-dataset, which the second part then over-samples separately. The k-means algorithm is widely used for clustering in data mining. It randomly selects k instances as the centres of k classes and assigns the remaining instances to the closest class according to Euclidean distance; this process repeats until the sum of squared errors around the centres converges. The initially defined value of k and the class centres therefore directly affect the clustering result. PSO has strong global search ability, which helps k-means avoid local optima; because particles share information within the population at each iteration, the results converge rapidly and stably. The PSO-optimised k-means algorithm adopts the Euclidean distance as its fitness function to find appropriate class centres. Moreover, compared with previous methods [29, 30], two termination conditions assist PSO in obtaining a reasonable value of k (the number of clusters) in this step: first, the number of clusters must be greater than one; second, the minimum Kappa over the classification results of all sub-datasets must be greater than 0.2. The classifier here is still a neural network. PSO can thus adaptively find the proper class centres and value of k, overcoming the weakness of the traditional k-means algorithm.
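The basic k-means loop described in this paragraph (random initial centres, Euclidean assignment, re-centring until convergence) can be sketched as below; the PSO seeding of the centres and of k is omitted for brevity, so this shows only the traditional algorithm whose weaknesses the paper's method addresses.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: assign each point to its nearest centre by
    Euclidean distance, then move each centre to its cluster mean."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # random initial centres
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[nearest].append(p)
        new_centres = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centres[c]
            for c, cl in enumerate(clusters)
        ]
        if new_centres == centres:           # centres stopped moving
            break
        centres = new_centres
    return centres, clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centres, clusters = kmeans(pts, k=2)
```

As the text notes, the result depends heavily on the random initial centres and the chosen k, which is what motivates wrapping this loop in a PSO search.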
Furthermore, if in a new sub-dataset the original minority class samples outnumber the majority class samples, the neural network classifies this sub-dataset directly; otherwise, the sub-dataset undergoes the over-sampling operation.
A direct neural network, SMOTE + neural network, Random-SMOTE, and our foregoing SRA (PSO-SMOTE) + neural network constitute the comparison benchmark. With plain SMOTE we generate a completely balanced dataset. Random-SMOTE randomly picks parameters for SMOTE to generate a new dataset, and the average of ten Random-SMOTE runs serves as the final result for each dataset. PSO-SMOTE (SRA) has two update conditions: Kappa must be greater than a fixed Kappa threshold (0.4), and the accuracy must be greater than that of the previous position. In the experiments, the PSO population size and the maximum number of iterations are 20 and 100, respectively. Note that MATLAB (version 2014) was used to code and compile the whole program, and all experiments ran on a workstation with an E5-1650 V2 CPU @ 3.50 GHz and 32 GB of RAM.
Information of our biomedical datasets
Results and discussion
Average Kappa of different algorithms with different datasets
0.312 ± 0.48   0.670 ± 0.21   0.813 ± 0.11
0.723 ± 0.12   0.850 ± 0.05   0.848 ± 0.06
0.688 ± 0.13   0.824 ± 0.07   0.874 ± 0.05
0.578 ± 0.23   0.968 ± 0.02   0.927 ± 0.04
0.762 ± 0.12   0.906 ± 0.06   0.829 ± 0.11
0.826 ± 0.13   0.937 ± 0.04   0.966 ± 0.02
0.729 ± 0.14   0.673 ± 0.16   0.932 ± 0.02
0.660 ± 0.22   0.833 ± 0.09   0.884 ± 0.06
Average Accuracy of different algorithms with different datasets
0.686 ± 0.23   0.895 ± 0.05   0.902 ± 0.03
0.817 ± 0.18   0.959 ± 0.03   0.918 ± 0.02
0.781 ± 0.19   0.952 ± 0.03   0.927 ± 0.02
0.756 ± 0.18   0.968 ± 0.04   0.959 ± 0.02
0.852 ± 0.13   0.961 ± 0.03   0.946 ± 0.03
0.871 ± 0.11   0.958 ± 0.04   0.961 ± 0.03
0.884 ± 0.10   0.960 ± 0.03   0.956 ± 0.02
0.807 ± 0.16   0.950 ± 0.03   0.938 ± 0.02
Average G-mean value of different algorithms with different datasets
0.479 ± 0.22   0.715 ± 0.12   0.843 ± 0.05
0.768 ± 0.18   0.813 ± 0.12   0.875 ± 0.04
0.750 ± 0.15   0.832 ± 0.10   0.916 ± 0.04
0.641 ± 0.14   0.926 ± 0.07   0.928 ± 0.05
0.811 ± 0.14   0.898 ± 0.05   0.836 ± 0.6
0.802 ± 0.13   0.904 ± 0.06   0.951 ± 0.4
0.795 ± 0.12   0.746 ± 0.06   0.926 ± 0.5
0.721 ± 0.14   0.833 ± 0.10   0.896 ± 0.05
Average F-measure of different algorithms with different datasets
0.643 ± 0.27   0.642 ± 0.1    0.865 ± 0.04
0.762 ± 0.18   0.795 ± 0.09   0.874 ± 0.04
0.793 ± 0.16   0.787 ± 0.07   0.891 ± 0.05
0.809 ± 0.15   0.902 ± 0.09   0.939 ± 0.04
0.820 ± 0.15   0.863 ± 0.08   0.821 ± 0.03
0.847 ± 0.16   0.895 ± 0.08   0.952 ± 0.04
0.812 ± 0.13   0.726 ± 0.07   0.943 ± 0.03
0.784 ± 0.17   0.801 ± 0.09   0.912 ± 0.04
Average Imbalanced ratio (majority: minority) value of different algorithms with different datasets
1.2 ± 0.7:1   0.7 ± 0.4:1   0.5 ± 0.3:1
1.3 ± 0.5:1   0.6 ± 0.2:1   0.4 ± 0.3:1
1.8 ± 0.5:1   1.1 ± 0.3:1   0.7 ± 0.4:1
1.9 ± 0.6:1   0.6 ± 0.2:1   0.7 ± 0.2:1
1.6 ± 0.4:1   0.8 ± 0.3:1   0.9 ± 0.1:1
1.3 ± 0.7:1   0.7 ± 0.3:1   0.5 ± 0.2:1
1.5 ± 0.5:1   0.9 ± 0.2:1   0.8 ± 0.3:1
1.5 ± 0.6:1   0.8 ± 0.3:1   0.6 ± 0.3:1
List of abbreviations
- ASCB_DmSMOTE: Adaptive Swarm Cluster-Based Dynamic Multi-objective SMOTE
- Swarm Dynamic Multi-objective Rebalancing Algorithm
- Majority class/Minority class
- PSO: Particle Swarm Optimization
- SMOTE: Synthetic Minority Oversampling Technique
- SRA: Swarm Rebalancing Algorithm
As mentioned above, we created and introduced an index called reliable accuracy, the product of Kappa and accuracy. Kappa represents the degree of the classification model's agreement, reliability and credibility; connecting these two indicators therefore assesses the accuracy that can actually be trusted. It also serves as a decision-making strategy for selecting a suitable pair of solutions from the non-inferior set. Figure 5 presents the average Kappa, accuracy and reliable accuracy of each method; the results of the line diagram agree with the discussions of the two box plots above. In the radar chart of Fig. 6, we compare the three commonly used auxiliary evaluation metrics. In our experiment, the F-measure (F1) was almost uninformative, whereas G-mean and Kappa varied nearly consistently, though Kappa is the more sensitive and cautious of the two.
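The reliable-accuracy index defined above is a straightforward product of the two metrics; the numbers below are illustrative, not figures from the experiments.

```python
def reliable_accuracy(kappa, accuracy):
    """Reliable accuracy: the product of Kappa (credibility of the
    model's agreement) and raw accuracy, as defined in the text."""
    return kappa * accuracy

# A model with high accuracy but near-zero Kappa scores poorly,
# while a model strong on both metrics scores highly.
biased = reliable_accuracy(kappa=0.05, accuracy=0.95)
sound = reliable_accuracy(kappa=0.90, accuracy=0.92)
```

The product form means either a low Kappa or a low accuracy is enough to drag the index down, which is exactly the behaviour wanted for ranking imbalanced-data classifiers.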
The last bar diagram, Fig. 7, shows how far the minority class counts deviate from the majority class counts. With reference to Table 6, we find that our methods synthesise many minority class samples; in the new dataset the minority class can even outnumber the majority class, which makes our methods require more processing time, as shown in Fig. 6. However, their performance is better, which also illustrates that an absolutely even class distribution does not necessarily yield the best results.
In this paper, our proposed approach, ASCB_DmSMOTE, overcomes the imbalanced dataset problem in biomedical classification. It re-allocates the majority class into clusters and dynamically optimises the two parameters of SMOTE to synthesise a reasonable number of minority class samples for each sub-dataset, ultimately attaining higher credibility of the classification model together with greater accuracy. The algorithm is a new version of SMOTE; through swarm intelligence, our swarm rebalancing series of algorithms effectively combines over-sampling, under-sampling and ensemble techniques, with the particles' search paths consecutively converging on the best and most reasonable global solution. The new concept of reliable accuracy not only supports decision making but also evaluates a classification model more directly and validly. The method's performance is much steadier than that of the previous versions of our algorithms, and it generates better and more reasonable synthetic data than traditional class rebalancing algorithms. This work offers insights to biomedical practitioners who consider applying computational tools to mitigate the imbalanced dataset problem, which is typically inherent in biomedical data.
The authors are thankful for the financial support from University of Macau, FST and RDAO.
The financial support from the research grant ‘Temporal Data Stream Mining by Using Incrementally Optimised Very Fast Decision Forest (iOVFDF)’, Grant no. MYRG2015-00128-FST, which is offered by the University of Macau, FST and RDAO, is gratefully acknowledged.
Availability of data and materials
SF and KKLW proposed the framework of this paper and gave the directions for all experiments. JL designed and implemented the methods, as well as performed the experiments. SF and KKLW analyzed and confirmed the validity of the experiments. JL, KKLW, and SF interpreted the results and drafted the manuscript. YS, KC, and RW reviewed the paper and gave reasonable comments for improvements. All authors have read and approved the manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Kamal AHM, et al. The impact of gene selection on imbalanced microarray expression data. In: Bioinformatics and Computational Biology. Berlin Heidelberg: Springer; 2009. p. 259–69.
- Dobrev D, Neycheva T, Mudrov N. Simple two-electrode biosignal amplifier. Med Biol Eng Comput. 2005;43(6):725–30.
- Reiner BI. Medical imaging data reconciliation, Part 3: Reconciliation of historical and current radiology report data. J Am Coll Radiol. 2011;8(11):768–71.
- Mandel LR, Borek E. The nature of the RNA synthesized during conditions of unbalanced growth in E. coli K12W-6. Biochemistry. 1963;2(3):560–6.
- Glassner BJ, et al. Generation of a strong mutator phenotype in yeast by imbalanced base excision repair. Proc Natl Acad Sci. 1998;95(17):9997–10002.
- Kusiak A, Kernstine KH, Kern JA, McLaughlin KA, Tseng TL. Data mining: medical and engineering case studies. Cleveland: Industrial Engineering Research Conference; 2000. p. 1–7.
- Fernández-Navarro F, Hervás-Martínez C, Gutiérrez PA. A dynamic over-sampling procedure based on sensitivity for multi-class problems. Pattern Recogn. 2011;44(8):1821–33.
- Fawcett T, Provost FJ. Combining data mining and machine learning for effective user profiling. In: KDD. 1996.
- He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84.
- Thai-Nghe N, Gantner Z, Schmidt-Thieme L. Cost-sensitive learning methods for imbalanced data. In: The 2010 International Joint Conference on Neural Networks (IJCNN). IEEE; 2010.
- Joshi MV, Kumar V, Agarwal RC. Evaluating boosting algorithms to classify rare classes: comparison and improvements. In: Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM). IEEE; 2001. p. 257–64.
- Guo H, Viktor HL. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. ACM SIGKDD Explorations Newsletter. 2004;6(1):30–9.
- Akbani R, Kwek S, Japkowicz N. Applying support vector machines to imbalanced datasets. In: Machine Learning: ECML 2004. Berlin Heidelberg: Springer; 2004. p. 39–50.
- Chawla NV, et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
- Li J, Fong S, Zhuang Y. Optimizing SMOTE by metaheuristics with neural network and decision tree. In: 3rd International Symposium on Computational and Business Intelligence (ISCBI). IEEE; 2015.
- Kennedy J. Particle swarm optimization. In: Encyclopedia of Machine Learning. Springer US; 2010. p. 760–6.
- Marzban C. The ROC curve and the area under it as performance measures. Weather Forecast. 2004;19(6):1106–14.
- Mani I, Zhang I. KNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the Workshop on Learning from Imbalanced Datasets. 2003.
- Tang Y, et al. SVMs modeling for highly imbalanced classification. IEEE Trans Syst Man Cybern B Cybern. 2009;39(1):281–88.
- Viera AJ, Garrett JM. Understanding interobserver agreement: the Kappa statistic. Fam Med. 2005;37(5):360–3.
- Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74.
- Li J, et al. Improving the classification performance of biological imbalanced datasets by swarm optimization algorithms. J Supercomput. 2016;72(10):3708–28.
- Li J, et al. Solving the under-fitting problem for decision tree algorithms by incremental swarm optimization in rare-event healthcare classification. J Med Imaging Health Inform. 2016;6(4):1102–10.
- Fonseca CM, Fleming PJ. Genetic algorithms for multiobjective optimization: formulation, discussion and generalization. In: ICGA. vol. 93; 1993.
- Li J, et al. Adaptive multi-objective swarm crossover optimization for imbalanced data classification. In: Advanced Data Mining and Applications: 12th International Conference, ADMA 2016, Gold Coast, Proceedings. Springer International Publishing; 2016.
- van der Gaag M, et al. The five-factor model of the Positive and Negative Syndrome Scale II: a ten-fold cross-validation of a revised model. Schizophr Res. 2006;85(1):280–7.
- van der Merwe DW, Engelbrecht AP. Data clustering using particle swarm optimization. In: The 2003 Congress on Evolutionary Computation (CEC'03). vol. 1. IEEE; 2003.
- Hartigan JA, Wong MA. Algorithm AS 136: a k-means clustering algorithm. J R Stat Soc Ser C Appl Stat. 1979;28(1):100–8.
- Jo T, Japkowicz N. Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter. 2004;6(1):40–9.
- Yen SJ, Lee YS. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Syst Appl. 2009;36(3):5718–27.
- Han H, Wang WY, Mao BH. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Advances in Intelligent Computing. Berlin Heidelberg: Springer; 2005. p. 878–87.
- Ding Z. Diversified ensemble classifiers for highly imbalanced data learning and their application in bioinformatics. 2011.
- Lichman M. UCI Machine Learning Repository [https://archive.ics.uci.edu/ml/datasets.html]. Irvine, CA: University of California, School of Information and Computer Science; 2013. Accessed 1 Apr 2016.