LVQ-SMOTE – Learning Vector Quantization based Synthetic Minority Over-sampling Technique for biomedical data
© Nakamura et al.; licensee BioMed Central Ltd. 2013
Received: 18 March 2013
Accepted: 24 September 2013
Published: 2 October 2013
Over-sampling methods based on the Synthetic Minority Over-sampling Technique (SMOTE) have been proposed for classification problems involving imbalanced biomedical data. However, the existing over-sampling methods achieve only slightly better, and sometimes worse, results than the simplest SMOTE. To improve the effectiveness of SMOTE, this paper presents a novel over-sampling method that uses codebooks obtained by learning vector quantization. In general, even when an existing SMOTE is applied to a biomedical dataset, the empty feature space remains so large that most classification algorithms do not estimate the borderlines between classes well. To tackle this problem, our over-sampling method generates synthetic samples that occupy more of the feature space than those of the other SMOTE algorithms. In brief, our method generates useful synthetic samples by referring to actual samples taken from real-world datasets.
Experiments on eight real-world imbalanced datasets demonstrate that our proposed over-sampling method performs better than the simplest SMOTE on four of five standard classification algorithms. Moreover, the performance of our method increases further when MWMOTE, a recent SMOTE variant, is used in our algorithm. Experiments on datasets for β-turn type prediction show some important patterns that have not been seen in previous analyses.
The proposed over-sampling method generates useful synthetic samples for the classification of imbalanced biomedical data. In addition, the proposed method is compatible with basic classification algorithms and with the existing over-sampling methods.
With the arrival of the big data society, the number of imbalanced biomedical datasets has increased, for example in microRNA gene prediction and the detection of non-coding RNA. The classification of imbalanced biomedical data has become one of the major issues in bioinformatics. The common understanding of imbalanced data in the community is that the majority samples outnumber the minority samples. The main problem of class imbalance is that most standard classification algorithms perform poorly because they assume or expect balanced class distributions.
Approaches to the class imbalance problem fall broadly into two groups: the “classification level” and the “data level”. Approaches at the classification level aim at adjusting the induction rules that describe the minority concepts, which are often weaker than those of the majority concepts. One of the major approaches at the classification level is boosting. The idea of boosting is to increase the weights of misclassified samples and thereby reduce the bias of class-imbalance learning. Another approach at the classification level is tree-based learning, such as C4.5 and Random Forest. For example, the Random Forest classifier creates many of the minority concepts to avoid biased learning.
Approaches at the data level modify an imbalanced dataset to obtain a balanced distribution. There are two major methods at the data level, namely over-sampling and under-sampling. Over-sampling increases the number of samples in the minority class, while under-sampling decreases the number of samples in the majority class. Both methods aim at achieving a well-balanced class distribution. In general, under-sampling is used to reduce the learning time of a classification algorithm when the dataset is large enough to represent the characteristics of the data, while over-sampling is used to increase the performance of a classification algorithm. Since data-level approaches are independent of the classification algorithm, they are more flexible than classification-level approaches.
SMOTE (Synthetic Minority Over-sampling Technique) is a powerful over-sampling method that has shown a great deal of success in class imbalance problems. The SMOTE algorithm calculates distances in the feature space between minority examples and creates synthetic data along the line between a minority example and one of its selected nearest neighbors. Han et al. developed a modified SMOTE called borderline-SMOTE. The concept of their method is to generate synthetic samples near class boundaries. Their algorithms are particularly effective for binary class problems with two features. However, biomedical data such as gene expression data are often complex and can contain thousands of features. Chen et al. presented an adaptive synthetic data generation technique called RAMO. They have shown in their experiments that adaptive boosting often increases the performance of the simplest SMOTE. Barua et al. developed a novel over-sampling method called MWMOTE, which generates synthetic samples in clusters of informative minority class samples. Their experiments show that MWMOTE outperforms RAMO and SMOTE on various benchmark datasets, including biomedical data.
The existing over-sampling methods based on SMOTE achieve only slightly better, and sometimes worse, results than the simplest SMOTE. One of the reasons is that even when an existing SMOTE is successfully applied to a biomedical dataset, the empty feature space remains so large that it is difficult for classification algorithms to estimate proper borderlines between classes. As a solution to this problem, this paper presents a novel over-sampling method using codebooks obtained by LVQ (Learning Vector Quantization). The proposed method generates synthetic samples that occupy more of the feature space than those of the existing SMOTE algorithms.
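The interpolation idea behind the simplest SMOTE can be illustrated with a minimal sketch. It is not the reference implementation: the function name `smote`, its parameters, and the random choice of the base sample are assumptions made for illustration (the original algorithm iterates over every minority sample).

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base sample (excluding itself)
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_synthetic=10, k=2)
```

Because every synthetic point is a convex combination of two minority samples, the new points never leave the region already spanned by the minority class, which is the limitation the empty-feature-space argument refers to.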
Learning Vector Quantization
Kohonen developed various modified versions of LVQ, namely LVQ2.1, LVQ3, OLVQ3, Multiple-pass LVQ, and Hierarchical LVQ. The algorithms differ in how they determine the position of each codebook.
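As a point of reference for these variants, the basic LVQ1 update rule can be sketched as follows. This is an illustrative sketch only, not the OLVQ3 procedure used in the experiments; the function names, the linearly decaying learning rate, and the toy data are assumptions.

```python
import math


def nearest_codebook(codebooks, x):
    """Index of the codebook vector closest to x (Euclidean distance)."""
    return min(range(len(codebooks)), key=lambda i: math.dist(codebooks[i][0], x))


def lvq1_train(samples, labels, codebooks, lr=0.1, epochs=20):
    """LVQ1: move the winning codebook toward a sample of the same class
    and away from a sample of a different class.
    codebooks is a list of (vector, class_label) pairs."""
    books = [(list(vec), cls) for vec, cls in codebooks]
    for epoch in range(epochs):
        alpha = lr * (1 - epoch / epochs)  # assumed linear decay schedule
        for x, y in zip(samples, labels):
            i = nearest_codebook(books, x)
            vec, cls = books[i]
            sign = 1.0 if cls == y else -1.0
            books[i] = ([v + sign * alpha * (xi - v) for v, xi in zip(vec, x)], cls)
    return books


def predict(codebooks, x):
    """Class of an unknown sample = class of its nearest codebook."""
    return codebooks[nearest_codebook(codebooks, x)][1]


samples = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]]
labels = [0, 0, 1, 1]
initial = [([0.2, 0.2], 0), ([0.8, 0.8], 1)]
trained = lvq1_train(samples, labels, initial)
```

The variants listed above change how and when the codebooks are moved (for example, by updating two codebooks at once or by adapting the learning rate per codebook), but the nearest-codebook classification step stays the same.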
The proposed over-sampling method
As described in the previous section, the codebooks for each feature in a target dataset are used to determine the class of an unknown sample. Hence, if the codebooks in the target dataset are similar to those in a reference dataset, the samples in the reference dataset are expected to provide the target dataset with informative data for its classification problem. Based on this idea, this paper presents a method of generating synthetic samples using real samples taken from reference datasets according to a similarity measure of codebooks.
Here, we consider the case in which T_1 is linked to R_1. Figure 4 shows an example of synthetic samples generated by our proposed method. As the figure shows, the samples in R_1 are added to T_1. If the dataset T has more than 3 features, the proposed method determines the numerical values of each of the other features by the following algorithm.
1. Find the nearest sample for each of the generated synthetic samples according to the Mahalanobis distance.
2. Copy the numerical values of the other features in the nearest sample to the other features of the generated synthetic sample.
The procedure above is conducted for every set of two features in the training dataset, namely from T_1 to T_nc. Finally, the SMOTE algorithm is applied to T to obtain a balanced class distribution.
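The two steps above can be sketched for a single pair of features as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the nearest sample is searched among the real training samples, and the covariance matrix is estimated from those samples and assumed to be non-singular.

```python
def covariance_2d(points):
    """Sample covariance (sxx, sxy, syy) of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def mahalanobis_sq(a, b, cov):
    """Squared Mahalanobis distance between 2-D points a and b."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # assumed non-zero
    dx, dy = a[0] - b[0], a[1] - b[1]
    # d^T S^-1 d with S^-1 = (1/det) * [[syy, -sxy], [-sxy, sxx]]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def fill_other_features(synthetic_2d, real_samples, pair):
    """Complete 2-D synthetic points to full feature vectors by copying the
    remaining features from the nearest real sample."""
    i, j = pair
    projected = [(s[i], s[j]) for s in real_samples]
    cov = covariance_2d(projected)
    completed = []
    for point in synthetic_2d:
        nearest = min(
            range(len(real_samples)),
            key=lambda k: mahalanobis_sq(point, projected[k], cov),
        )
        full = list(real_samples[nearest])  # copy the other features
        full[i], full[j] = point            # keep the synthetic pair values
        completed.append(full)
    return completed


real = [[0.0, 0.0, 10.0], [1.0, 0.2, 20.0], [0.2, 1.0, 30.0], [1.0, 1.0, 40.0]]
completed = fill_other_features([(0.9, 0.95)], real, pair=(0, 1))
```

In this toy example the synthetic point lies closest to the fourth real sample, so its third feature is copied from that sample.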
Results and discussion
Benchmark datasets used for our experiments
Class ratios of the eight datasets: 0.35 : 0.65; 0.23 : 0.77; 0.35 : 0.65; 0.36 : 0.64; 0.34 : 0.66; 0.35 : 0.65; 0.097 : 0.903; 0.034 : 0.966
Moreover, we performed β-turn type prediction on the BT547 and BT823 datasets. β-turns are classified into nine types based on the dihedral angles of the two central residues in the turn. In this paper, we aim at improving the prediction accuracy of DEBT, which is one of the state-of-the-art methods for predicting β-turn types. We obtained the datasets used for training and testing DEBT, which are available online at http://comp.chem.nottingham.ac.uk/debt/.
Parameter configuration for the proposed over-sampling method
As shown in Figure 1, normalization and a feature selection method are executed in the proposed method. In our experiments, normalization was applied to rescale the feature values to the range from 0 to 1. Then principal component analysis, used as the feature selection method, extracted 10 useful features according to the component scores in ascending order.
For the SMOTE techniques evaluated in the following section, five nearest neighbors were used when generating synthetic samples. We selected Optimized Learning Vector Quantization 3 (OLVQ3) as the LVQ algorithm, with the number of codebooks set to two.
To demonstrate the versatility of our proposed method, we selected widely used basic classification algorithms, namely SVM (Support Vector Machine), Logistic Tree, Neural Network, Naive Bayes, Random Forest (RF), and OLVQ3. SVM, a powerful algorithm for two-class classification, was implemented using the LIBSVM package, with all parameters left at their defaults and the radial basis function selected as the kernel. The other algorithms were implemented using the Weka 3-7-9 package. Because we aim at evaluating our over-sampling method, we configured these algorithms for general performance rather than optimizing them. After some preliminary runs, the number of trees in Random Forest was set to 200 and the number of codebooks in OLVQ3 was set to 600 to increase the performance of RF and OLVQ3, respectively; all other parameters were left at their defaults. In Weka 3-7-9, the default number of trees in RF is 10, and we found in preliminary experiments that 10 trees were insufficient to deal with several thousand features. Similarly, we increased the number of codebooks in OLVQ3 from the default value of 20.
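The normalization step can be illustrated with a short sketch of min-max scaling. The function name and the handling of constant columns are assumptions, and the subsequent principal component analysis is not shown.

```python
def min_max_scale(rows):
    """Rescale every feature column to the range [0, 1].
    Constant columns are mapped to 0.0 (an assumption for this sketch)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


scaled = min_max_scale([[1, 10], [2, 20], [3, 30]])
# scaled == [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Scaling all features to a common range keeps features with large numerical values from dominating the distance computations used by SMOTE and LVQ.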
Classification results on the eight imbalanced datasets
Here, TP is the number of true positives (correctly identified as sick), FP is the number of false positives (incorrectly identified as sick), and TN is the number of true negatives (correctly identified as healthy).
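For reference, the measures reported in the tables below can be computed as in the following sketch. It assumes the standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and G-mean as the geometric mean of the two. FN, the number of false negatives (sick cases incorrectly identified as healthy), is not defined in the surviving text and is introduced here as an assumption.

```python
import math


def sensitivity(tp, fn):
    """Fraction of actual positives that are correctly identified."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that are correctly identified."""
    return tn / (tn + fp)


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))


# Imbalanced example: 10 positives and 100 negatives.
score = g_mean(tp=8, fp=10, tn=90, fn=2)  # sqrt(0.8 * 0.9)
```

Unlike plain accuracy, the G-mean drops to zero when a classifier ignores the minority class entirely, which is why it is a common choice for imbalanced data.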
Average G-mean for three cases ("Nothing" is the baseline)
Sensitivity, Specificity, and G-mean for each of the datasets
G-mean for our proposed method (LVQ-SMOTE) when MWMOTE is used instead of SMOTE in our algorithm
β-turn types prediction
Results of β-turn prediction on the BT547 and BT823 datasets
Comparison of MCC scores between DEBT + our method, DEBT, and another β-turn type prediction method
This paper has presented a new over-sampling method using codebooks obtained by Learning Vector Quantization. In general, even when an existing SMOTE is applied to a biomedical dataset, it is still difficult to estimate proper borderlines between classes. To tackle this problem, we have proposed generating synthetic samples using codebooks obtained by learning vector quantization. The experimental results on eight real-world benchmark datasets have shown that the proposed over-sampling method generates useful synthetic samples for the classification of imbalanced biomedical data. The proposed method is expected to be compatible with basic classification algorithms and with the existing over-sampling methods. In addition, experiments on datasets for β-turn type prediction show that our proposed method improves the prediction of β-turn types IV and VIII.
In future work, we plan to analyze benchmark datasets to extract more effective codebooks. Moreover, we would like to improve the way the proposed algorithm generates synthetic samples.
- Batuwita R, Palade V: MicroPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics. 2009, 25 (8): 989-995. 10.1093/bioinformatics/btp107.
- Yu C, Chou L, Chang D: Predicting protein-protein interactions in unbalanced data using the primary structure of proteins. BMC Bioinformatics. 2010, 11 (167): 1-10.
- Haibo H: Learning from imbalanced data. IEEE Trans Knowledge Data Eng. 2009, 21 (9): 1263-1284.
- Freund Y: Boosting a weak learning algorithm by majority. Inform Comput. 1995, 121 (2): 256-285. 10.1006/inco.1995.1136.
- Quinlan R: C4.5: Programs for Machine Learning. 1993, San Francisco: Morgan Kaufmann Publishers.
- Breiman L: Random forests. Mach Learn. 2001, 45: 5-32. 10.1023/A:1010933404324.
- Chawla N, Bowyer K, Hall L, Kegelmeyer W: SMOTE: synthetic minority over-sampling technique. J Art Intell Res. 2002, 16: 321-357.
- Han H, Wang WY, Mao BH: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Proc of the 2005 International Conference on Advances in Intelligent Computing. 2005, Hefei: Springer, 878-887.
- Chen S, He H, Garcia E: RAMOBoost: ranked minority oversampling in boosting. IEEE Trans Neural Netw. 2010, 21 (10): 1624-1642.
- Barua S, Islam M, Yao X, Murase K: MWMOTE – majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans Knowledge Data Eng. 2012 (PrePrint), doi:10.1109/TKDE.2012.232.
- Kohonen T: Learning vector quantization. The Handbook of Brain Theory and Neural Networks. 1995, Cambridge: MIT Press, 537-540.
- Frank A, Asuncion A: UCI Machine Learning Repository. 2010, Irvine, http://archive.ics.uci.edu/ml/
- Kohonen T: LVQ PAK: The Learning Vector Quantization Program Package. 1996, http://www.cis.hut.fi/research/lvq_pak/
- Alon U, Barkai N, Notterman D, Gish K, Barra S, Mack D, Levine A: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA. 1999, 96: 6745-6750. 10.1073/pnas.96.12.6745.
- Golub T: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531-537. 10.1126/science.286.5439.531.
- Fuchs P, Alix A: High accuracy prediction of beta-turns and their types using propensities and multiple alignments. Proteins. 2005, 59 (4): 828-839. 10.1002/prot.20461.
- Hutchinson E, Thornton J: A revised set of potentials for beta-turn formation in proteins. Protein Sci. 1994, 3 (12): 2207-2216. 10.1002/pro.5560031206.
- Kountouris P, Hirst J: Predicting β-turns and their types using predicted backbone dihedral angles and secondary structures. BMC Bioinformatics. 2010, 11 (407): 1-11.
- Cortes C, Vapnik V: Support-vector networks. Mach Learn. 1995, 20 (3): 273-297.
- Marc S, Eibe F, Mark H: Speeding up logistic model tree induction. Proc of 9th European Conference on Principles and Practice of Knowledge Discovery in Databases. 2005, Porto: Springer, 675-683.
- Rumelhart D, Hinton G, Williams R: Learning Internal Representations by Error Propagation, Volume 1. 1986, Cambridge: MIT Press.
- George H, Pat L: Estimating continuous distributions in Bayesian classifiers. Proc of the Eleventh Conference on Uncertainty in Artificial Intelligence. 1995, San Francisco: Morgan Kaufmann Publishers Inc., 338-345.
- Chang C, Lin J: LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol. 2011, 2 (27): 531-537.
- Mark H, Eibe F, Geoffrey H, Bernhard P, Peter R, Ian H: Weka 3: data mining software in Java. ACM SIGKDD Explorations Newsletter; 2009. Machine Learning Group at the University of Waikato. http://www.cs.waikato.ac.nz/ml/weka/
- Yaov F, Robert E: A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1995, 55: 119-139.
- Shi X, Hu X, Li S, Liu X: Prediction of β-turn types in protein by using composite vector. J Theor Biol. 2011, 286 (1): 24-30.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.