Skip to main content

Iteratively refining breast cancer intrinsic subtypes in the METABRIC dataset



Multi-gene lists and single sample predictor models have been currently used to reduce the multidimensional complexity of breast cancers, and to identify intrinsic subtypes. The perceived inability of some models to deal with the challenges of processing high-dimensional data, however, limits the accurate characterisation of these subtypes. Towards the development of robust strategies, we designed an iterative approach to consistently discriminate intrinsic subtypes and improve class prediction in the METABRIC dataset.


In this study, we employed the CM1 score to identify the most discriminative probes for each group, and an ensemble learning technique to assess the ability of these probes on assigning subtype labels using 24 different classifiers. Our analysis is comprised of an iterative computation of these methods and statistical measures performed on a set of over 2000 samples. The refined labels assigned using this iterative approach revealed to be more consistent and in better agreement with clinicopathological markers and patients’ overall survival than those originally provided by the PAM50 method.


The assignment of intrinsic subtypes has a significant impact in translational research for both understanding and managing breast cancer. The refined labelling, therefore, provides more accurate and reliable information by improving the source of fundamental science prior to clinical applications in medicine.

Peer Review reports


Translational research aims at bringing basic scientific discoveries into outcomes that help improve clinical decision-making. The PAM50 Breast Cancer Intrinsic Classifier [1] has lately been used to assign the molecular subtypes (luminal A, luminal B, HER2-enriched, basal-like and normal-like [25]) based on shrunken centroids of gene expression profiles [6]. It uses a Single Sample Predictor (SSP) model with an embedded 50-gene assay. In spite of the relevance of this method for clinical management, there are limited investigations in the literature that support this classification approach. Comparison with other methods showed only moderate agreement between subtype labels assigned, as well as independent clinical prognostic information [79].

Other multi-gene signatures have also been reported within the molecular patterns strongly correlated to clinical prognosis [10, 11], disease progression [12, 13], and patient survival [14]. Different methods, however, highlight a variety of gene lists of distinct size due the analysis of diverse microarray data and platform technologies. Additionally, the methods currently applied bring a pragmatic concern of using SSP models for predicting disease subtypes. Multiple classifiers or ensemble learning model, on the other hand, have compensated for poor learning algorithms by performing extra computation [15]. Therefore, there is an urgent need for translating these novel strategy to provide more accurate predictions of clinicopathological outcome.

In 2012, The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) [16] disclosed a rich gene expression cohort widely used for investigating breast cancer diseases. In spite of the quality of this dataset, there are some inconsistencies with regards to the subtype labels assigned in the original cohort [17]. In our previous study [17], a thorough review of the intrinsic subtypes was suggested and is, therefore, mandatory given the importance of this dataset to breast cancer research. For this report, we then propose a more robust approach to iteratively refine the labels in the METABRIC dataset based on ensemble learning. The new labels are yet correlated to well-established clinicopathological markers and patient overall survival.


Transcriptomic datasets

The breast cancer dataset disclosed by the METABRIC study (EGAS00000000083) contains cDNA microarray profiling of about 2000 samples performed on the Illumina HT-12 v3 platform (Illumina_Human_WG-v3) [16]. The samples were originally partitioned into two subsets: Discovery (997 samples) and Validation (989 samples), respectively used as training and test sets in our analysis. In this cohort, tumour samples were assigned on the five intrinsic subtypes (luminal A, luminal B, HER2-enriched, basal-like and normal-like) according to the PAM50 method [1]. The METABRIC study was approved by the Institutional Review Board [16] and our research was authorized by the Human Ethics Research Committee at The University of Newcastle, Australia (H-2013-0277).

Refinement method

The overview of the refinement method applied on the METABRIC dataset is shown in Fig. 1. The process is initialized with the discovery set and the original PAM50 labels as defined in Curtis et al. [16]. After computing the CM1 score, the top ten highly discriminative probes (five with the greatest positive CM1 score values - indicating up-regulated probes relative to the other subtypes, and five with the smallest negative values – representing down-regulation) are chosen for each class. The set of new features is used to train the 24 classifiers from the Weka software suite [18], where a ten-fold cross-validation is performed. If the majority of the classifiers agree on the same label, the sample is assigned with the corresponding subtype; otherwise it is marked as inconsistent and not further considered in the process. The stopping criterion is reached when there are no more changes in the sample labels and feature set, or when the desired Fleiss’ kappa value (κ≥0.92) is achieved between the previous and the current iteration steps (see Section Statistics below for definitions). Values between 0.81 and one are considered to be almost perfect agreement, thus 0.92 is above the average for this interval.

Fig. 1
figure 1

Refinement process. The process is initialized with labels assigned using the PAM50 method. After computing the CM1 score, the top 10 highly discriminative probes are selected for each subtype. This set of features is used to train the 24 distinct classifiers for a 10-fold cross-validation classification. Samples are relabelled (eventually with the same label) if the classifiers agree in at least 50 % of the cases; otherwise they are marked as inconsistent and not further considered in the iteration process. The stopping criterion is reached when there are no more changes in the sample labels or selected feature set, or when the desired Fleiss’ kappa is achieved. After stopping, the final feature set and sample labels are used to classify the samples previously marked as inconsistent or from the validation dataset. These samples are run through the same refinement procedure; inconsistent samples are reclassified and labels are refined

When the stopping condition is fulfilled, the new list of features and sample labels are used for the training-test setting. Samples from the validation dataset or previously marked as inconsistent are then classified by training the classifiers in the refined discovery set. However, in the training-test setting, at least two thirds of classifiers in the ensemble must agree on the same label for it to be assigned to a sample. As a larger dataset is expected to provide more robustness, all the reclassified samples are run through the same refinement procedure again. The final outcome of this process is the set of refined features and the new labels.

Since many classifiers tend to perform best when trained on classes of equal sample size, we adjusted the number of patients in each subtype by looking at the minimum number of samples in one of the subgroups. The normal-like subtype is represented by only 58 samples; thus, the total number of samples used in the training is 290. For each other subtype, 58 samples are randomly chosen from the dataset. The whole process is run ten times due to the interchangeable sample selection that weigh the different gene expression information used for training purposes.

The CM1 score

The CM1 score is a supervised method used to rank the variation of gene expression levels across samples from two different classes (more details in [17, 19]). The measure helps to identify the most discriminative features for each of the five breast cancer intrinsic subtypes: luminal A, luminal B, HER2-enriched, basal-like and normal-like. For a given subtype, we compute the CM1 score for each of the 48,803 probes and select the ten most discriminative ones. This happens iteratively in the refinement process each time the classifiers attribute a new label to a sample.


Several measures have been computed in order to assess the quality of our results. We created a contingency table r×c comparing the predicted labels (rows) and labels from the previous refinement step (columns).

Cramer’s V [20] is used to measure the level of association between sample original and predicted labels. The statistic ranges from zero (no association between the two variables) to one (complete association).

Fleiss’ kappa [21, 22] is a popular interrater reliability metric used to gauge the agreement between the original PAM50 labels and the labels assigned by the majority of classifiers. Kappa values range from ≤0 to 1, where: (1) values ≤0 show a poor agreement; (2) 0≤κ≤0.2, slight agreement; (3) 0.21≤κ≤0.40, fair agreement; (4) 0.41≤κ≤0.60, moderate agreement; (5) 0.61≤κ≤0.80substantial agreement; and (6) 0.81≤κ≤1, almost perfect agreement.

Adjusted Rand Index (ARI) [23, 24] measures the agreement between pairs of samples that are labelled either in the same class or in different classes. Results range from zero (complete discordance between two partitions) to one (perfect concordance between them).

Clinical data and survival analysis

The clinical markers oestrogen and progesterone receptors (ER and PR) and the human epidermal growth factor receptor two (HER2) are compared between original METABRIC labels and refined labels. Survival analysis was also performed, using Cox proportional hazards model from the package survival in the R software [25]. The p-value, used to test the null hypothesis that the curves stratified by subtype are identical in the overall population, is calculated using the log-rank test.

Results and Discussion

Discriminative probes used to assign the intrinsic subtype labels in the refinement process

Samples were assigned into the five intrinsic subtypes based on the majority voting of classifiers (Additional file 1: Table S1), supported by their consistent performance across the ten runs (Additional file 2). During this procedure, 74 discriminative probes appeared (Additional file 1: Table S2) and, among them, 35 were recurrently selected (Fig. 2). Overall, the association between the initial labels and those predicted using the ensemble learning (Table 1) was 0.95 according to Cramer’s V. The consensus of sample labelling across different classifiers measured using Fleiss’ kappa was 0.924. The ARI (1.00) also showed a maximum agreement between pairs of samples that are labelled either in the same class or in different classes.

Fig. 2
figure 2

The heat map of refined intrinsic features selected using CM1 score in the refinement process. The heat map diagram exhibit 35 probes (rows) and 1992 samples (columns) from the discovery and validation sets ordered according to gene expression similarity. For visualisation, the expression values are normalised across the probes using a two-sided threshold of 1 % (for under- and over-expression). The bars on the bottom show the sample distribution according to the refined and original labels assigned to the METABRIC cohort. The subtypes are defined as follow: luminal A (blue), luminal B (green), HER2-enriched (yellow), normal-like (purple), basal-like (red), and inconsistent (grey)

Table 1 Contingency table for predicted labels vs. initial subtypes (rows and columns, respectively)

New subtype labels reveal more reliable distribution of clinical markers and survival curves

We correlated the METABRIC and predicted labels with the current clinical markers ER, PR and HER2. Table 2 shows the changes in number of samples across subtypes, labelled with the PAM50 method and refined labels, respectively. The refinement process improved the overall distribution to what is expected for each class: luminal A (ER+, PR+, HER2–), luminal B (ER+, PR ±, HER2 ±), HER2-enriched (ER–, PR–, HER2+) and basal-like (ER–, PR–, HER2–); especially for HER2-enriched and basal-like subtypes. Samples labelled as inconsistent in our study may also reflect the heterogeneity of the disease and a hint to as-yet improperly characterized molecular subtypes.

Table 2 Number of samples for each clinical marker in the PAM50 subtypes and refined labels

Furthermore, the patient’s overall survival significantly improved across subtypes when the original and refined labels are used to plot the curves for the METABRIC discovery and validation sets (Fig. 3). The groups have a well defined separation after the refinement process (p value 2.8×10−26) compared to the original labels (p value 5.4×10−18). These results also support a better characterization of the intrgroups after the iterative approach.

Fig. 3
figure 3

The survival curves for original and refined labels in the METABRIC discovery and validation sets. Each curve represents the survival probability at a certain time after the diagnosis. Drops in the curve indicate death. The probability of the last ten observations are plotted in dash


The iterative approach using CM1 score and ensemble learning has shown a great potential for predicting more accurate sample subtypes in the METABRIC breast cancer dataset. The refined labels are of great value to breast cancer research and future clinical translational science. Given the relevance of accurate subtype assignments, we encourage researchers to consider the proposed refined labels when analysing the METABRIC dataset.



adjusted rand index


european genome-phenome archive


oestrogen receptor


human epidermal growth factor receptor 2


molecular taxonomy of breast cancer international consortium


progesterone receptor


single sample predictor


  1. Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009; 27(8):1160–1167. doi:10.1200/JCO.2008.18.1370.

    Article  PubMed  PubMed Central  Google Scholar 

  2. Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, et al. Molecular portraits of human breast tumours. Nature. 2000; 406(6797):747–52. doi:10.1038/35021093.

    Article  PubMed  CAS  Google Scholar 

  3. Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A. 2001; 98(19):10869–10874. doi:10.1073/pnas.191367098.

    Article  PubMed  PubMed Central  Google Scholar 

  4. Sørlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A. 2003; 100(14):8418–423. doi:10.1073/pnas.0932692100.

    Article  PubMed  PubMed Central  Google Scholar 

  5. Hu Z, Fan C, Oh DS, Marron JS, He X, Qaqish BF, et al. The molecular portraits of breast tumors are conserved across microarray platforms. BMC Genomics. 2006;7(96). doi:10.1186/1471-2164-7-96.

  6. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc National Acad Sci. 2002; 99(10):6567–572.

    Article  CAS  Google Scholar 

  7. Weigelt B, Mackay A, A’hern R, Natrajan R, Tan DS, Dowsett M, et al. Breast cancer molecular profiling with single sample predictors: a retrospective analysis. Lancet Oncol. 2010; 11(4):339–49. doi:10.1016/s14702045(10)70008-5.

    Article  PubMed  CAS  Google Scholar 

  8. Ebbert M, Bastien R, Boucher K, Martin M, Carrasco E, Caballero R, et al. Characterization of uncertainty in the classification of multivariate assays: application to pam50 centroid-based genomic predictors for breast cancer treatment plans. J Clin Bioinforma. 2011; 1(1):37. doi:10.1186/2043-9113-1-37.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Haibe-Kains B, Desmedt C, Loi S, Culhane AC, Bontempi G, Quackenbush J, et al. A three-gene model to robustly identify breast cancer molecular subtypes. J National Cancer Inst. 2012; 104(4):311–25. doi:10.1093/jnci/djr545.

    Article  CAS  Google Scholar 

  10. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. The Lancet. 2005; 365(9460):671–9. doi:10.1016/S0140-6736(05)17947-1.

    Article  CAS  Google Scholar 

  11. Fan C, Prat A, Parker J, Liu Y, Carey L, Troester M, Perou C. Building prognostic models for breast cancer patients using clinical variables and hundreds of gene expression signatures. BMC Medical Genomics. 2011; 4(1):3. doi:10.1186/1755-8794-4-3.

    Article  PubMed  PubMed Central  Google Scholar 

  12. Venet D, Dumont JE, Detours V. Most random gene expression signatures are significantly associated with breast cancer outcome. PLoS Comput Biol. 2011; 7(10):1002240. doi:10.1371/journal.pcbi.1002240.

    Article  Google Scholar 

  13. Seoane JA, Day INM, Gaunt TR, Campbell C. A pathway-based data integration framework for prediction of disease progression. Bioinforma. 2014; 30(6):838–45. doi:10.1093/bioinformatics/btt610.

    Article  CAS  Google Scholar 

  14. Naderi A, Teschendorff AE, Barbosa-Morais NL, Pinder SE, Green AR, Powe DG, et al. A gene-expression signature to predict survival in breast cancer across independent data sets. Oncogene. 2006; 26(10):1507–1516. doi:10.1038/sj.onc.1209920.

    Article  PubMed  Google Scholar 

  15. Gómez Ravetti M, Moscato P. Identification of a 5-protein biomarker molecular signature for predicting alzheimer’s disease. PLoS ONE. 2008; 3(9):3111. doi:10.1371/journal.pone.0003111.

    Article  Google Scholar 

  16. Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature. 2012; 486(7403):346–52. doi:10.1038/nature10983.

    PubMed  CAS  PubMed Central  Google Scholar 

  17. Milioli HH, Vimieiro R, Riveros C, Tishchenko I, Berretta R, Moscato P. The discovery of novel biomarkers improves breast cancer intrinsic subtype prediction and reconciles the labels in the metabric data set. PLoS ONE. 2015; 10(7):0129711. doi:10.1371/journal.pone.0129711.

    Article  Google Scholar 

  18. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann; 2005.

    Google Scholar 

  19. Marsden J, Budden D, Craig H, Moscato P. Language Individuation and Marker Words: Shakespeare and His Maxwell’s Demon. PLoS ONE. 2013; 8(6):66813. doi:10.1371/journal.pone.0066813.t001.

    Article  Google Scholar 

  20. Liebetrau AM, Vol. 32. Measures of Association. Beverly Hills, CA: SAGE Publications, Inc; 1983.

    Google Scholar 

  21. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971; 76(5):378–82.

    Article  Google Scholar 

  22. Fleiss JL, Levin B, Paik MC. The Measurement of Interrater Agreement. New York: John Wiley & Sons, Inc; 2004, pp. 598–626. Chap. 18.

    Google Scholar 

  23. Hubert L, Arabie P. Comparing partitions. J Classif. 1985; 2(1):193–218. doi:10.1007/BF01908075.

    Article  Google Scholar 

  24. Vinh NX, Epps J, Bailey J. Information Theoretic Measures for Clusterings Comparison: Is a Correction for Chance Necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning: 2009. p. 1073–1080. ACM.

  25. Therneau TM, Grambsch PM. Modeling Survival Data: Extending the Cox Model. New York: Springer; 2000.

    Book  Google Scholar 

Download references


This study makes use of data generated by the Molecular Taxonomy of Breast Cancer International Consortium. Funding for the project was provided by Cancer Research UK and the British Columbia Cancer Agency Branch.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Pablo Moscato.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

HHM, RV, CR and PM conceived the original design for the iterative approach. RV, CR and IT led the algorithm development and application. HHM drafted the manuscript. All authors assisted with the editing and reviewing. All authors read and approved the final manuscript.


Australian Research Council (ARC) Future Fellowship FT120100060 and ARC Discovery Project DP120102576; Cancer Institute of New South Wales, Big Data Big Impact Grant 13/DATA/1-03.

Additional files

Additional file 1

Refined subtype labels and intrinsic probes. The refined breast cancer subtype labels defined for each sample in the METABRIC dataset are listed in Table S1. Table S2 shows the annotated probes selected in the CM1 list and the average occurrence of each probe. (XLSX 58 kb)

Additional file 2

Classifiers Performance. The document contains information on the ensemble learning approach with regards to the performance of each classifier. (PDF 1280 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Milioli, H.H., Vimieiro, R., Tishchenko, I. et al. Iteratively refining breast cancer intrinsic subtypes in the METABRIC dataset. BioData Mining 9, 2 (2016).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: