We compared the performance of SSCC with SSC, LCE and *k*-means. Each pairwise comparison provides information on the effect of either semi-supervision or consensus clustering. Specifically, comparing LCE with *k*-means reveals the effectiveness of the ensemble strategy, since *k*-means is used as the base clustering in LCE. Similarly, SSC and SSCC were given the same amount of prior knowledge, so comparing them amounts to comparing spectral clustering with consensus clustering. The comparison between SSCC and LCE reveals the effect of semi-supervision under the consensus clustering paradigm.

SSCC significantly outperforms SSC with or without prior knowledge. This shows that consensus clustering outperforms a single clustering algorithm on these gene expression datasets, an observation consistent with [1–4].

We compared SSCC with LCE using the same datasets and the same parameter settings. Without prior knowledge, the difference between SSCC and LCE lies in the base clustering: SSCC uses spectral clustering, whereas LCE uses *k*-means. Both use spectral clustering for the final clustering (Table 1). Without prior knowledge, SSC reduces to SC, and SC outperforms *k*-means on all 8 datasets (Figures 1, 2 and Table 3). This indicates that the performance of the base clustering has a significant influence on the results of consensus clustering.

SSCC consists of spectral clustering and LCE. Most of the computational time of spectral clustering is spent on finding the $t$ nearest neighbors [20]. The time complexity of obtaining the $t$-nearest-neighbor sparse matrix is $O(n^2 d) + O(n^2 \log t)$, where $n$ is the number of samples and $d$ is the number of genes in the graph of spectral clustering. Because we use a fixed number of clusters $k$ in LCE, the time complexity of generating a cluster-association matrix $R$ is $O(m^2 k^2 + nmk) + O(m^2 k^2 t' + nmk)$, where $m$ is the ensemble size and $t'$ is the average number of neighbors connecting to one cluster in a network of clusters in final clustering. In SSCC, the complexity of generating $l$ pairwise constraints is $O(l)$. The overall time complexity of SSCC using the “Fixed k + subspace” ensemble type is the sum of these terms, with the spectral clustering cost incurred for each of the $m$ base clusterings.

Since $n > m$, $n > k$, $d > n$, $d > l$, and $d > t$ in our experiments, the bottleneck of SSCC is finding the $t$ nearest neighbors, with computational time $O(m n^2 d)$. The implementation of spectral clustering is a parallel algorithm [20], so the dominant computational time of SSCC can be reduced to $O(m n^2 d / p')$, where $p'$ is the number of parallel threads. The applicability of SSCC to large datasets is limited by the computational complexity of spectral clustering. SSCC can be improved by adopting faster spectral clustering algorithms, which are applicable to datasets with thousands of instances.

Our study provides insight into the contribution of consensus clustering and semi-supervised clustering to the clustering results. To our knowledge, the Knowledge based Cluster Ensemble (KCE) [14] is the only algorithm that uses prior knowledge in the consensus clustering paradigm for gene expression datasets. Unfortunately, we were unable to compare SSCC with KCE directly because the software is unavailable.

Our study uses SSCC for clustering samples. Since the optimal number of clusters (*k* in the *k*-means algorithm) and the class label of each sample are known, the prior knowledge is derived from the given class structure. A *must-link* constraint is assigned to a pair of samples if they are from the same class. In many real applications, we might not know the whole class structure, but we most likely know whether some of the samples are in the same class (cluster). We can generate *must-links* between these samples, and the prior knowledge is derived from them. For these cancer gene expression datasets, we validated the performance of SSCC with the labeled data. The next step would be to apply SSCC to clustering genes for gene function prediction. However, the performance on clustering genes might vary for two reasons: the quality of prior knowledge and the optimal number of clusters. Pairwise constraints in this study were generated from the class labels of samples in the cancer gene expression datasets, so they are true prior knowledge. Prior knowledge in the clustering of genes will be known gene functions, which are partial domain knowledge. A gene may have multiple functions, and some functions include others. For example, the level 6 gene ontology term apoptotic process (GO:0006915) has over ten thousand gene products, and under it, at level 7, there are 21 GO terms. Our earlier work shows that more specific (higher level) GO terms contribute more to semi-supervised clustering results [13]. In addition, the description of a given gene function is based on current knowledge in the domain, and such knowledge is often subject to change. For example, current knowledge of certain genes is limited and will gradually be enriched. Therefore, the prior knowledge generated from a pair of genes most likely contains some noise, which subsequently influences the results.
The optimal number of clusters is often unknown, and a different distance measure would yield a different optimal number of clusters. Therefore, for the comparison of semi-supervised clustering algorithms, it is better to use well-defined prior knowledge, such as the sample labels we used in this paper. Once an algorithm is shown to be superior to the others, it can be used to cluster genes.
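As an illustration of how *must-link* constraints can be derived from known class labels, the following Python sketch samples same-class pairs at random. The function name `sample_must_links`, the `seed` parameter, and the example labels are hypothetical and not taken from the paper. For clarity the sketch enumerates all sample pairs before sampling, which costs more than the *O*(*l*) generation step discussed earlier.

```python
import random
from itertools import combinations


def sample_must_links(labels, num_constraints, seed=0):
    """Sample must-link pairs (i, j), i < j, whose samples share a class label.

    Illustrative sketch only: uniform sampling without replacement is an
    assumption, not necessarily the procedure used in the paper.
    """
    # All pairs of sample indices that belong to the same class.
    same_class = [
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    ]
    rng = random.Random(seed)
    # Never request more constraints than there are same-class pairs.
    count = min(num_constraints, len(same_class))
    return rng.sample(same_class, count)


# Hypothetical class labels for five samples.
example_labels = ["ALL", "ALL", "AML", "AML", "ALL"]
example_links = sample_must_links(example_labels, num_constraints=3)
```

When only part of the class structure is known, the same function can be applied to the labeled subset of samples alone.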

In practice, obtaining a large amount of prior knowledge for gene expression datasets is difficult. Designing algorithms that work well with a small amount of prior knowledge, such as fewer than 20 pairwise constraints, would be very useful for clustering microarray data. A study on semi-supervised clustering shows that with small amounts of prior knowledge, search-based approaches tend to outperform similarity-based approaches [31]. With larger amounts of labeled data, similarity-based approaches tend to perform better, and combining both outperforms either approach alone. SSC is a similarity-based semi-supervised clustering algorithm. The results in Figures 1 and 2 show that the performance of SSCC and SSC improves slightly with small numbers of constraints and significantly with increasing numbers of constraints. The SSCC method presented in this paper is applicable not only to gene expression data but also to other types of data, as long as prior knowledge is provided.