Our semi-supervised consensus clustering algorithm (SSCC) comprises a base clustering, a consensus function, and a final clustering. Within this consensus clustering framework, SSCC uses semi-supervised spectral clustering (SSC) as the base clustering, the hybrid bipartite graph formulation (HBGF) as the consensus function, and spectral clustering (SC) as the final clustering.
Spectral clustering
The general idea of SC involves two steps: spectral representation and clustering. In spectral representation, each data point is associated with a vertex in a weighted graph; the clustering step then finds partitions of the graph. Given a dataset $X = \{x_i \mid i = 1, \ldots, n\}$ and similarities $s_{ij} \geq 0$ between data points $x_i$ and $x_j$, the clustering process first constructs a similarity graph $G = (V, E)$, $V = \{v_i\}$, $E = \{e_{ij}\}$, to represent the relationships among the data points, where each node $v_i$ represents a data point $x_i$, and each edge $e_{ij}$ represents the connection between two nodes $v_i$ and $v_j$ if their similarity $s_{ij}$ satisfies a given condition. The edge between nodes is weighted by $s_{ij}$. The clustering process then becomes a graph-cutting problem: edges within a group should have high weights, and edges between different groups should have low weights. The weighted similarity graph can be a fully connected graph or a t-nearest-neighbor graph. In a fully connected graph, the Gaussian similarity function $s_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ is usually used, where the parameter $\sigma$ controls the width of the neighborhoods. In a t-nearest-neighbor graph, $x_i$ and $x_j$ are connected with an undirected edge if $x_i$ is among the t nearest neighbors of $x_j$ or vice versa. We used the t-nearest-neighbor graph for the spectral representation of gene expression data.
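As an illustration of this spectral representation, the following Python sketch builds a t-nearest-neighbor similarity graph weighted by the Gaussian similarity function; the values t = 5 and the fixed σ are illustrative assumptions, not parameters from our experiments.

```python
# Sketch of the spectral representation step: a t-nearest-neighbor
# similarity graph built from an n x d expression matrix X.
import numpy as np
from sklearn.metrics import pairwise_distances

def tnn_similarity_graph(X, t=5, sigma=1.0):
    """Return an n x n weighted adjacency matrix of the t-NN graph."""
    dist = pairwise_distances(X)                 # Euclidean distances
    S = np.exp(-dist**2 / (2.0 * sigma**2))      # Gaussian similarity s_ij
    n = X.shape[0]
    W = np.zeros((n, n))
    # Keep s_ij if x_j is among the t nearest neighbors of x_i ...
    nn = np.argsort(dist, axis=1)[:, 1:t + 1]    # column 0 is the point itself
    for i in range(n):
        W[i, nn[i]] = S[i, nn[i]]
    # ... "or vice versa": an edge survives if either endpoint selects it.
    return np.maximum(W, W.T)
```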
Semi-supervised spectral clustering
SSC incorporates prior knowledge into spectral clustering in the form of pairwise constraints drawn from domain knowledge. A pairwise constraint between two data points is either a must-link (the points belong to the same class) or a cannot-link (the points belong to different classes). For each must-link pair $(i, j)$, we assign $s_{ij} = s_{ji} = 1$; for each cannot-link pair $(i, j)$, we assign $s_{ij} = s_{ji} = 0$.
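A minimal sketch of this constraint encoding, assuming the constraints are given as lists of index pairs into the similarity matrix S:

```python
import numpy as np

def apply_pairwise_constraints(S, must_links, cannot_links=()):
    """Force s_ij = s_ji = 1 for must-links and 0 for cannot-links."""
    S = np.asarray(S, dtype=float).copy()
    for i, j in must_links:
        S[i, j] = S[j, i] = 1.0
    for i, j in cannot_links:
        S[i, j] = S[j, i] = 0.0
    return S
```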
When SSC is used to cluster samples in gene expression data with the t-nearest-neighbor graph representation, two samples with highly similar expression profiles are connected in the graph. Applying a cannot-link sets the similarity between a pair of samples to 0, which breaks the edge between them in the graph. Therefore, only must-links are applied in our study. The details of the SSC algorithm are described in Algorithm 1. Given the data points $x_1, \ldots, x_n$, $l$ pairwise must-link constraints are generated. The similarity matrix $S$ is obtained using the similarity function $s_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, where $\sigma$ is the scaling parameter that determines when two points are considered similar; it was calculated according to [15]. $S$ is then made sparse by keeping only the t nearest neighbors of each data point, after which the $l$ pairwise constraints are applied to $S$. Steps 5-10 follow the normalized spectral clustering algorithm [16, 17].
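The sketch below strings these steps together, assuming a local-scaling choice of σ in the spirit of [15] (the distance to the t-th neighbor) and the Ng-Jordan-Weiss form of normalized spectral clustering for steps 5-10; it is an approximation of Algorithm 1, not a verbatim implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

def ssc(X, must_links, k, t=5):
    """Semi-supervised spectral clustering sketch: returns cluster labels."""
    n = X.shape[0]
    dist = pairwise_distances(X)
    # Local scaling: sigma_i = distance to the t-th neighbor (spirit of [15]).
    sigma = np.sort(dist, axis=1)[:, t]
    S = np.exp(-dist**2 / (sigma[:, None] * sigma[None, :]))
    # Sparsify S: keep only the t nearest neighbors of each point.
    nn = np.argsort(dist, axis=1)[:, 1:t + 1]
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), t)
    W[rows, nn.ravel()] = S[rows, nn.ravel()]
    W = np.maximum(W, W.T)
    # Apply the l must-link constraints: s_ij = s_ji = 1.
    for i, j in must_links:
        W[i, j] = W[j, i] = 1.0
    # Steps 5-10: normalized spectral clustering [16, 17].
    d = W.sum(axis=1)
    D_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L_sym = (W * D_inv_sqrt[:, None]) * D_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, -k:]                              # top-k eigenvectors
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```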
Consensus function
We used the LCE ensemble framework in our SSCC, adopting HBGF as the consensus function. The cluster ensemble is represented as a graph consisting of vertices and weighted edges. HBGF models both the instances and the clusters of the ensemble simultaneously as vertices in the graph. This approach retains all information provided by a given ensemble, allowing the similarities among instances and among clusters to be considered collectively in forming the final clustering [18]. More details about LCE can be found in [4].
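To make the bipartite representation concrete, the sketch below builds the instance-cluster incidence matrix that underlies an HBGF-style graph; the input format (one label vector per base clustering) is our assumption for illustration.

```python
import numpy as np

def hbgf_incidence(labels_list):
    """n x g incidence matrix: entry (i, c) = 1 iff instance i is in cluster c.

    Instances and clusters are both vertices of the bipartite graph;
    this matrix encodes its edges."""
    n = len(labels_list[0])
    blocks = []
    for labels in labels_list:                   # one base clustering at a time
        labels = np.asarray(labels)
        B = np.zeros((n, labels.max() + 1))
        B[np.arange(n), labels] = 1.0            # crisp membership
        blocks.append(B)
    return np.hstack(blocks)                     # g = total clusters in the ensemble
```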
Semi-supervised consensus clustering
To turn a consensus clustering algorithm into a semi-supervised consensus clustering algorithm, prior knowledge can be applied in the base clustering, the consensus function, or the final clustering; the final clustering is usually applied to the consensus matrix generated from the base clusterings. SSCC uses the semi-supervised clustering algorithm SSC for base clustering and does not use prior knowledge in either the consensus function or the final clustering. Our experiment was performed using h-fold cross-validation: the dataset was split into training and testing sets, the prior knowledge was added to the h-1 training folds, and the final clustering result was evaluated on the testing set alone. The influence of the prior knowledge could thus be assessed in a cross-validation framework.
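A sketch of this protocol, assuming the prior knowledge takes the form of must-links derived from the known class labels of the training folds (here all same-class training pairs are enumerated; in practice one would typically sample a fixed number of them):

```python
import itertools
import numpy as np
from sklearn.model_selection import KFold

def folds_with_constraints(y, h=5, seed=0):
    """Yield (train_idx, test_idx, must_links) for each of the h folds."""
    y = np.asarray(y)
    kf = KFold(n_splits=h, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(y):
        must_links = [(i, j)
                      for i, j in itertools.combinations(train_idx, 2)
                      if y[i] == y[j]]           # same class -> must-link
        yield train_idx, test_idx, must_links
```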
Our semi-supervised consensus clustering algorithm is described in Algorithm 2. Similar to [4], for a given $n \times d$ dataset of $n$ samples and $d$ genes, an $n \times q$ data subspace ($q < d$) is generated by

$q = q_{\min} + \lfloor \alpha \,(q_{\max} - q_{\min}) \rfloor \qquad (1)$

where $\alpha \in [0, 1]$ is a uniform random variable, and $q_{\min}$ and $q_{\max}$ are the lower and upper bounds of the subspace size, set to $0.75d$ and $0.85d$. Let $\Pi$ be a cluster ensemble with $m$ clustering solutions. SSC is applied to each subspace dataset with a fixed number of clusters $k$, and each result is one clustering solution.
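A sketch of this subspace generation follows; drawing the q genes uniformly without replacement is our assumption, since the text only specifies how q itself is chosen.

```python
import numpy as np

def random_subspace(X, rng=None):
    """Return an n x q subspace of the n x d matrix X, with q from Eq. (1)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    q_min, q_max = int(0.75 * d), int(0.85 * d)  # bounds used in our setting
    alpha = rng.uniform()                        # alpha ~ U[0, 1]
    q = q_min + int(np.floor(alpha * (q_max - q_min)))   # Eq. (1)
    genes = rng.choice(d, size=q, replace=False) # pick q of the d genes
    return X[:, genes]
```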
A basic cluster-association matrix $BM$ is first generated from the crisp associations between samples and clusters using HBGF; it covers the $n$ samples and the $g = m \times k$ clusters of the ensemble. If $x_i$ belongs to a cluster $C_j$, then $BM(x_i, C_j) = 1$, $i = 1, \ldots, n$, $j = 1, \ldots, g$; otherwise $BM(x_i, C_j) = 0$. Next, a refined cluster-association matrix $RM$ is generated from $BM$ by estimating new association values $RM(x_i, C_j)$ wherever $BM(x_i, C_j) = 0$; $RM(x_i, C_j)$ is the similarity between $C_j$ and the other clusters to which $x_i$ probably belongs. The similarity between any two clusters in the ensemble is obtained from a weighted graph of clusters. Finally, spectral clustering is applied to $RM$ to obtain the final clustering solution.
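The sketch below illustrates the BM-to-RM refinement; the `cluster_similarity` matrix stands in for the weighted cluster graph of LCE [4], and taking the maximum similarity over the clusters that contain $x_i$ is a simplification of LCE's exact refinement, used here only to convey the idea.

```python
import numpy as np

def refine_association(BM, cluster_similarity):
    """Fill zero entries of BM with similarities to clusters x_i does belong to.

    BM: n x g crisp matrix from HBGF; cluster_similarity: g x g matrix of
    cluster-to-cluster similarities (from the weighted cluster graph [4])."""
    RM = BM.astype(float).copy()
    for i in range(BM.shape[0]):
        members = np.flatnonzero(BM[i] == 1)     # clusters containing x_i
        for j in np.flatnonzero(BM[i] == 0):
            # similarity between C_j and clusters x_i belongs to (simplified)
            RM[i, j] = cluster_similarity[j, members].max()
    return RM
```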