### Datasets

Two datasets were used to test the five algorithms: GDS1620 and a metabolic pathway dataset for *A. thaliana*. The former was downloaded from GEO [16], and the latter from [17]. Because both gene expression datasets are from *A. thaliana*, the results obtained on them are comparable.

The GDS1620 dataset describes the effect of abiotic stress-inducing agents on suspension cell cultures. It contains expression profiles of 22810 probe sets under 37 conditions. Bioconductor [18] and R [19] were used to pre-process GDS1620; this included nonspecific filtering and the removal of control probe sets and duplicated probe sets. After pre-processing, only 3881 probe sets and 16 conditions remained.

The metabolic pathway dataset contains expression profiles of 734 genes under 69 conditions.

### Selected algorithms

Five biclustering algorithms (BIMAX, FABIA, ISA, QUBIC and SAMBA) were chosen for comparison; implementations of all five were available from the original publications. BIMAX, ISA and SAMBA have been used or referenced frequently in previous studies. In contrast, FABIA and QUBIC are relatively new methods, so comparisons that include them are particularly informative.

### Gene ontology weighted enrichment score

For real transcriptome datasets, the most meaningful evaluation of biclusters is biological interpretation.

For each identified bicluster, we used the Cytoscape plugin BiNGO [20] to perform GO enrichment analysis in the biological process namespace. Hypergeometric tests were used for statistical analysis, and the Benjamini-Hochberg False Discovery Rate (FDR) procedure [21] was used for multiple testing correction. We selected 0.05 as the significance level.
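The correction step can be illustrated with a minimal sketch of the standard Benjamini-Hochberg step-up adjustment. This is not BiNGO's own code; the function name `benjamini_hochberg` and the raw p-values are hypothetical and serve only to show how adjusted p-values are compared with the 0.05 level.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five GO terms tested on one bicluster.
raw = [0.001, 0.008, 0.039, 0.041, 0.20]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
```

In this toy case the third and fourth terms have raw p-values below 0.05 but no longer pass once the correction is applied.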

The *p-value* is the probability that at least *x* genes from a bicluster of size *X* are annotated to a particular GO term, given *P*, the proportion of genes in the whole genome annotated to that GO term. The *p-value* can be evaluated using the following hypergeometric function [22],

$$
p\text{-value}=1-\sum_{i=0}^{x-1}\frac{\binom{PN}{i}\binom{N-PN}{X-i}}{\binom{N}{X}} \tag{1}
$$

where *N* is the total number of genes in the whole genome. The closer the *p-value* is to zero, the more significant the association of the particular GO term with the group of genes.
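As an illustration, the hypergeometric upper tail can be computed directly with exact binomial coefficients. The sketch below is not the BiNGO implementation; the function name `hypergeom_pvalue`, the argument `n_annotated` (standing for the product *PN*) and the toy numbers are assumptions made for the example. It sums the probabilities of observing *x* or more annotated genes, which equals one minus the probability of observing fewer than *x*, and avoids the loss of precision that the subtraction causes for very small p-values.

```python
from math import comb


def hypergeom_pvalue(x, X, n_annotated, N):
    """Probability that at least x of the X bicluster genes carry the GO term.

    N           -- number of genes in the whole genome
    n_annotated -- genome genes annotated to the term (P * N)
    X           -- number of genes in the bicluster
    x           -- bicluster genes annotated to the term
    """
    total = comb(N, X)
    # Sum the upper tail i = x .. min(X, n_annotated).
    # math.comb returns 0 when k > n, so impossible draws contribute nothing.
    upper = sum(
        comb(n_annotated, i) * comb(N - n_annotated, X - i)
        for i in range(x, min(X, n_annotated) + 1)
    )
    return upper / total


# Toy example: genome of 10 genes, 4 annotated; bicluster of 3 genes, 2 annotated.
p = hypergeom_pvalue(x=2, X=3, n_annotated=4, N=10)
```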

For all GO terms significantly associated with a bicluster, we took the negative logarithm of each term's *p-value* as the enrichment score of that term, and then used the weighted mean of these scores as the enrichment score of the bicluster. A GO term associated with more genes does not necessarily have a higher enrichment score, but it accounts for a larger proportion of the genes in the bicluster. We therefore consider that such a term contributes more to the enrichment score of the bicluster, and set the weight of each GO term to $x_i/X$, where $x_i$ is the number of genes in the bicluster significantly annotated to the *i-th* GO term and *X* is the total number of genes in the bicluster. The genes of a bicluster fall into three parts: (1) genes enriched to a GO term; (2) genes that have not been annotated; and (3) genes that have been annotated but are not enriched to any GO term. The WE score of the bicluster is therefore described as:

$$
\begin{aligned}
\text{WE-score} &= \frac{s_1 x_1/X + s_2 x_2/X + \cdots + s_n x_n/X + \mathit{non}\cdot 0/X}{x_1/X + x_2/X + \cdots + x_n/X + \mathit{non}/X}\\
&= \frac{x_1 s_1 + x_2 s_2 + \cdots + x_n s_n}{x_1 + x_2 + \cdots + x_n + \mathit{non}}
\end{aligned} \tag{2}
$$

$$
s_i = -\log(p_i) \tag{3}
$$

where $p_i$ is the *p-value* of the *i-th* GO term; *n* is the number of GO terms in which the genes of this bicluster are significantly enriched; and *non* is the number of genes that are annotated but not significantly enriched in any GO term. Because the factor 1/*X* cancels between the numerator and the denominator of Eq. 2, the WE score does not depend on *X*, i.e. it is not affected by genes that lack annotation. The higher the WE score, the more biologically significant the bicluster.
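The simplified form of Eq. 2 can be sketched in a few lines. The function name `we_score`, the input layout and the toy values are hypothetical. The text does not state the base of the logarithm, so base 10 is assumed here. Returning 0 for a bicluster without any annotated gene is also an assumption of this sketch.

```python
from math import log10


def we_score(terms, non):
    """Weighted enrichment score of one bicluster (simplified form of Eq. 2).

    terms -- list of (x_i, p_i): genes annotated to the i-th significant
             GO term and the p-value of that term
    non   -- annotated genes not significantly enriched in any GO term
    """
    # Numerator: each term's score s_i = -log(p_i), weighted by x_i.
    numerator = sum(x_i * -log10(p_i) for x_i, p_i in terms)
    denominator = sum(x_i for x_i, _ in terms) + non
    # Assumption: score 0 when the bicluster has no annotated gene at all.
    return numerator / denominator if denominator else 0.0


# Toy bicluster: two significant GO terms and five annotated but unenriched genes.
score = we_score([(10, 1e-4), (5, 1e-2)], non=5)
```

Genes without annotation never enter the calculation, which matches the independence from *X* noted above.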

### Protein-protein interaction score

Interactions between proteins provide a basis for most biological processes in an organism [23], and hence the networks formed by interacting proteins provide a crucial platform for analyzing the physical and functional associations in various biological processes. In this study, we used protein-protein interaction (PPI) networks to assess the quality of the biclusters, as genes that show similar expression patterns may participate in the same interaction network. To compare the biclusters from different algorithms, we proposed a PPI scoring method.

In this work, we retrieved the PPI data of *Arabidopsis thaliana* for local use from the STRING database (http://string-db.org/) [24], which integrates and weights information from numerous sources, including conserved neighborhood, gene fusions, phylogenetic co-occurrence, co-expression, database imports (e.g. MINT, HPRD, BIND, DIP, BioGRID, KEGG and Reactome), large-scale experiments and literature co-occurrence [25]. Interactions from these data sources are benchmarked and scored against a common reference, namely the joint membership of proteins in biological pathways as annotated at KEGG [26]. Scores higher than 0.7 are considered high confidence, and the confidence increases when methods are combined [25]. We considered only interactions between two genes with combined scores higher than 0.7.
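The thresholding step might look like the following sketch. It assumes a whitespace-separated interaction table with a header row `protein1 protein2 combined_score` and scores stored as integers on a 0 to 1000 scale; this layout should be checked against the actual downloaded file. The function name `high_confidence_edges` and the gene identifiers are hypothetical.

```python
def high_confidence_edges(lines, threshold=0.7, scale=1000):
    """Collect undirected interactions whose combined score exceeds the threshold."""
    edges = set()
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines and the assumed header row.
        if len(fields) < 3 or fields[0] == "protein1":
            continue
        a, b = fields[0], fields[1]
        score = float(fields[-1]) / scale
        if a != b and score > threshold:
            # frozenset makes (a, b) and (b, a) the same interaction.
            edges.add(frozenset((a, b)))
    return edges


# Hypothetical rows in the assumed layout.
sample = [
    "protein1 protein2 combined_score",
    "geneA geneB 910",
    "geneB geneA 910",
    "geneA geneC 450",
    "geneC geneD 701",
]
edges = high_confidence_edges(sample)
```

Only the pairs geneA/geneB and geneC/geneD pass the 0.7 cut in this example, and the duplicated geneA/geneB row is counted once.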

The PPI score of a bicluster is calculated by the following expression,

$$
\text{PPI-score}=\frac{I}{N-M} \tag{4}
$$

where *I* is the number of genes that interact with other genes in the same bicluster, *N* is here the total number of genes in the bicluster, and *M* is the number of genes in the bicluster that have not been found to interact with any gene according to all data in the STRING database.
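Eq. 4 can be sketched as follows. The function name `ppi_score`, the argument names and the toy data are hypothetical. The sketch assumes that the high-confidence interactions are supplied as pairs of gene identifiers and that the set of genes with any recorded STRING interaction is supplied separately. Returning 0 when no gene of the bicluster appears in STRING is an assumption, since Eq. 4 is undefined in that case.

```python
def ppi_score(bicluster, high_conf_edges, genes_in_string):
    """PPI score of one bicluster: I / (N - M), following Eq. 4.

    bicluster       -- iterable of gene identifiers in the bicluster
    high_conf_edges -- iterable of (gene_a, gene_b) pairs passing the score cut
    genes_in_string -- set of genes with at least one interaction in STRING
    """
    genes = set(bicluster)
    # I: genes with at least one high-confidence partner inside the bicluster.
    interacting = set()
    for a, b in high_conf_edges:
        if a != b and a in genes and b in genes:
            interacting.update((a, b))
    n_total = len(genes)                       # N
    m_absent = len(genes - genes_in_string)    # M
    denominator = n_total - m_absent
    # Assumption: score 0 when no gene of the bicluster is covered by STRING.
    return len(interacting) / denominator if denominator else 0.0


# Toy example: g5 has no STRING record; g4 interacts only outside the bicluster.
bicluster = ["g1", "g2", "g3", "g4", "g5"]
edge_list = [("g1", "g2"), ("g2", "g3"), ("g4", "g9")]
covered = {"g1", "g2", "g3", "g4", "g9"}
score_ppi = ppi_score(bicluster, edge_list, covered)
```

Here *I* = 3, *N* = 5 and *M* = 1, so the score is 3/4.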