This article has Open Peer Review reports available.
A novel biclustering algorithm of binary microarray data: BiBinCons and BiBinAlter
 Haifa Ben Saber^{1} and
 Mourad Elloumi^{1, 2}
Received: 4 January 2015
Accepted: 8 November 2015
Published: 30 November 2015
Abstract
Biclustering of microarray data has been the subject of extensive research, yet none of the existing biclustering algorithms is perfect. Constructing biologically significant groups of biclusters from large microarray data remains a problem that requires continued work. Biological validation of biclusters of microarray data is one of the most important open issues: so far, the literature offers no general guidelines on how to validate extracted biclusters biologically. In this paper, we develop two biclustering algorithms for binary microarray data, called BiBinCons and BiBinAlter, both adopting the Iterative Row and Column Clustering Combination (IRCCC) approach. BiBinAlter is an improvement of BiBinCons: it uses the EvalStab and IndHomog evaluation functions in addition to the CroBin one (Bioinformatics 20:1993–2003, 2004). BiBinAlter can extract biclusters of good quality with better p-values.
Keywords
Introduction
DNA microarray technology is a revolutionary tool that enables the measurement of the expression levels of thousands of genes in a single experiment under diverse experimental conditions. This technology yields large amounts of raw data that can provide a wealth of information on the genes concerned, and it has proved to be a valuable tool for many biological and medical applications. Microarray data analysis is a crucial step for these applications, since it extracts the pertinent biological knowledge embedded in these large masses of data. However, extracting this knowledge is far from trivial, hence the need for data mining techniques. Many such techniques have been applied to microarray data, among them clustering [1]. When clustering, we assume that all the genes of a group behave similarly under all the conditions. However, some genes behave similarly only under a subset of conditions, and clustering is too simplistic to detect such cases [1]. A more interesting technique, called biclustering [2], makes it possible to identify groups of genes that behave similarly only under a subset of conditions.
In this paper, we develop new biclustering algorithms for microarray data. These data are usually coded by a data matrix M(I,J), where the i^{th} row, i∈I={1,2,…,n}, represents the i^{th} gene, the j^{th} column, j∈J={1,2,…,m}, represents the j^{th} condition, and the cell M[i,j] represents the expression level of the i^{th} gene under the j^{th} condition.
The main objective is then to identify groups of genes that are coherent under groups of conditions; these groups are called biclusters. Genes belonging to the same bicluster have close biological functions. Note that, in its general form, the biclustering problem is NP-hard [2].
The rest of this paper is organized as follows. In the second section, we introduce some preliminaries. In the third section, we present the BiBinCons algorithm. In the fourth section, we present the BiBinAlter algorithm. In the fifth section, we present an illustrative example and an experimental study. Finally, we present the conclusion of this paper.
Preliminaries
where z={z_{1},z_{2},…,z_{g}} is defined as a partition of I into g clusters, i.e. z_{i} is the cluster number of the i^{th} row of M_{b}(I,J), and w={w_{1},w_{2},…,w_{h}} is defined as a partition of J into h clusters, i.e. w_{j} is the cluster number of the j^{th} column of M_{b}(I,J).
where z_{ik}=1 if the i^{th} row of M_{b}(I,J) belongs to the k^{th} cluster of I, and z_{ik}=0 otherwise; likewise, w_{jl}=1 if the j^{th} column of M_{b}(I,J) belongs to the l^{th} cluster of J, and w_{jl}=0 otherwise.

If w is fixed, the minimization is given by:$$ W(z,a|w)=\sum\limits_{i,k,l}z_{ik}\left|u_{il}-(w_{l}\times a_{kl})\right| $$(2.4)
where \(u_{il}=\sum_{j\in w_{l}}m_{ij}^{b}=\sum_{j}w_{jl}m_{ij}^{b}\), \(\sum\limits_{i,j,k,l} z_{ik}w_{jl}\left|m_{ij}^{b}-a_{kl}\right| =\sum\limits_{i,k}z_{ik}\sum\limits_{j,l}w_{jl}\left|m_{ij}^{b}-a_{kl}\right| =\sum\limits_{i,k}z_{ik}\sum\limits_{l}\left|u_{il}-(w_{l}\times a_{kl})\right|\), and u is a matrix of size |I|×h. In the product w_{l}×a_{kl}, w_{l} denotes the number of columns in the l^{th} cluster of J; the last equality holds because all values are binary.

If z is fixed, the minimization is given by:$$ W(w,a|z)=\sum\limits_{j,k,l}w_{jl}\left|v_{kj}-(z_{k}\times a_{kl})\right| $$(2.5)
where \(v_{kj}=\sum_{i\in z_{k}}m_{ij}^{b}=\sum_{i}z_{ik}m_{ij}^{b}\), \(\sum\limits_{i,j,k,l}z_{ik}w_{jl}\left|m_{ij}^{b}-a_{kl}\right| =\sum\limits_{j,l}w_{jl}\sum\limits_{i,k}z_{ik}\left|m_{ij}^{b}-a_{kl}\right| =\sum\limits_{j,l}w_{jl}\sum\limits_{k}\left|v_{kj}-(z_{k}\times a_{kl})\right|\), and v is a matrix of size g×|J|. In the product z_{k}×a_{kl}, z_{k} denotes the number of rows in the k^{th} cluster of I.
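The two reduced forms of the criterion can be checked numerically. The following is a minimal sketch, assuming 0-based cluster labels and NumPy arrays; the function names (`indicator`, `criterion_full`, `criterion_rows`, `criterion_cols`) and the small matrix are illustrative and are not taken from the paper. It evaluates the absolute-difference criterion cell by cell, then through the row sums u (column partition fixed) and through the column sums v (row partition fixed).

```python
import numpy as np

def indicator(labels, n_clusters):
    """Indicator matrix: entry (i, k) is 1 iff item i carries label k (0-based)."""
    out = np.zeros((len(labels), n_clusters), dtype=int)
    out[np.arange(len(labels)), labels] = 1
    return out

def criterion_full(M, Z, Wc, A):
    """Sum over all cells of |m_ij - a_kl|, where (k, l) is the block of cell (i, j)."""
    a_cell = Z @ A @ Wc.T          # a_cell[i, j] = summary value of the block of (i, j)
    return int(np.abs(M - a_cell).sum())

def criterion_rows(M, Z, Wc, A):
    """Same criterion via u_il = sum_j w_jl m_ij (column partition fixed)."""
    U = M @ Wc                     # shape (n, h)
    col_sizes = Wc.sum(axis=0)     # number of columns in each column cluster
    target = (Z @ A) * col_sizes   # target[i, l] = size_l * a_{k(i), l}
    return int(np.abs(U - target).sum())

def criterion_cols(M, Z, Wc, A):
    """Same criterion via v_kj = sum_i z_ik m_ij (row partition fixed)."""
    V = Z.T @ M                    # shape (g, m)
    row_sizes = Z.sum(axis=0)      # number of rows in each row cluster
    target = (A @ Wc.T) * row_sizes[:, None]   # target[k, j] = size_k * a_{k, l(j)}
    return int(np.abs(V - target).sum())

# Hypothetical 3 x 3 binary matrix with 2 row clusters and 2 column clusters.
M = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])
Z = indicator(np.array([0, 0, 1]), 2)
Wc = indicator(np.array([0, 0, 1]), 2)
A = np.array([[1, 0],
              [0, 1]])
print(criterion_full(M, Z, Wc, A), criterion_rows(M, Z, Wc, A), criterion_cols(M, Z, Wc, A))
```

The three values agree only because the data and the summary matrix are binary; for real-valued data the reduced forms would be lower bounds rather than equalities.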
Remark.
A colored block in the binary matrix M_{b}(I,J) is represented by a colored cell in the summary matrix A, where each colored cell contains the majority binary value of the corresponding block; e.g., if the majority of the cells in a block of M_{b}(I,J) contain 1, then the corresponding cell in A also contains 1.
Example.
This example shows a binary data matrix M_{b}(I,J) and the corresponding summary matrix A.
A=(0,0;1,1;1,1), i.e., a_{11}=0,a_{12}=0;a_{21}=1, a_{22}=1;a_{31}=1,a_{32}=1.
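The majority rule of the remark can be sketched as follows. This is an illustration only: the function name `summary_matrix`, the 0-based labels, the tie-breaking rule (ties and empty blocks give 0) and the 4×4 matrix are assumptions, chosen so that the result has the same shape and values as the summary matrix A of the example.

```python
import numpy as np

def summary_matrix(M, z, w, g, h):
    """Majority binary value of each (row cluster, column cluster) block."""
    M, z, w = np.asarray(M), np.asarray(z), np.asarray(w)
    A = np.zeros((g, h), dtype=int)
    for k in range(g):
        for l in range(h):
            block = M[np.ix_(z == k, w == l)]
            # Assumption: ties and empty blocks are summarized by 0.
            if block.size and 2 * block.sum() > block.size:
                A[k, l] = 1
    return A

# Hypothetical binary matrix: 4 genes, 4 conditions, 3 row clusters, 2 column clusters.
M_b = [[0, 0, 0, 1],
       [1, 1, 1, 0],
       [1, 0, 1, 1],
       [1, 1, 1, 1]]
z = [0, 1, 1, 2]
w = [0, 0, 1, 1]
print(summary_matrix(M_b, z, w, g=3, h=2))
```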
In the next two sections, we develop two IRCCC algorithms for biclustering binary microarray data, called BiBinCons and BiBinAlter, respectively.
First IRCCC Algorithm: BiBinCons
Our biclustering algorithm, BiBinCons, receives as input a binary matrix M_{b}(I,J) and gives as output (z_{opt},w_{opt},A_{opt}), where z_{opt} and w_{opt} are respectively the final clusterings of the rows and columns of M_{b}(I,J), and A_{opt} is the summary matrix related to z_{opt} and w_{opt}. To describe BiBinCons more formally, we use the following notations:
z_{0} : initial clustering of rows of M_{b}(I,J)
w_{0} : initial clustering of columns of M_{b}(I,J)
A_{0} : initial summary matrix related to z_{0} and w_{0}
z_{c} : current clustering of rows of M_{b}(I,J)
w_{c} : current clustering of columns of M_{b}(I,J)
\(A_{c}^{'}\) : current intermediate summary matrix related to z_{c} and w_{c−1}
A_{c} : current summary matrix related to z_{c} and w_{c}
z_{opt} : final clustering of rows of M_{b}(I,J)
w_{opt} : final clustering of columns of M_{b}(I,J)
A_{opt} : final summary matrix related to z_{opt} and w_{opt}
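To make the roles of these quantities concrete, here is a minimal sketch of a generic CroBin-style alternating loop: rows are reassigned with the column clustering fixed, an intermediate summary matrix is computed, then columns are reassigned with the row clustering fixed. It is not the authors' exact BiBinCons procedure; the stopping rule (no change in either clustering), the tie-breaking rule, the 0-based labels and all function names are assumptions.

```python
import numpy as np

def majority_summary(M, z, w, g, h):
    """Majority value per block; ties and empty blocks give 0 (assumption)."""
    A = np.zeros((g, h), dtype=int)
    for k in range(g):
        for l in range(h):
            block = M[np.ix_(z == k, w == l)]
            if block.size and 2 * block.sum() > block.size:
                A[k, l] = 1
    return A

def alternate_rows_columns(M, z0, w0, g, h, max_iter=50):
    """Alternate row and column reassignments until both clusterings are stable."""
    M = np.asarray(M)
    z, w = np.array(z0), np.array(w0)
    A = majority_summary(M, z, w, g, h)                 # plays the role of A_0
    for _ in range(max_iter):
        # Row step (w fixed): row i goes to the k minimizing sum_l |u_il - size_l * a_kl|.
        Wc = np.eye(h, dtype=int)[w]                    # m x h indicator matrix
        U = M @ Wc                                      # u_il, shape (n, h)
        col_sizes = Wc.sum(axis=0)
        cost = np.abs(U[:, None, :] - (A * col_sizes)[None, :, :]).sum(axis=2)
        z_new = cost.argmin(axis=1)
        A_mid = majority_summary(M, z_new, w, g, h)     # intermediate summary A'_c
        # Column step (z fixed): column j goes to the l minimizing sum_k |v_kj - size_k * a_kl|.
        Z = np.eye(g, dtype=int)[z_new]                 # n x g indicator matrix
        V = Z.T @ M                                     # v_kj, shape (g, m)
        row_sizes = Z.sum(axis=0)
        target = (A_mid * row_sizes[:, None]).T         # shape (h, g)
        cost = np.abs(V.T[:, None, :] - target[None, :, :]).sum(axis=2)
        w_new = cost.argmin(axis=1)
        A_new = majority_summary(M, z_new, w_new, g, h)
        if np.array_equal(z_new, z) and np.array_equal(w_new, w):
            break                                       # nothing changed: stop
        z, w, A = z_new, w_new, A_new
    return z, w, A

# Hypothetical input: two clean blocks and a slightly wrong initial row clustering.
M_b = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
z_opt, w_opt, A_opt = alternate_rows_columns(M_b, [0, 0, 0, 1], [0, 0, 1, 1], g=2, h=2)
print(z_opt, w_opt, A_opt, sep="\n")
```

On this toy input the misplaced third row moves to the second row cluster after one pass. Like any alternating scheme, the loop only reaches a local optimum that depends on the initial clusterings.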
Second IRCCC Algorithm: BiBinAlter
Our biclustering algorithm, BiBinAlter, receives as input a binary matrix M_{b}(I,J) and gives as output (z_{opt},w_{opt},A_{opt}), where z_{opt} and w_{opt} are respectively the final clusterings of the rows and columns of M_{b}(I,J), and A_{opt} is the summary matrix related to z_{opt} and w_{opt}. BiBinAlter additionally relies on two evaluation functions, EvalStab and IndHomog:
EvalStab_{ c } represents the frequency of 0’s in the current group of biclusters at the c^{ t h } iteration. It is defined as follows:
To describe BiBinAlter more formally, we use the same notations as for the previous algorithm, together with the following ones:
(EvalStab_{c},IndHomog_{c}): couple giving the frequency of 0’s in the current group of biclusters at the c^{th} iteration and the tradeoff between the number of mixed biclusters (containing both 0’s and 1’s) and the total number of biclusters at the c^{th} iteration.
(EvalStab_{c−1},IndHomog_{c−1}): couple giving the frequency of 0’s in the group of biclusters at the (c−1)^{th} iteration and the tradeoff between the number of mixed biclusters (containing both 0’s and 1’s) and the total number of biclusters at the (c−1)^{th} iteration.
\(\left (EvalStab_{(c-1)}^{'},IndHomog_{(c-1)}^{'}\right)\): couple giving the frequency of 0’s in the group of biclusters at the intermediate (c−1)^{′th} iteration and the tradeoff between the number of mixed biclusters (containing both 0’s and 1’s) and the total number of biclusters at the intermediate (c−1)^{′th} iteration.
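The exact formulas of EvalStab and IndHomog are not reproduced in this text, so the sketch below shows only one plausible reading of the verbal descriptions and should not be taken as the authors' definitions. It assumes that the frequency of 0's is measured over the cells of the blocks summarized by 1, and that the tradeoff is the ratio of mixed blocks to non-empty blocks; the names `eval_stab` and `ind_homog` are hypothetical.

```python
import numpy as np

def eval_stab(M, z, w, A):
    """Assumed reading: fraction of 0 cells inside the blocks whose summary value is 1."""
    M, z, w, A = np.asarray(M), np.asarray(z), np.asarray(w), np.asarray(A)
    kept = A[z][:, w].astype(bool)      # kept[i, j]: cell (i, j) lies in a block summarized by 1
    n_kept = int(kept.sum())
    return float((M[kept] == 0).sum()) / n_kept if n_kept else 0.0

def ind_homog(M, z, w, g, h):
    """Assumed reading: number of mixed blocks divided by the number of non-empty blocks."""
    M, z, w = np.asarray(M), np.asarray(z), np.asarray(w)
    mixed = total = 0
    for k in range(g):
        for l in range(h):
            block = M[np.ix_(z == k, w == l)]
            if block.size == 0:
                continue
            total += 1
            if 0 < block.sum() < block.size:   # block contains both 0's and 1's
                mixed += 1
    return mixed / total if total else 0.0

# Hypothetical data: one mixed block, two all-zero blocks, one all-one block.
M_b = [[1, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 1]]
z, w = [0, 0, 1], [0, 0, 1, 1]
A = [[1, 0], [0, 1]]
couple = (eval_stab(M_b, z, w, A), ind_homog(M_b, z, w, g=2, h=2))
print(couple)
```

Under this reading, comparing the couple obtained at an intermediate step with the couple obtained at the next iteration is a simple tuple equality test.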
Illustrative example
Let’s apply the BiBinAlter algorithm on the following binary matrix M_{ b }(I,J):

Initialization step
First, we initialize the row and column clusterings using the initialization step of the BiMax algorithm of Prelić et al. [3], and we compute (z_{0},w_{0},A_{0}). We obtain:
z_{0}=(1,2,2,3), w_{0}=(1,1,0,0,0), A_{0}=(1,0;1,1;0,1)

Biclustering step:
Iteration 1: c=1
Since we have
Iteration 2: c =2
Iteration 3: c =3
Iteration 4: c =4
Iteration 5: c =5
We have \((EvalStab_{4}^{'},IndHomog_{4}^{'}) = (EvalStab_{5},IndHomog_{5})\) and \((z_{5},w_{5},A_{5}) = (z_{4},w_{3},A_{4}^{'})\).
Then, we obtain (z_{ opt },w_{ opt },A_{ opt })=(z_{5},w_{5},A_{5}). Biclusters that contain only 0’s will not be considered because they represent genes that are not expressed under the related conditions. Finally, (z_{ opt },w_{ opt },A_{ opt }) can be represented in M_{ b }(I,J) as follows:
Results for synthetic datasets
In this section, we present an experimental study to evaluate the performance of our biclustering algorithms. We compare their results to those obtained by a selection of known algorithms from the literature. We conducted experiments on synthetic and real microarray datasets. The idea behind testing on synthetic datasets is to investigate the ability of our algorithms to extract different types of biclusters. On real datasets, we seek to assess how well our algorithms satisfy statistical and biological criteria.
Synthetic microarray datasets and comparison criteria

(a) Overlapping biclusters (overlapping rate =5 % (well separated), 15 % (fairly separated) and 25 % (poorly separated)).

(b) Different data sizes (matrix size =50×30 (small), 100×60 (medium) and 200×120 (large)).
where
S_{ cb } is the volume of correctly extracted biclusters, Tot_{ size } is the total volume of implemented biclusters and S_{ NCB } is the volume of not correctly extracted biclusters.
The Shared index (resp. NotShared) represents the percentage of correctly (resp. not correctly) extracted biclusters with respect to all implemented biclusters in the data matrix. Indeed, when the Shared value is equal to 100 %, the algorithm extracts all the implemented biclusters. When the value of NotShared is 0 %, the algorithm extracts no cell outside the implemented biclusters.
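The two indices can be sketched as follows. The formulas themselves are missing from this text, so the sketch assumes Shared = 100 × S_{cb}/Tot_{size} and NotShared = 100 × S_{NCB}/Tot_{size}, with volumes counted in cells and each bicluster given as a (row set, column set) pair; the function names are illustrative.

```python
from itertools import product

def covered_cells(biclusters):
    """Set of (row, column) cells covered by a list of (rows, columns) biclusters."""
    cells = set()
    for rows, cols in biclusters:
        cells.update(product(rows, cols))
    return cells

def shared_not_shared(implemented, extracted):
    """Assumed formulas: both indices are percentages of the implemented volume."""
    truth = covered_cells(implemented)
    found = covered_cells(extracted)
    tot_size = len(truth)
    shared = 100.0 * len(truth & found) / tot_size       # correctly extracted volume
    not_shared = 100.0 * len(found - truth) / tot_size   # extracted volume outside the truth
    return shared, not_shared

# Hypothetical case: one 2 x 2 implemented bicluster, extracted with one extra row.
implemented = [({0, 1}, {0, 1})]
extracted = [({0, 1, 2}, {0, 1})]
print(shared_not_shared(implemented, extracted))
```

With this reading the two indices need not sum to 100 %, which is consistent with the values reported in the tables of this section.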
Experimental protocol
Corresponding parameters values of our algorithms
Algorithms  Corresponding parameters values 

BiBinCons  minrow = 2, mincol = 2 
BiBinAlter  minrow = 2, mincol = 2 
Values of Shared and NotShared for non overlapping biclusters
Algorithms  Shared  NotShared 

CC  18.21 %  36.57 % 
OPSM  46.39 %  74.42 % 
ISA  39.38 %  5.31 % 
BiMax  58.18 %  21.39 % 
BiBinCons  88 %  12 % 
BiBinAlter  100 %  37.03 % 
Values of Shared and NotShared for overlapping biclusters
Algorithms  Shared  NotShared 

CC  13.21 %  36.57 % 
OPSM  82.02 %  50.51 % 
ISA  29.28 %  7.31 % 
BiMax  48.18 %  22.39 % 
BiBinCons  87.30 %  61 % 
BiBinAlter  89.40 %  57.32 % 
Number of biclusters obtained by our algorithms on real datasets
Algorithms  Yeast cell cycle  Human B-cell Lymphoma 

EnumLat  883  1921 
DecBinBicluster  708  1720 
BiBinCons  529  1900 
BiBinAlter  881  1769 
RefineBicluster  708  1700 
Results of our algorithms on real datasets
In this section, we evaluate our algorithms on real microarray datasets.
Real microarray datasets
We have used two real microarray datasets. The Yeast cell cycle dataset, described and preprocessed in [1], contains the expression levels of 2884 genes under 17 conditions. The Human B-cell Lymphoma dataset, described by Alizadeh et al. [1], contains 4026 genes and 96 conditions. Both datasets are frequently used in the literature to evaluate biclustering algorithms.
Experimental protocol
The first set of experiments concerns statistical validation: we calculate the coverage for the Yeast cell cycle and Human B-cell Lymphoma datasets, and the adjusted p-value for the Human B-cell Lymphoma dataset. The second set of experiments was applied to the Yeast cell cycle dataset in order to study the biological significance of the extracted biclusters.
Statistical validation
In order to validate our algorithms statistically on these real datasets, we evaluate the performance of BiBinCons and BiBinAlter by calculating the total number of cells covered by the extracted biclusters. To do this, we proceeded as in [2], and we compared the results of our algorithms to those reported in [2]. In the literature, the coverage test was performed on the Yeast cell cycle and Human B-cell Lymphoma datasets. This test is not applied to the RefineBicluster algorithm because it is only a refinement algorithm.
Values of Coverage for Yeast cell cycle and Human B-cell Lymphoma datasets
Datasets  Algorithms  Total coverage  Genes coverage  Conditions coverage 

Yeast cell cycle  CC  81.47 %  97.12 %  100 % 
Yeast cell cycle  BiBinCons  39.14 %  44.5 %  100 % 
Yeast cell cycle  BiBinAlter  47 %  48.03 %  100 % 
Human B-cell Lymphoma  CC  36.81 %  91.58 %  100 % 
Human B-cell Lymphoma  BiBinCons  34.14 %  37.51 %  100 % 
Human B-cell Lymphoma  BiBinAlter  41 %  46.13 %  100 % 
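The three coverage figures reported above can be computed from a list of extracted biclusters as sketched below. This is a minimal illustration, assuming each bicluster is a (gene set, condition set) pair and that overlapping cells are counted once; the function name `coverage` and the toy input are not from the paper.

```python
def coverage(biclusters, n_genes, n_conditions):
    """Return (total, genes, conditions) coverage as percentages."""
    genes, conditions, cells = set(), set(), set()
    for rows, cols in biclusters:
        genes.update(rows)
        conditions.update(cols)
        # A cell shared by several biclusters is counted only once.
        cells.update((i, j) for i in rows for j in cols)
    total = 100.0 * len(cells) / (n_genes * n_conditions)
    return total, 100.0 * len(genes) / n_genes, 100.0 * len(conditions) / n_conditions

# Hypothetical example: two overlapping biclusters in a 4 x 3 matrix.
found = [({0, 1}, {0, 1}), ({1, 2}, {1, 2})]
total_cov, gene_cov, cond_cov = coverage(found, n_genes=4, n_conditions=3)
print(round(total_cov, 2), gene_cov, cond_cov)
```

As the table shows, conditions coverage can reach 100 % while total coverage stays well below it, because a condition counts as covered as soon as one bicluster contains it.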
Biological validation
The most important terms of GO for the two most significant extracted biclusters from Yeast cell cycle dataset by BiBinCons and BiBinAlter
Biclusters  Biological process  Molecular function  Cell component 

12 genes, 13 conditions  Cellular response to DNA damage stimulus (66.7 %, 1.87 × 10^{-8}); response to DNA damage stimulus (66.7 %, 6.30 × 10^{-8}); cellular response to stress (66.7 %, 2.12 × 10^{-7}); cellular response to stimulus (66.7 %, 3.25 × 10^{-7}); DNA repair (50 %, 2.58 × 10^{-5}); response to stress (66.7 %, 2.98 × 10^{-5})  Chromatin binding (25 %, 0.00037)  Microtubule organizing center part (16.7 %, 0.00742) 
11 genes, 11 conditions  Cell cycle process (63.6 %, 2.93 × 10^{-5}); cell cycle (63.6 %, 6.85 × 10^{-5})  GTPase activator activity (18.2 %, 0.00994)  Microtubule cytoskeleton (45.5 %, 6.33 × 10^{-6}); microtubule organizing center (36.4 %, 4.97 × 10^{-5}); spindle pole body (36.4 %, 4.97 × 10^{-5}); spindle pole (36.4 %, 6.77 × 10^{-5}) 
Computing time
Computing time of our algorithms
Datasets  BiBinCons  BiBinAlter 

Yeast Cell Cycle  32 min  37 min 12 sec 
Saccharomyces Cerevisiae  8 min  8 min 3 sec 
Conclusion
In this paper, we have developed two biclustering algorithms for binary microarray data, called BiBinCons and BiBinAlter, both adopting the Iterative Row and Column Clustering Combination (IRCCC) approach. BiBinAlter is an improvement of BiBinCons: it uses the EvalStab and IndHomog evaluation functions in addition to the CroBin one [1]. BiBinAlter can extract biclusters of good quality with better p-values. We have presented an experimental study of our biclustering algorithms and compared their results to those obtained by a selection of known biclustering algorithms, on both synthetic and real microarray datasets. For both synthetic and real datasets, our biclustering algorithm BiBinAlter outperforms the other algorithms, followed by our other biclustering algorithm, BiBinCons.
Declarations
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Govaert G. La classification croisee. Modulad. 1983.
 Law NF, Siu WC, Cheng KO, Alan WC. Identification of coherent patterns in gene expression data using an efficient biclustering algorithm and parallel coordinate visualization. BMC Bioinformatics. 2008.
 Prelić A, Bleuler S, Zimmermann P, Wille A, Bühlmann P, Gruissem W, et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics. 2006; 22:1122–29.
 Ihmels J, Bergmann S, Barkai N. Defining transcription modules using large-scale gene expression data. Bioinformatics. 2004; 20(13):1993–2003.
 Benny C, Richard K, Amir BD, Yakhini Z. Discovering local structure in gene expression data: The order-preserving submatrix problem. In: Proceedings of the Sixth Annual International Conference on Computational Biology, RECOMB ’02. New York, NY, USA: ACM; 2002. p. 49–57.
 Santamaria R, Khamiakova T, Sill M, Theron R, Quintales L, Kaiser S, et al. biclust: Bicluster algorithms. R package. 2011.