Biclustering of Gene Expression Data by Correlation-Based Scatter Search

Background The analysis of data generated by microarray technology is very useful to understand how the genetic information becomes functional gene products. Biclustering algorithms can determine a group of genes which are co-expressed under a set of experimental conditions. Recently, new biclustering methods based on metaheuristics have been proposed. Most of them use the Mean Squared Residue as merit function but interesting and relevant patterns from a biological point of view such as shifting and scaling patterns may not be detected using this measure. However, it is important to discover this type of patterns since commonly the genes can present a similar behavior although their expression levels vary in different ranges or magnitudes. Methods Scatter Search is an evolutionary technique that is based on the evolution of a small set of solutions which are chosen according to quality and diversity criteria. This paper presents a Scatter Search with the aim of finding biclusters from gene expression data. In this algorithm the proposed fitness function is based on the linear correlation among genes to detect shifting and scaling patterns from genes and an improvement method is included in order to select just positively correlated genes. Results The proposed algorithm has been tested with three real data sets such as Yeast Cell Cycle dataset, human B-cells lymphoma dataset and Yeast Stress dataset, finding a remarkable number of biclusters with shifting and scaling patterns. In addition, the performance of the proposed method and fitness function are compared to that of CC, OPSM, ISA, BiMax, xMotifs and Samba using Gene the Ontology Database.


Background
DNA microarray technology measures the gene expression level of thousand of genes under multiple experimental conditions [1]. After several preprocessing steps wellknown as low level microarray analysis a microarray can be represented as a numerical matrix where rows correspond to different genes and columns to experimental conditions. The row vector of a gene is called the expression pattern of the gene and a column vector is called the expression profile of the condition. High level microarray analysis uses data mining techniques in order to analyze the huge volume of all this biological information [2]. In this field, an important problem is to discover transcription factors which determine that a group of genes are co-expressed. Thus, the goal of Biclustering techniques is to discover groups of genes with the same behavior under a specific group of conditions. Biclustering was considered in the seventies and it was proven to be a NP-hard problem [3]. It can be also found in the literature with other names such as Subspace Clustering [4] or Co-clustering [5]. Several algorithms have been proposed and recently published reviews can be found in [6][7][8]. In the context of microarray analysis, biclustering was firstly considered by Cheng and Church in 2000. Cheng and Church (CC) algorithm [9] is a greedy iterative search method and consists in building a bicluster adding or removing rows or columns iteratively, thus, improving its quality which is measured with the Mean Squared Residue (MSR). The MSR is based on the sum of the squared residues which measure how adequate each expression value is, in comparison with the rest of values of the bicluster (see [9] for more details). The FLOC algorithm [10] improved the method presented in [9] by obtaining a set of biclusters simultaneously and by incorporating the processing of missing values. In [11] an iterative hierarchical clustering is separately applied to each dimension and biclusters are built by means of the combination of the obtained results for each dimension. In the ISA algorithm [12] a simple linear model for gene expression is used assuming a normally distributed expression level for each gene or condition in a specific way. SAMBA [13] executes an exhaustive biclusters enumeration by means of a bipartite graphbased model and later a greedy approach adds or removes nodes in order to find maximum weight subgraphs. In [14] an exhaustive biclusters enumeration algorithm is proposed with the OPSM algorithm. Spectral biclustering [15] uses techniques from linear algebra, concretely eigenvectors calculus, to identify bicluster structures from the input data. The Plaid Model [16] is a statistical modelling approach which represents the input matrix as a superposition of layers where each layer corresponds to a bicluster. The BiMax algorithm [17] discretizes the data set by using binary values and it is recursively applied until a submatrix with only one value is detected. Geometrical characterization of biclusters are used for discovering patterns [18,19]. These techniques use image processing in order to search for hyperplanes which represent biclusters. There is a group of biclustering algorithms based on metaheuristics such as evolutionary approaches [20,21], multiobjective evolutionary approaches [22,23], Simulated Annealing [24], Particle Swarm Optimization [25], greedy randomized adaptive search [26], Estimation of Distribution Algorithms [27] or Memetics Algorithms [28]. All these algorithms used the MSR as a part of their fitness function. Although the MSR is commonly used as quality criterion, some interesting patterns from a biological point of view might not be detected with such measure. The MSR is effective for recognizing biclusters with shifting patterns but not some patterns with scaling trends, in spite of representing quality patterns. A group of genes has a shifting pattern when the expression values vary in the addition of a fixed value for all the genes. A group of genes has a scaling pattern when the expression values vary in the multiplication of a fixed value for all the genes. Aguilar-Ruiz [29] proved that the MSR is not a good measure in order to discover patterns in data when the variance of gene values is high, that is, when the genes present scaling patterns.
Other algorithms are designed to work with time series gene expression data. In this kind of data, the biclusters can be restricted to those with contiguous columns. This constraint becomes the biclustering problem in a tractable problem. CCC-Biclustering [30] finds coherent biclusters with maximal contiguous columns in linear time. First, the algorithm discretizes the matrix and then it works with string processing techniques based on suffix trees. This algorithm is not robust as regards errors in gene values due to the discretization process, the microarray experiment, etc. e-CCC-Biclustering [31] is a robust extension of the CCC-Biclustering where approximate expression patterns such as scaling patterns can be found and several measures to compute the committed errors in these patterns are proposed.
The gene expression level under a set of conditions can be seen as values of a discrete random variable. Thus, the linear dependency between two genes can be studied by using the correlation coefficient between two random variables. In this paper, this fact has motivated the use of the proposed measure based on correlations among genes [32,33]. Several correlation-based measures have been proposed in [4,34,35]. In [4] biclusters are characterized as hyperplanes in a high dimensional space using the definition of correlation and, therefore, the problem is transformed into the search for groups of points embedded in hyperplanes. In [34] the correlation coefficient is used for forming biclusters with a greedy algorithm. In [35] an enumeration algorithm based on a tree structure for biclustering is presented and it uses an evaluation function based on the Spearman's rank correlation.
A Scatter Search algorithm for biclustering is presented in this paper. Scatter Search is a population-based method that emphasizes systematic processes against random procedures. Thus, the generation of the initial population is not random but a generation method based on diversification [36] is used to generate a set of diverse initial solutions. Moreover, Scatter Search includes an improvement method with the aim of exploiting the diversity provided by the generation and combination method. The linear correlation among genes is included in the fitness function to evaluate the quality of biclusters in the Scatter Search, which improves the localization of shifting and scaling patterns.

Description of the algorithm
Scatter Search [36] is a population-based optimization metaheuristic which has recently been applied to combinatorial and nonlinear optimization problems. Optimization algorithms based on populations are search procedures where a set of individuals that represent trial solutions evolve in order to find optimal solutions of the problem. Scatter Search uses strategies to diversify and intensify the search in order to avoid local minima and to find quality solutions and, on the opposite to other evolutionary heuristics, it emphasizes systematic processes against random procedures.
Basically, the optimization process consists in the evolution of a set called Reference Set. This set is initially built with the best solutions from the population, according to the value of their fitness function, and the most scattered ones from the population regarding the previous best solutions. This set is updated by using the Combination Method and the Improvement Method until it does not change. When the Reference Set is stable, that is, after applying the combination and improvement methods it contains the same solutions that the reference set at the previous iteration, then it is rebuilt again. That is, the building of the Reference Set is based on quality and diversity, but its updating is only guided by quality. Thus, diversity is introduced in the evolutionary process when the initial population is generated and, mainly, when the reference set is rebuilt at each step. The search intensification is due to the improvement method where the solutions are improved by exploiting the knowledge of the problem.
The pseudocode of the proposed Scatter Search for biclustering is presented in Algorithm 1. The Scatter Search process is repeated numBi times where numBi is the number of biclusters to be found and the best solution of the reference set is stored in a set called Results for each iteration. Thus, the Results set is formed by numBi biclusters and it is the output of the Algorithm 1. The Scatter Search mainly consists in a diversification generation method to generate the initial population, a combination method to create new offspring and an improvement method to intensify the search. All theses steps of the Scatter Search are detailed as follows.
Algorithm 1 Scatter Search Algorithm for Biclustering INPUT microarray M , number of biclusters to be found numBi, maximum number of iterations numIter, size of the initial population and size S of the reference set.
OUTPUT Set Results with numBi biclusters. begin num 0, Results θ while (num <numBi) do Initialize population P P Improvement Method (P ) //Building Reference Set R 1 S/2 best biclusters from P (according to the fitness function) R 2 S/2 most scattered biclusters, regarding R 1 , from P \ R 1 (according to a distance). RefSet

Results
the best one from RefSet num num + 1 end while end

Initialization phase
Formally, a microarray is a real matrix M composed of N genes and L conditions. The element (i, j) of the matrix means the level of expression of gene i under the condition j. A bicluster B is a submatrix of the matrix M composed of n ≤ N rows or genes and l ≤ L columns or conditions. Biclusters are encoded by binary strings of length N + L. Each of the first N bits of the binary string is related to the genes and the remaining L bits to the conditions. For example, the string 0010110000|01100 represents a bicluster of a microarray with ten genes, {g i } 1≤i≤10 , and five conditions, {c j } 1≤j≤5 . This string encodes the bicluster composed by the genes g 3 , g 5 and g 6 and the conditions c 2 and c 3 .
The initial population is generated with solutions as diverse as possible. Thus, the diversification generation method [36] takes a binary string, x i with i = 1, . . . , n where n is the number of bits, as a seed solution and generates solutions ′ x i by following the rule: where ⌊n/h⌋ is the largest integer less or equal than n/h and h is an integer less than n/5. All the remaining bits of x′ are equal to that of x.
After generating all the posible solutions with that seed, if more solutions are needed, the diversification generation method is applied again by using the last solution as new seed.

Biclusters Evaluation: Fitness Function
The most difficult biclusters to find are those that present jointly shifting and scaling patterns. The aim of this work is to discover this type of biclusters. Two genes show a shifting and scaling pattern if they are described from (2).
Consequently, two genes with shifting and scaling patterns are linearly dependent and therefore a measure based on correlations can be a good fitness function to find biclusters with these patterns.
The correlation coefficient between two variables X and Y measures the grade of linear dependency between them. It is defined by: where cov(X, Y ) is the covariance of the variables X and Y, x and y are the mean of the values of the variables X and Y and s X and s Y are the standard deviations of X and Y , respectively. Given a bicluster B composed by N genes, B = [g 1 , . . . , g N ], the average correlation of B, r(B), is defined as follows, where r(g i , g j ) is the correlation coefficient between the gene i and the gene j. Due to r(g i , g j ) = r(g j , g i ), therefore, only N 2 ⎛ ⎝ ⎜ ⎞ ⎠ ⎟ elements have been considered. Figure 1 presents a bicluster with lowly-correlated genes and a bicluster with highlycorrelated genes. It can be observed that the bicluster with perfect shifting and scaling patterns has an average correlation of 1 while that the bicluster without patterns has an average correlation close to 0 (concretely 0.003).
In this work, biclusters with highly-correlated genes and high volume are preferred. Therefore, the fitness function used to evaluate the quality of biclusters is defined by: where nG and nC are the number of genes and conditions of the bicluster B, respectively, M 1 and M 2 are penalty factors to control the volume of the bicluster B, and s r is the standard deviation of the values r(g i , g j ) from (4). The standard deviation is included in order to avoid that the value of the average correlation can be high for a bicluster and this bicluster can contain several non-correlated genes with the remaining ones of the bicluster. Best biclusters are those with the lowest value for the fitness function. Thus, it has been considered (1 − r(B)) to identify biclusters with highly-correlated genes.
Moreover, this measure is robust to noise since genes showing noise but with shifting patterns can present a high correlation although scaling patterns are not involved.

Improvement Method
Scatter Search uses improvement methods when the solutions have to fulfill some constraints or simply to improve them in order to intensify the search process. This method depends on the problem under study and usually it consists in classical local searches for continuous optimization problems.
The goal of this work is to find biclusters with shifting and scaling patterns. Thus, biclusters with positively-correlated genes are only searched for. Therefore, the proposed improvement method aims at extracting positively-correlated genes either from biclusters of the initial population or from biclusters obtained by the combination method. The pseudocode of the improvement method is presented in the Algorithm 2. Figure 2 presents a bicluster composed by four genes: three highly-correlated genes and a gene negatively correlated with the remaining. The average correlation for this bicluster is equal to 0.0083 and after applying the improvement it is equal to 1. Thus, the volume of biclusters is decreased by removing the negatively-correlated genes but the average correlation of the new bicluster will be greater than that of the original bicluster when the improvement method is applied.

Building of the reference set
The reference set is initially built with the best solutions, according to the value of their fitness function, and the most scattered ones from the initial population regarding the previous best solutions. The Hamming distance is used to measure the distance among biclusters in this work. After getting the stability of the reference set in the updating process, it is rebuilt to introduce diversity in the search process. Thus, the reference set is rebuilt with the best biclusters from the updated reference set, according to the fitness function, and the most distant solutions from the initial population regarding the previously chosen best solutions.
The initial population has to be updated too in the evolutionary process by removing solutions which have already been considered in the building or rebuilding of the reference set. When the initial population is empty, a new population is created by using the diversification generation method previously explained in Section 2.1.

Combination method and reference set updating
New solutions are introduced in the search process by the combination method. Two solutions are combined by using an uniform crossover operator and a new one is generated. All pairs of biclusters in the reference set are combined, generating thus, S * (S -1)/2 new biclusters where S is the size of the reference set. This crossover operator generates randomly a mask and the child is composed of values from the first parent when there is a 1 in the mask, and from the second parent when there is a 0.
After combining all pairs of biclusters, the best solutions from the joining of the previous reference set and the new solutions are chosen. Hence, best solutions according to the value of their fitness function remain in the reference set.

Experiments
The proposed algorithm has been applied to three real data sets in order to study its performance. The first data set (Yeast) is the yeast Saccharomyces cerevisiae cell cycle expression, with 2884 genes and 17 experimental conditions presented by Cho [37]. The second one (Lymphoma) is the human B-cells lymphoma expression data with 4026 genes and 96 conditions [38]. These two data sets are available in [9] where original data were processed. The third data set (GaschYeast) is the yeast Saccharomyces cerevisiae Stress conditions expression provided by Gasch [39] with 2993 genes and 173 conditions. This data set was used in [17] where can be downloaded as supplementary data.
The inner parameters of the Algorithm 1 are as follows: 20 for the maximum number of iterations of the Scatter Search, 10 for the size of the reference set, 200 for the number of solutions of the initial population and 100 for the number of biclusters to be found for each run.   Figure 3 and 4 present the four biclusters for Yeast and Lymphoma data set, respectively, which are reported in Table 1. Figure 5  biclusters from GaschYeast data set. The biclusters bi1-GaschYeastN1, bi1-GaschYeastN10, bi1-GaschYeastN11 and bi1-GaschYeastN25 in Figure 5 have been obtained for values M 1 = 1 and M 2 = 1 and the biclusters bi2-GaschYeastN1, bi2-GaschYeastN4, bi2-GaschYeastN9 and bi2-GaschYeastN27 in Figure 6 for values M 1 = 10 and M 2 = 10. It can be noted that the greater the penalty values are, the greater the volume of the obtained biclusters is. The motivation for taking the values of the parameters M 1 = M 2 = 1 is to find biclusters with a low number of genes in order to show visually the shifting and scaling patterns. However, the main goal is to find groups of genes sharing the same GO terms, therefore, it is more adequate to search for biclusters with a high number of genes. Thus, parameters M 1 = M 2 = 10 have been considered to achieve biclusters with a higher volume.

Discussion
Four biclusters represented in Figure 3 have a high value for the average correlation (see Table 1). It can be observed that shifting and scaling patterns can clearly be appreciated in all of them. Most of papers which use the MSR as the fitness function consider that a bicluster is good for Yeast data set if its MSR is less than 300 [9,21]. This value is data set dependent because it depends on the mean and the standard deviation of the values of the expression matrix. From this point of view, biYeastN15, biYeastN21 and biYeastN24 are good biclusters but biYeastN40 is not a good bicluster due to its high value for the average correlation. Note that the value for the gene variance in biYeastN40 is high compared to that of other biclusters.  Four biclusters represented in Figure 4 for Lymphoma data set show a group of genes with similar behavior and high values for the average correlation (see Table 1). However, these biclusters are not considered good for some authors because the MSR are higher than 1200 [9,21]. This value is data set dependent too as it was previously commented for Yeast data set. Note that the bicluster with lowest value for MSR is biLymphomaN54 (1292.6), and however this bicluster has the lowest value for the average correlation (0.82). It was proved in [29] that MSR is not precise enough in order to discover shifting and scaling patterns. Biclusters with these patterns with high values for the gene variance are not detected by the algorithms that use the MSR as fitness function. Results reported for theses cases make evident this situation. Figure 5 and 6 show biclusters for GaschYeast data set. It can be observed the effect of the penalty parameters on the number of genes and conditions. The higher values for M 1 and M 2 , the higher volume is. From a geometrical point of view all the results present genes with similar behavior. For example, scaling patterns can be very clearly observed in bi1-GaschYeastN25 since the shape of genes between conditions 6 and 8 is the same although all genes increase their expression levels with different intensity. Moreover, the value for the average correlation shows biclusters from GaschYeast with highly-correlated genes. The values for MSR and gene variance vary in a different range of values to the other two data set due to the previous preprocessing.
It should be noted that all biclusters present shifting and scaling patterns, and therefore, a high value for the average correlation. Moreover, the standard deviation is low, that is, the correlation coefficients of each pair of genes have similar values and close values to the average correlation of the bicluster. Therefore, all biclusters with a high average correlation found by the proposed Scatter Search do not contain non-correlated genes.

Comparative analysis
The performance of the Algorithm 1 has been compared with biclustering methods such as CC [9], OPSM [14], ISA [40], BiMax [17], xMotifs [41] and Samba [13] for the GaschYeast data set and CCC-Biclustering [30] for Yeast data set. Also, random biclusters have been generated as naive reference method. Following the methodology in [17] the performance of all algorithms is evaluated biologically with the percentage of biclusters enriched by any Gene Ontology Consortium (GO) category at different levels of significance. GO [39] is used to investigate if a group of genes belonging to a bicluster presents significant enrichment with respect to a specific GO term. There are different tools to analyze GO term enrichment. The AGO [42] tool recently published has been used to study the percentage of significant biclusters obtained by the different algorithms. The enrichment of each group of genes with respect to a specific GO term is established by the p-value. A bicluster is said to be overrepresented in a functional category if its p-value is small. The comparison criterion among several algorithms is the percentage of overrepresented biclusters in one or more GO annotation. Figure 7 represents the percentage of enrichment biclusters for each method in which one or several GO terms are overrepresented for different levels of significance the proportion of biclusters significantly enriched by any GO Biological Process category for SScorr11 and SScorr1010 is over 30%, for CC is over 21%, for OPSM over 17%, for BiMax 2% and 0% for the rest. It can be observed that SScorr1010 improves the results of the rest of the methods for small levels of significance except to the CC when p = 0.001 (for instance, see the most restrictive level of significance p = 0.001 for the p-value). However, both Scatter Search algorithms obtained a percentage of significant biclusters greater than CC for p = 0.005 and p = 0.01 and the CC presents a percentage of significant biclusters greater than SScorr11 when p-value ranges from p = 0.05 to p = 5. This is due to the volume of the biclusters since it is easier to find functional enrichment from large groups of genes than from small groups. Table 2 presents information about the size of biclusters obtained by the different methods. Note that biclusters obtained by the CC algorithm have more genes that biclusters for obtained by the algorithms based on Scatter Search. SScorr1010 finds biclusters with more genes than SScorr11 and therefore it improves the results of CC for all levels of significance from p = 0.005 to p = 5. The rest of methods find a less percentage of biclusters   with the p-values specified than the proposed method although OPSM presents good results for high levels of significance (p >0.05).
The comparison is not an easy task because the number of biclusters, their size or what kind of patterns are found are very different for each method. Table 2 presents the number of biclusters for each method, the average and standard deviation for the number of genes and the number of conditions (standard deviation in brackets) and the volume. The standard deviation shows how the size of different biclusters obtained by each method varies. For example CC, OPSM and SAMBA find biclusters with a high number of genes and biclusters with only a reduced group of genes due to the high standard deviation of the number of genes. In order to establish a more restrictive criterion, Figure 7 can be reformulated with the concept of bicluster overrepresented in a GO term as follows. The percentage of enriched biclusters is reported after filtering the biclusters which have less than ten genes in each GO category or have a study fraction less than 50%, that is, there is not more than half of the genes in the bicluster that share the same function in the category [42]. Figure 8 presents the comparison for all methods with this new definition of enriched bicluster. Note that SScorr11 and SScorr1010 obtain the best biclusters for small values of p-value (from p = 0.001 to p = 0.01) but not for high values from p = 0.05 to p = 5 where OPSM and Samba present the best results. It can be noted that there is not any significant bicluster for ISA and BiMax algorithms. Figure 9 presents a comparison between algorithms based on Scatter Search (SScorr11 and SScorr1010 with values M 1 = M 2 = 1 and M 1 = M 1 = 10 respectively) and CCC-Biclustering [30] for Yeast data set. Biclusters for the CCC-Biclustering have been obtained using the tool BiGGEsTS [43] and results for all biclusters (CCC-Bi) and for only 100 selected biclusters have been reported. It can be observed that SScorr11 and SScorr1010 obtain a percentage of enriched biclusters greater than that of the CCC-Bi. This fact was expected as SScorr11 and SScorr1010  Figure 8 Comparison of different Biclustering algorithms with hard definition of enrichment bicluster. Comparison of different Biclustering methods from GaschYeast data set with a restrictive definition of enriched biclusters: percentage of enriched biclusters by GO Biological Process category for each method at different significance levels (from p = 0.001% to p = 5%) by setting the allowed minimum number of genes per each GO category to 10 and the study fraction greater than 50%. obtain 100 biclusters and the CCC-Bi obtains 14412 due to this exhaustive method discovers all maximal biclusters in time series gene expression data. Note that if only 100 biclusters of CCC-Bi are considered (the 100 first biclusters that the algorithm reports) the CCC-Bi obtains better results than SScorr11 and SScorr1010 for levels of significance from 0.001 to 0.01 and similar results for p ≥ 0.05.

Biological study
AGO tool [42] has been used to study if a bicluster is composed by a group of genes overrepresented in some GO Biological Processes. The biological study has been focused on the biclusters shown in Figure 3, Figure 4 and Figure 5 whose information is reported in Table 1.
The analysis of biYeastN15 identifies the GO process (GO:0006412) with p-value 5.81e − 006. This process is known as translation and it is related with the process in which amino acids are built in the ribosome using mRNA. In this bicluster, 5 genes of the 7 genes are related with this process.
Other example is the bicluster biLymphomaN1 which identifies the GO term (GO:0016887) known as ATPase activity with p-value 0.0069. This process is related with Catalysis of the reaction of ATP molecules.
The study for biclusters from GaschYeast data set discovers the process (GO:0006412) called translation for most of the 8 biclusters reported. However, other interesting results are obtained as for example in the bicluster bi1GaschYeastN11 where the GO process (GO:0042254) called ribosome biogenesis and assembly, with pvalue 1.38e − 007, which is related with the formation of ribosomas and the transport to the sites of protein synthesis has been found.

Conclusions
In this paper, a Scatter Search for finding biclusters from gene expression data has been presented. The proposed Scatter Search has used as merit function to evaluate biclusters a measure based on correlations among genes, with the aim of obtaining biclusters with shifting and scaling patterns. Moreover an improvement method, which consist in removing negatively-correlated genes from biclusters, has been incorporated to intensify the search. Percentage of Enriched Biclusters % Yeast p=0.001% p=0.005% p=0.01% p=0.05% p=0.1% p=0.5% p=1% p=5% Figure 9 Comparison of the propsed algorithm with CCC-Bi algorithm. Comparison between the proposed algorithm based on Scatter Search and CCC-Biclustering [30] for Yeast data set: percentage of enriched biclusters by GO Biological Process category for each method at different significance levels. (pvalues from p = 0.001% to p = 5%).
The proposed algorithm has been tested with three real data sets: a data set related to yeast cell cycle (Yeast), a data set related to a collection of different stress conditions over Yeast (GaschYeast) and another one related to human B-cells lymphoma (Lymphoma). A group of biclusters composed by genes with shifting and scaling patterns has been discovered some of which can not be detected using the MSR as it was proved in [29]. A comparison using Gene Ontology with other six methods has been presented and a good performance of the proposed algorithm has been observed.
Future works will focus on some improvements for the proposed algorithm with regard to the overlapping among genes and to the fitness function.