TRIQ: a new method to evaluate triclusters

Gutiérrez-Avilés, David; Giráldez, Raúl; Gil-Cumbreras, Francisco Javier; Rubio-Escudero, Cristina

doi:10.1186/s13040-018-0177-5

Research
Open access
Published: 06 August 2018

David Gutiérrez-Avilés ORCID: orcid.org/0000-0002-6292-9873¹,
Raúl Giráldez¹,
Francisco Javier Gil-Cumbreras¹ &
…
Cristina Rubio-Escudero²

BioData Mining volume 11, Article number: 15 (2018)

Abstract

Background

Triclustering has proven to be a valuable tool for the analysis of microarray data since its appearance as an improvement on classical clustering and biclustering techniques. The standard for validation of triclustering is based on three different measures: correlation, graphic similarity of the patterns, and functional annotations for the genes extracted from the Gene Ontology project (GO).

Results

We propose TRIQ, a single evaluation measure that combines the three measures previously described: correlation, graphic validation and functional annotation. It provides a single value as the result of validating a tricluster solution and therefore simplifies the comparison and selection of solutions. TRIQ has been applied to three datasets already studied and evaluated with single measures based on correlation, graphic similarity and GO terms. Triclusters have been extracted from these three datasets using two different algorithms: TriGen and OPTricluster.

Conclusions

TRIQ has successfully provided the same results as the three single evaluation measures. Furthermore, we have applied TRIQ to results from another algorithm, OPTricluster, and we have shown that TRIQ is a valid tool to compare results from different algorithms in a quantitative and straightforward manner. Therefore, it appears to be a valid measure to represent and summarize the quality of tricluster solutions. It is also feasible for the evaluation of non-biological triclusters, due to the parametrization of each component of TRIQ.

Background

Analysis of data structured in a 3D manner is becoming an essential task in fields such as biomedical research, for instance in experiments studying gene expression data taking time into account. There is a lot of interest in this type of longitudinal experiment because it allows an in-depth analysis of molecular processes in which the time evolution is important, for example, cell cycles, development at the molecular level or evolution of diseases [1]. Therefore, specific tools become necessary for data analysis in which genes are evaluated under certain conditions considering the time factor. In this sense, triclustering [2] appears as a valuable tool since it allows for the assessment of genes under a subset of the conditions of the experiment and under a subset of time points.

The evaluation of solutions obtained by triclustering algorithms is challenging because there is no ground truth to describe the triclusters present in real 3D data. In the literature, the standard measures to evaluate tricluster solutions are based on three areas, as can be seen in the triclustering publications [3–7]. First, correlation measures such as Pearson [8] or Spearman [9]. Second, graphic validation of the extracted patterns, i.e., how similar the genes from a tricluster are based on the graphic representation of the genes across conditions and time points. Third, functional annotations extracted from the Gene Ontology project (GO) [10] for the genes in the tricluster.

However, we consider that providing a single evaluation measure capable of combining the information from the three aforementioned sources of validation is a necessary task. Therefore, in this work we propose TRIQ, a validation measure which combines the three previously proposed validation mechanisms (correlation, graphic validation and functional annotation of the genes).

The application of clustering and biclustering techniques to gene expression data has been broadly studied in the literature [11, 12]. Although triclustering is the result of the natural evolution of clustering and biclustering techniques, it is still a very recent concept. Nevertheless, these techniques are now attracting great interest from the scientific community, which has caused a notable increase in the number of studies focused on finding new triclustering approaches. This section provides a general overview of the triclustering approaches published in the literature. We particularly focus on the validation methods applied to assess the quality of the triclusters obtained.

In 2005, Zhao and Zaki [3] introduced the triCluster algorithm to extract patterns in 3D gene expression data. They presented a measure to assess tricluster quality based on the symmetry property. They validated their triclusters based on their graphical representation and Gene Ontology (GO) results. g-triCluster, an extended and generalized version of Zhao and Zaki’s proposal, was published one year later [4]. The authors claimed that the symmetry property is not suitable for all patterns present in biological data and proposed the Spearman rank correlation [9] as a more appropriate tricluster evaluation measure. They also showed validation results based on GO.

An evolutionary computation proposal was made in [13]. The fitness function defined is a multi-objective measure which tries to optimize three conflicting objectives: cluster size, homogeneity and gene-dimension variance of the 3D cluster. The tricluster quality validation was based on GO. LagMiner was introduced in [6] to find time-lagged 3D clusters, which makes it possible to find regulatory relationships among genes. It is based on a novel 3D cluster model called S₂D₃ Cluster. They evaluated their triclusters on homogeneity, regulation, minimum gene number, sample subspace size and time period length. Their validation was based on graphical representation and GO results. Hu et al. presented an approach focusing on the concept of Low-Variance 3-Cluster [5], which obeys the constraint of a low-variance distribution of cell values. This proposal uses a different functional enrichment tool called CLEAN [14], which uses GO as one of its components. The work in [7] was focused on finding Temporal Dependency Association Rules, which relate patterns of behavior among genes. The rules obtained are used to represent regulated relations among genes. They also validated their triclusters based on their graphical representation and GO results.

Tchagang et al. [15] proposed OPTricluster, a triclustering algorithm for 3D short time series gene expression datasets that applies a statistical methodology. In this case, the authors carried out an in-depth biological validation based on GO, and they tested the robustness of OPTricluster to noise using the Adjusted Rand Index (ARI) [16], which was also used by the aforementioned g-triCluster.

In 2013, two new approaches were proposed. The first is the δ−TRIMAX algorithm [17], which applies a variant of the mean squared residue (MSR) adapted to 3D datasets and yields triclusters that have an MSR score below a threshold δ. This algorithm has a version based on evolutionary multi-objective optimization, named EMOA−δ−TRIMAX [18], which aims at optimizing the use of δ−TRIMAX by adding the capabilities of evolutionary algorithms to retrieve overlapping triclusters. The second is OAC-Triclustering, proposed by Gnatyshak et al. in [19]. In the following years, the authors developed improvements and extensions of this algorithm [20–22].

More recent works have extended the capabilities of triclustering algorithms by combining several approaches. Liu et al. [23] mixed fuzzy clustering and fuzzy biclustering algorithms in order to extend them to support 3D data, and they used the F-Measure and Entropy as criteria to evaluate the performance. Kakati et al. [24] combined parallel biclustering and distributed triclustering approaches to reduce the computational cost. In this work, the authors use a quality measure based on shifting and scaling patterns [25] to optimize the triclusters obtained.

Most of the methods studied base the quality of the triclusters on the graphic representation or on metrics aimed at measuring diverse characteristics of such a representation. From a biological point of view, the standard for validation of tricluster quality is based on GO functional annotations.

Methods

This section presents the TRIQ (TRIcluster Quality) validation measure [26], a novel method to evaluate the quality of triclusters extracted from gene expression datasets.

From an overall perspective, TRIQ takes into account the three principal components of a tricluster, i.e. the genes, experimental conditions and time points, in order to measure its quality from three approaches: the biological relevance of the cluster (biological quality), the graphic quality of the patterns of the genes in the tricluster (graphic quality), and the level of correlation of the genes in the tricluster by means of the Pearson [8] and the Spearman [9] indexes. Therefore, TRIQ is composed of a combination of four indexes: BIOQ (BIOlogical Quality), GRQ (GRaphic Quality), PEQ (PEarson Quality) and SPQ (SPearman Quality).

In Eq. 1 we define TRIQ as the weighted sum of each of the four aforementioned terms. Therefore, four associated weights must be defined: the weight for BIOQ, denoted as W_bio; the weight for GRQ, denoted as W_gr; the weight for PEQ, denoted as W_pe; and the weight for SPQ, denoted as W_sp.

$$ \begin{aligned} TRIQ(TRI) &= \frac{1}{W_{bio}+W_{gr}+W_{pe}+W_{sp}} * \left[ W_{bio}*BIOQ(TRI) \right.\\ &\left.+ W_{gr}*GRQ(TRI) + W_{pe}*PEQ(TRI) + W_{sp}*SPQ(TRI)\right] \\ \end{aligned} $$

(1)

This is a general definition of TRIQ. In order to obtain a TRIQ index as balanced as possible among the four quality indexes BIOQ, GRQ, PEQ, and SPQ, we performed an exhaustive testing procedure with well-known datasets. Several combinations of values of the weights of BIOQ, GRQ, PEQ, and SPQ were tested, and Fig. 1 shows the results obtained.

We see that the value of TRIQ is slightly and directly dependent on the weights related to correlation, PEQ and SPQ. This is because these values range in the [0, 1] interval and are usually high, from 0.7 to 1. The value of TRIQ depends more strongly on the graphical quality, GRQ, and shows a strong inverse dependence on the biological quality, BIOQ, because BIOQ takes low values, usually around 10⁻³ to 10⁻⁵. Based on these experiments, we have configured the TRIQ measure with the weights shown in Eq. 2 in order to obtain a balanced value of TRIQ.

$$ W_{bio} = 0.5, W_{gr} = 0.4, W_{pe} = 0.05, W_{sp} = 0.05 $$

(2)
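As an illustration of Eqs. 1 and 2, the weighted combination can be sketched in a few lines of Python. This is only a minimal sketch: the names `TriqWeights` and `triq` are hypothetical, and the index values in the example are made up rather than taken from any experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriqWeights:
    # Default values follow Eq. 2.
    bio: float = 0.5
    gr: float = 0.4
    pe: float = 0.05
    sp: float = 0.05


def triq(bioq: float, grq: float, peq: float, spq: float,
         w: TriqWeights = TriqWeights()) -> float:
    """Weighted sum of the four quality indexes, normalized as in Eq. 1."""
    total = w.bio + w.gr + w.pe + w.sp
    if total <= 0:
        raise ValueError("the weights must sum to a positive value")
    weighted = w.bio * bioq + w.gr * grq + w.pe * peq + w.sp * spq
    return weighted / total


# Made-up index values, for illustration only.
example = triq(bioq=0.2, grq=0.9, peq=0.8, spq=0.7)
```

Because Eq. 1 divides by the sum of the weights, any positive rescaling of the four weights gives the same TRIQ value.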

Next, we describe in depth each of the terms involved in the TRIQ measure.

Correlation measures: PEQ and SPQ

The correlation measures involved in TRIQ are Pearson’s PEQ [8] and Spearman’s SPQ [9] correlations. They have been chosen since they are the standard correlation measures and they are widely used in literature [4]. The correlation provides a numerical estimation of the dependence among the genes, conditions and times in the tricluster solutions.

Given a tricluster TRI, we compute PEQ and SPQ by the following mechanism. Given the subset of genes (see Eq. 3a), conditions (see Eq. 3b) and time stamps (see Eq. 3c), we obtain an expression value for each combination of gene, condition and time. For instance, for a tricluster consisting of four genes, two conditions and three time points, we have twenty-four expression values. We then compute the Pearson correlation for each pair of values, and compute PEQ as the average of the absolute values, so that negative and positive correlations do not cancel each other (see Eq. 4). For this measure, only the level of correlation matters, not whether it is positive or negative. The SPQ value is the equivalent using the Spearman correlation (see Eq. 5).

$$\begin{array}{@{}rcl@{}} TRI_{G} &=& <g_{0}, g_{1}, \ldots, g_{G}>\end{array} $$

(3a)

$$\begin{array}{@{}rcl@{}} TRI_{C} &=& <c_{0}, c_{1}, \ldots, c_{C}>\end{array} $$

(3b)

$$\begin{array}{@{}rcl@{}} TRI_{T} &=& <t_{0}, t_{1}, \ldots, t_{T}>\end{array} $$

(3c)

$$ PEQ(TRI) = \frac{\sum_{i=0,j=0}^{\#exp} \left|Pearson_{i \neq j}\left(exp_{i}, exp_{j}\right)\right|}{\#pairs\;of\;exp} $$

(4)

$$ SPQ(TRI) = \frac{\sum_{i=0,j=0}^{\#exp} \left|Spearman_{i \neq j}\left(exp_{i}, exp_{j}\right)\right|}{\#pairs\;of\;exp} $$

(5)

with exp representing the expressions in each tricluster TRI.
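The averaging in Eqs. 4 and 5 can be sketched in plain Python as follows. The sketch rests on an assumption that the text does not settle: each exp_i is treated as the expression profile of one (gene, condition) combination across the time points of the tricluster. The function names are hypothetical, and treating a constant profile as uncorrelated is a convention chosen here to avoid dividing by zero.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant profile counts as uncorrelated
    return cov / (sx * sy)


def ranks(values):
    """Average 1-based ranks, so that tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mean_abs_correlation(profiles, corr):
    """Average of |corr| over all pairs of distinct profiles."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("need at least two profiles")
    return sum(abs(corr(a, b)) for a, b in pairs) / len(pairs)


def peq(profiles):
    return mean_abs_correlation(profiles, pearson)


def spq(profiles):
    return mean_abs_correlation(profiles, spearman)


# Toy profiles: one list of expression values over time per (gene, condition).
profiles = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],  # perfectly correlated with the first profile
    [3.0, 2.0, 1.0],  # perfectly anti-correlated with the first profile
]
```

With these toy profiles every pairwise correlation is +1 or −1, so both `peq(profiles)` and `spq(profiles)` are 1 up to floating-point rounding: taking absolute values prevents the anti-correlated profile from lowering the score.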

Graphical validation: GRQ

The GRQ member of Eq. 1 measures the graphical quality of the tricluster. This graphical quality of a tricluster is a quantitative representation of a qualitative measure: how homogeneous the members of the tricluster are. This method is widely used in literature for visual validation of the results by means of graphically representing the triclusters on their three components: genes, conditions and time points [3, 6, 7].

The GRQ index is described in Eq. 6. This measure is defined based on the normalization of the angle value given by MSL. The Multi SLope (MSL) evaluation function was defined in [27] and, given a tricluster TRI, provides a numerical value of the similarity among the angles of the slopes formed by each profile shaped by the genes, conditions, and times of the tricluster.

$$ GRQ(TRI) = 1 - \frac{MSL(TRI)}{2\pi} $$

(6)
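The normalization in Eq. 6 can be sketched as below. It assumes that the MSL value is already available as an angle in radians within [0, 2π], which is what the 2π denominator suggests; the computation of MSL itself is defined in [27] and is not reproduced here. The function name `grq` is hypothetical.

```python
import math


def grq(msl: float) -> float:
    """Map an MSL angle (radians) to a graphic quality score, as in Eq. 6."""
    if not 0.0 <= msl <= 2 * math.pi:
        # Assumption: MSL lies in [0, 2*pi], as the normalization suggests.
        raise ValueError("MSL is assumed to lie in [0, 2*pi]")
    return 1.0 - msl / (2 * math.pi)
```

Under this assumption, an MSL of 0 gives the best score, 1, and an MSL of 2π gives the worst score, 0.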

The MSL measure considers the three graphical views of a tricluster, also defined in [27]: TRI_gct, TRI_gtc, and TRI_tgc. These three terms are generally denoted TRI_xop, where the expression levels of the tricluster are represented on the Y axis, x represents the tricluster component on the X axis (genes or time points), o represents the lines plotted in the graph (genes, conditions or time lines) and p represents the type of facets or panels (time points or conditions). Fig. 2 shows an example of the TRI_tgc view of a tricluster with the genes g₁, g₄, g₇ and g₁₀, the experimental conditions c₂, c₅ and c₈ and the time points t₀, t₂ and t₁₁. Each line, or gene, forms a set of angles (two in this particular example) defined by each time point on the X axis for every panel, or experimental condition. Thus, MSL measures the differences among the angles formed by every series traced on each of the three graphic representations, taking into account TRI_gct, TRI_gtc, and TRI_tgc. A value of MSL close to zero implies a better graphical quality of a tricluster. Therefore, according to the GRQ formulation in Eq. 6, the smaller the value of MSL, the higher the value of GRQ.

Biological validation: BIOQ

The BIOQ member of Eq. 1 measures the biological quality of the tricluster. Specifically, BIOQ uses the genes (TRI_G) of the input tricluster TRI to compute this index. As shown in Eq. 7, the biological quality of a tricluster TRI is defined as the biological significance, SIG_bio, of the set of genes TRI_G divided by the S_max value.

$$ BIOQ(TRI) = \frac{SIG_{bio}\left(TRI_{G}\right)}{S_{max}} $$

(7)

The SIG_bio and S_max elements of the BIOQ index have been designed in order to represent, by means of a quantitative score, the value of the Gene Ontology analysis of the genes that compose the measured tricluster.

The Gene Ontology Project (GO) [10] is a major bioinformatics initiative that aims to standardize the representation of gene and gene product attributes across species and databases. Besides identifying the annotated terms, a GO analysis performs a statistical analysis for the over-representation of those terms and provides a statistical significance p-value. However, it is also important to take into account how deep in the ontology the terms are annotated, with the deeper terms being more specific than the superficial ones [28]. The SIG_bio and S_max elements are calculated based on the GO analysis that identifies, for a set of genes in a tricluster, the terms listed in each of the three available ontologies: biological processes, cellular components, and molecular functions. This GO analysis is performed with the software Ontologizer [29].

The computation of SIG_bio consists of counting how many of the terms annotated to the genes of the tricluster in the GO analysis fall in particular p-value intervals. Table 1 presents the ad hoc system of p-value intervals and its scoring system. The intervals and the scoring system are defined in Eq. 8, where, for a given level l, Inter_l is defined by a weight value w_l for the level and by the lower and upper bounds (inf_l and sup_l, respectively) of an open-closed interval of p-values (Eq. 8a). The set of existing levels, LV, consists of all levels with inf_l smaller than or equal to a minimum p-value, th. For the interval Inter_l of each level, the weight value w_l is defined in Eq. 8c, inf_l is defined in Eq. 8d, and sup_l is defined in Eq. 8e.
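The counting step and the ratio of Eq. 7 can be sketched as follows. The interval bounds used here are hypothetical placeholders and are not the values of Table 1, and the sketch reads "open-closed" as (inf_l, sup_l], which is an assumption. The way the per-level counts and the weights w_l are combined into SIG_bio is left out because it is given by Eq. 8. The names `HYPOTHETICAL_LEVELS`, `count_terms_per_level` and `bioq` are illustrative only.

```python
from collections import Counter

# Hypothetical levels: (lower bound, upper bound) of each p-value interval.
# These are NOT the values of Table 1; they only illustrate the mechanism.
HYPOTHETICAL_LEVELS = [
    (0.0, 1e-4),
    (1e-4, 1e-3),
    (1e-3, 1e-2),
]


def count_terms_per_level(p_values, levels=HYPOTHETICAL_LEVELS):
    """Count GO terms whose p-value falls in each interval (inf, sup]."""
    counts = Counter()
    for p in p_values:
        for level, (inf, sup) in enumerate(levels):
            if inf < p <= sup:
                counts[level] += 1
                break  # the intervals are disjoint, so stop at the first hit
    return [counts[level] for level in range(len(levels))]


def bioq(sig_bio: float, s_max: float) -> float:
    """Eq. 7: biological significance divided by S_max."""
    if s_max <= 0:
        raise ValueError("S_max must be positive")
    return sig_bio / s_max


# Made-up p-values of GO terms, for illustration only.
term_p_values = [5e-5, 1e-4, 2e-4, 5e-3, 0.2]
counts = count_terms_per_level(term_p_values)
```

With these made-up p-values, `counts` is `[2, 1, 1]`: the value 1e-4 sits on a closed upper bound and is counted in the first level, while 0.2 falls outside every interval and is ignored.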

Table 1 Biological significance intervals
