A microarray experiment may result in hundreds of differentially expressed genes that are subject to interpretation and further analysis. As analysing these lists gene by gene is tedious and error-prone, the genes in the lists are routinely annotated using Gene Ontology (GO) with the aim of identifying statistically significant biological processes or pathways [1]. However, statistical analysis of GO annotations can produce a very large number of significantly enriched or down-regulated biological processes. Thus, it is often challenging to interpret GO results and identify novel testable biological hypotheses.

The GO project provides a species-independent controlled vocabulary for describing gene products (an RNA or protein product encoded by a gene) in terms of their biological processes, cellular components and molecular functions [1]. The GO annotations are carried out by curators of several bioinformatics databases, so the GO database is constantly updated. The ontology defines terms that are linked together to form a directed acyclic graph. Gene products are annotated with a number of ontology terms. Annotation with a given term also implies annotation with all ancestors of the term.
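The implied annotation with ancestor terms can be illustrated with a minimal sketch. It assumes the ontology is available as a mapping from each term to its direct parents; the function names and the toy term names are hypothetical and are not real GO identifiers.

```python
def ancestors(term, parents):
    """Return all proper ancestors of `term` in a DAG given as child -> set of parents."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def propagate(direct_terms, parents):
    """Extend a gene's direct annotations with every ancestor of those terms."""
    full = set(direct_terms)
    for term in direct_terms:
        full |= ancestors(term, parents)
    return full


# Toy DAG with hypothetical term names; "kinase_activity" has two parents.
TOY_PARENTS = {
    "root": set(),
    "metabolism": {"root"},
    "signalling": {"root"},
    "kinase_activity": {"metabolism", "signalling"},
}

full_annotation = propagate({"kinase_activity"}, TOY_PARENTS)
```

Because a term may have several parents, the traversal keeps a set of visited terms so that shared ancestors are collected only once.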

In this study we present methodology and software to cluster genes based on their biological functionality using GO annotations. An integral part of the methodology is the ability to rapidly compute pair-wise distances between genes from the similarity of their annotations.

Two approaches to gene similarity computation are graph structure-based (GS) and information content-based (IC) measures. GS-based methods use the hierarchical structure of GO in computing gene similarity. IC-based methods additionally consider the *a priori* probabilities, or information contents, of GO terms in a reference gene set. IC-based measures have been found to perform better than pure graph-based measures [2, 3].

Czekanowski-Dice similarity [4] is a GS-based method. The distance between genes *G*_{1} and *G*_{2} is defined as

d({G}_{1},{G}_{2})=\frac{\#(GO({G}_{1})\Delta GO({G}_{2}))}{\#(GO({G}_{1})\cup GO({G}_{2}))+\#(GO({G}_{1})\cap GO({G}_{2}))},

where Δ is the symmetric set difference, # is the number of elements in a set and *GO*(*G*_{i}) is the set of *GO* annotations for gene *G*_{i}. Similarity can be defined as 1 - *d*(*G*_{1}, *G*_{2}).
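As an illustration, the Czekanowski-Dice distance can be sketched directly with set operations. The function name is hypothetical, and the handling of two genes without any annotations is an assumption of this sketch rather than part of the definition.

```python
def czekanowski_dice_distance(go1, go2):
    """#(GO1 symmetric-difference GO2) / (#(GO1 union GO2) + #(GO1 intersection GO2))."""
    go1, go2 = set(go1), set(go2)
    denominator = len(go1 | go2) + len(go1 & go2)
    if denominator == 0:
        # Assumption: two genes with no annotations are treated as identical.
        return 0.0
    return len(go1 ^ go2) / denominator


# Two shared terms (b, c) and two unshared terms (a, d): distance 2 / (4 + 2).
example_distance = czekanowski_dice_distance({"a", "b", "c"}, {"b", "c", "d"})
example_similarity = 1.0 - example_distance
```

The distance is 0 for identical annotation sets and 1 for disjoint ones.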

In Kappa statistics [5], each gene is represented as a binary vector (*g*_{1},...,*g*_{N}), where *g*_{i} is 1 if the gene is annotated with the *i*-th GO term and 0 otherwise. *N* is the total number of GO terms under consideration.

The similarity of genes *G*_{1} and *G*_{2} is defined as

{K}_{{G}_{1},{G}_{2}}=\frac{{O}_{{G}_{1},{G}_{2}}-{A}_{{G}_{1},{G}_{2}}}{1-{A}_{{G}_{1},{G}_{2}}},

where {O}_{{G}_{1},{G}_{2}} represents observed co-occurrence of GO terms and {A}_{{G}_{1},{G}_{2}} represents random co-occurrence. {O}_{{G}_{1},{G}_{2}} is the relative frequency of agreeing locations in the two binary vectors, i.e., locations that are either both 0 or both 1. {A}_{{G}_{1},{G}_{2}} is the expected relative frequency of such locations if the binary vectors were random, taking into account the observed probabilities of 0's and 1's.
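The Kappa computation can be sketched as follows. The chance agreement is taken here in the usual Cohen's kappa form, i.e., the product of the observed proportions of 1's plus the product of the observed proportions of 0's; this reading of the random co-occurrence term, the function name and the treatment of the degenerate case are assumptions of the sketch.

```python
def kappa_similarity(v1, v2):
    """Kappa statistic for two equally long binary annotation vectors."""
    if len(v1) != len(v2) or len(v1) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    n = len(v1)
    # Observed agreement: positions that are both 0 or both 1.
    observed = sum(1 for a, b in zip(v1, v2) if a == b) / n
    # Chance agreement from the observed proportions of 1's in each vector.
    p1 = sum(v1) / n
    p2 = sum(v2) / n
    expected = p1 * p2 + (1.0 - p1) * (1.0 - p2)
    if expected == 1.0:
        # Assumption: both vectors are constant and identical; report full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Three of four positions agree (O = 0.75) and chance agreement is A = 0.5.
example_kappa = kappa_similarity([1, 1, 0, 0], [1, 0, 0, 0])
```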

The following discussion considers IC-based similarity measures. The information content of a GO term is computed from the frequency with which the term occurs in annotations; a rarely used term carries a greater amount of information. The probability of observing a term *t* is defined as p(t)=\frac{\text{Freq}(\text{t})}{\text{MaxFreq}}, where MaxFreq is the maximum frequency over all terms [6]. The information content of a term *t* is given by *IC*(*t*) = -log_{2} *p*(*t*). Probabilities can be estimated from a corpus of annotations, such as the Gene Ontology database.
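A minimal sketch of the information content computation is given below. It assumes that the reference corpus is a mapping from genes to annotation sets that already include all ancestor terms, so that the root term attains the maximum frequency; the function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def information_content(annotations):
    """Map each term to IC(t) = -log2(Freq(t) / MaxFreq).

    `annotations` maps gene -> set of terms (assumed ancestor-complete).
    """
    freq = Counter(term for terms in annotations.values() for term in terms)
    max_freq = max(freq.values())
    # log2(MaxFreq / Freq) equals -log2(Freq / MaxFreq).
    return {term: math.log2(max_freq / count) for term, count in freq.items()}


# Toy corpus: "root" annotates all four genes, "a" two of them, "b" one.
TOY_CORPUS = {
    "gene1": {"root", "a", "b"},
    "gene2": {"root", "a"},
    "gene3": {"root"},
    "gene4": {"root"},
}
toy_ic = information_content(TOY_CORPUS)
```

In the toy corpus the root term has an information content of 0, and each halving of the frequency adds one bit.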

Several related similarity metrics are based on the most informative common ancestor (MICA) of two GO terms and were introduced in the context of GO by Lord et al. [7]. To compute the semantic similarity between terms *t*_{1} and *t*_{2}, we first find the most informative common ancestor *A* of *t*_{1} and *t*_{2}, i.e., *A* is a term that is an ancestor of both *t*_{1} and *t*_{2} and has the maximum *IC* among the common ancestors *CommonAnc*(*t*_{1}, *t*_{2}) of the terms. Now, the Resnik similarity [8] is defined as

{\text{Sim}}_{Res}({t}_{1},{t}_{2})=\underset{t\in CommonAnc({t}_{1},{t}_{2})}{\mathrm{max}}IC(t)=IC(A).

Several other measures are defined that also take the information contents of *t*_{1} and *t*_{2} into account. The Lin measure [9] is defined as

{\text{Sim}}_{Lin}({t}_{1},{t}_{2})=\frac{2IC(A)}{IC({t}_{1})+IC({t}_{2})}.

(1)

Jiang and Conrath [10] define a semantic distance metric as

{d}_{JC}({t}_{1},{t}_{2})=IC({t}_{1})+IC({t}_{2})-2IC(A).

The corresponding similarity measure for *d*_{JC}(*t*_{1}, *t*_{2}) [6] is given by

{\text{Sim}}_{JC}({t}_{1},{t}_{2})=\frac{1}{{d}_{JC}({t}_{1},{t}_{2})+1}.

Finally, the Relevance measure [11], which combines Lin's and Resnik's measures, is defined as

{\text{Sim}}_{Rel}({t}_{1},{t}_{2})=\underset{t\in CommonAnc({t}_{1},{t}_{2})}{\mathrm{max}}\frac{2\mathrm{log}p(t)(1-p(t))}{\mathrm{log}p({t}_{1})+\mathrm{log}p({t}_{2})}=\frac{2IC(A)(1-p(A))}{IC({t}_{1})+IC({t}_{2})}.
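The term-level measures discussed above (Resnik, Lin, Jiang-Conrath and Relevance) can be sketched together, since all of them depend on the information content of the most informative common ancestor. The sketch assumes that ancestor sets include the term itself, uses the standard form *IC*(*t*_{1}) + *IC*(*t*_{2}) - 2*IC*(*A*) for the Jiang-Conrath distance, and recovers *p*(*A*) from *IC*(*A*) as 2^{-*IC*(*A*)}. All function names and the toy data are hypothetical.

```python
def mica_ic(t1, t2, anc, ic):
    """IC of the most informative common ancestor (anc[t] is assumed to contain t)."""
    return max(ic[t] for t in anc[t1] & anc[t2])


def sim_resnik(t1, t2, anc, ic):
    return mica_ic(t1, t2, anc, ic)


def sim_lin(t1, t2, anc, ic):
    denominator = ic[t1] + ic[t2]
    if denominator == 0:
        # Arbitrary convention of this sketch: two zero-IC terms score 0.
        return 0.0
    return 2.0 * mica_ic(t1, t2, anc, ic) / denominator


def dist_jc(t1, t2, anc, ic):
    return ic[t1] + ic[t2] - 2.0 * mica_ic(t1, t2, anc, ic)


def sim_jc(t1, t2, anc, ic):
    return 1.0 / (dist_jc(t1, t2, anc, ic) + 1.0)


def sim_rel(t1, t2, anc, ic):
    # IC = -log2(p), hence p(A) = 2 ** (-IC(A)).
    p_mica = 2.0 ** (-mica_ic(t1, t2, anc, ic))
    return sim_lin(t1, t2, anc, ic) * (1.0 - p_mica)


# Toy ontology: root -> a -> {b, c}, with illustrative IC values.
TOY_IC = {"root": 0.0, "a": 1.0, "b": 2.0, "c": 2.0}
TOY_ANC = {
    "root": {"root"},
    "a": {"a", "root"},
    "b": {"b", "a", "root"},
    "c": {"c", "a", "root"},
}
```

For the sibling terms `b` and `c` the MICA is `a`, so the Resnik similarity equals *IC*(`a`) and the remaining measures rescale this value by the information contents of the two terms.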

The MICA-based measures can be modified to take into account so-called disjunctive ancestor terms [6]. Two ancestors *a*_{1} and *a*_{2} of a term *t* are disjunctive if there are independent paths from *a*_{1} to *t* and from *a*_{2} to *t*. Such ancestors represent distinct interpretations of the term *t*. In the GraSM enhancement, all common disjunctive ancestors of terms *t*_{1} and *t*_{2} are considered when computing Sim(*t*_{1}, *t*_{2}) [6]. GraSM modifies the computation of *IC*(*A*) and can be applied to the Resnik, Lin and Jiang-Conrath measures.

After computing the pair-wise term similarities, the next step in MICA-based measures is to calculate the similarity between genes *G*_{1} and *G*_{2}. This can be done in several ways, and our package supports the three most commonly used methods. In the two simplest methods, the maximum or the mean of the pair-wise GO term similarities between the annotation sets of *G*_{1} and *G*_{2} is used as the similarity value [12]. That is, when *G*_{1} is annotated with terms *t*_{1},...,*t*_{n} and *G*_{2} with terms {{t}^{\prime}}_{1},\mathrm{...},{{t}^{\prime}}_{m}, the pair-wise term similarities form an *n* × *m* matrix **S**. Now, Sim_{gene}(*G*_{1}, *G*_{2}) is the maximum or the mean of the entries of the matrix. In the third method, similarity is defined as Sim_{gene}(*G*_{1}, *G*_{2}) = max{*rowScore*, *columnScore*} [11], where

\begin{array}{ccc}rowScore=\frac{1}{n}{\displaystyle \sum _{i=1}^{n}\underset{1\le j\le m}{\mathrm{max}}{S}_{ij}}& \text{and}& columnScore=\frac{1}{m}{\displaystyle \sum _{j=1}^{m}\underset{1\le i\le n}{\mathrm{max}}{S}_{ij}}.\end{array}
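The three ways of combining the term similarity matrix **S** into a gene similarity can be sketched as one small function. The function name and the method labels (`"max"`, `"mean"`, `"rcmax"`) are hypothetical and do not refer to the interface of the package.

```python
def gene_similarity(S, method="rcmax"):
    """Combine an n x m matrix of term similarities (list of rows) into one value."""
    n, m = len(S), len(S[0])
    if method == "max":
        return max(max(row) for row in S)
    if method == "mean":
        return sum(sum(row) for row in S) / (n * m)
    if method == "rcmax":
        # rowScore: mean of the row maxima; columnScore: mean of the column maxima.
        row_score = sum(max(row) for row in S) / n
        column_score = sum(max(S[i][j] for i in range(n)) for j in range(m)) / m
        return max(row_score, column_score)
    raise ValueError("unknown method: " + method)


# Toy 3 x 2 matrix: gene 1 has three terms, gene 2 has two.
TOY_S = [
    [0.2, 0.8],
    [0.4, 0.6],
    [1.0, 0.0],
]
```

With the toy matrix, *rowScore* is (0.8 + 0.6 + 1.0)/3 and *columnScore* is (1.0 + 0.8)/2, so the third method returns the column score.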

In addition to MICA- and GraSM-based measures, we have implemented the cosine similarity and SimGIC measures. In cosine similarity [13], each gene *G* is represented as a vector (*w*_{1}, *w*_{2},...,*w*_{N}), where each *w*_{i} is *IC*(*t*_{i}) if *G* is annotated with the term *t*_{i}, or 0 otherwise. *N* is the total number of GO terms under consideration. The similarity of genes *G*_{1} and *G*_{2} is defined as \frac{{G}_{1}\cdot {G}_{2}}{\left|{G}_{1}\right|\left|{G}_{2}\right|}, where · is the dot product and |*v*| is the vector norm. This is the cosine of the angle between vectors *G*_{1} and *G*_{2}. In the SimGIC (Graph Information Content) measure [3], the similarity of genes *G*_{1} and *G*_{2} is defined as

\frac{{\Sigma}_{t\in GO({G}_{1})\cap GO({G}_{2})}IC(t)}{{\Sigma}_{t\in GO({G}_{1})\cup GO({G}_{2})}IC(t)},

where *GO*(*G*_{i}) gives the GO annotations of gene *G*_{i}. SimGIC is a hybrid of GS- and IC-based methods.
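Both the cosine and SimGIC measures can be sketched on annotation sets without building the full length-*N* vectors, because terms that a gene lacks contribute zero weight. The function names, the toy data and the return value of 0 for degenerate inputs are assumptions of this sketch.

```python
import math


def cosine_similarity(go1, go2, ic):
    """Cosine of IC-weighted annotation vectors; only shared terms enter the dot product."""
    dot = sum(ic[t] ** 2 for t in go1 & go2)
    norm1 = math.sqrt(sum(ic[t] ** 2 for t in go1))
    norm2 = math.sqrt(sum(ic[t] ** 2 for t in go2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: a gene with only zero-IC terms has similarity 0 to any gene.
        return 0.0
    return dot / (norm1 * norm2)


def simgic(go1, go2, ic):
    """Sum of IC over shared terms divided by sum of IC over all terms of either gene."""
    union_ic = sum(ic[t] for t in go1 | go2)
    if union_ic == 0:
        return 0.0  # Assumption for the degenerate case.
    return sum(ic[t] for t in go1 & go2) / union_ic


# Toy data with illustrative IC values.
TOY_IC = {"root": 0.0, "a": 1.0, "b": 2.0}
GENE_1 = {"root", "a", "b"}
GENE_2 = {"root", "a"}
```

For the toy genes the shared terms are `root` and `a`, so SimGIC is 1/(1 + 2) and the cosine similarity is 1/√5.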

Given similarities between the genes, we use hierarchical clustering with heat map presentation to visualise both semantic similarities and expression levels of the genes. First, similarity measures are converted to distances using *d*(*x*, *y*) = 1 - Sim(*x*, *y*) when the similarity range is [0, 1] (Czekanowski-Dice, Kappa, Lin, Jiang-Conrath, Relevance, Cosine, SimGIC) or using *d*(*x*, *y*) = 1/(Sim(*x*, *y*) + 1) when the range is [0, ∞) (Resnik). Second, a hierarchical clustering algorithm is run using the converted distances. The results are visualised as a dendrogram and heat map. The dendrogram is generated using the GO semantic distances and allows identification of clusters containing genes contributing to the same biological process. For each cluster we compute statistical significance with a permutation test. The heat map illustrates gene expression data obtained from microarray analysis. Thus, the visualisation framework relates gene expression levels to biological processes, which facilitates interpretation of the gene expression analysis results.
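The conversion and clustering steps can be sketched in plain Python. The linkage criterion is not specified above, so average linkage is used here purely as an assumption; the function names are hypothetical, and the naive implementation is meant for illustration rather than for large gene lists.

```python
def to_distance(sim, bounded=True):
    """1 - sim for similarities in [0, 1]; 1 / (sim + 1) for unbounded ones such as Resnik."""
    return 1.0 - sim if bounded else 1.0 / (sim + 1.0)


def average_linkage(dist):
    """Naive agglomerative clustering on a symmetric distance matrix.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of item indices.
    """
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest mean pair-wise distance.
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                a, b = clusters[x], clusters[y]
                d = sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        a, b = clusters[x], clusters[y]
        merges.append((a, b, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(tuple(sorted(a + b)))
    return merges


# Toy similarity matrix for three genes; genes 0 and 1 are functionally close.
TOY_SIM = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
toy_dist = [[to_distance(s) for s in row] for row in TOY_SIM]
toy_merges = average_linkage(toy_dist)
```

The merge history corresponds to the dendrogram: genes 0 and 1 are joined first at a small distance, and gene 2 joins the resulting cluster at the mean of its distances to both.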