A large number of methods have been developed for the analysis of microarray gene expression data, reflecting the tremendous complexity of transforming information on the expression levels of 20,000 genes into meaningful biological insights. Many microarray data analysis approaches are based on case-control study designs, such as comparing treated and untreated cells or matched disease and control tissues. However, the control group may be hard to define and challenging to acquire. In some cases, such as differentiating stem cells, multiple control groups would be needed to achieve a comprehensive understanding of the differentiation pathways. The method presented in this paper, AGEP, allows a highly informative comparison of a single microarray sample against an existing reference database of annotated, previously analyzed microarray data.
The philosophy of AGEP is analogous to that of the sequence alignment methods used to analyze and compare newly sequenced DNA. These methods are highly powerful because fully sequenced genomes and 108 million sequence records are available as a reference in GenBank. The key difference between sequence-based and gene expression-based methods is that the latter provide quantitative information, not just qualitative sequence identities. Therefore, we had to take into account the distributions of gene expression levels in each reference tissue, which are often multi-modal in nature. In the AGEP method, this was accomplished by calculating kernel density estimates for each gene in each reference tissue type, thereby generating reference data for the characteristic expression profiles of all genes in all the major normal tissue types.
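The idea of one density estimate per gene and reference tissue can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the AGEP implementation: the Gaussian kernel, the normal-reference bandwidth rule, the function names, and the gene, tissue and expression values are all hypothetical.

```python
import math
import statistics


def rule_of_thumb_bandwidth(values):
    """Normal-reference rule of thumb: 1.06 * sd * n^(-1/5) (an assumed choice)."""
    return 1.06 * statistics.stdev(values) * len(values) ** (-1 / 5)


def gaussian_kde(values, bandwidth=None):
    """Return a function x -> estimated density, using a Gaussian kernel."""
    h = bandwidth if bandwidth is not None else rule_of_thumb_bandwidth(values)
    norm = 1.0 / (len(values) * h * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)

    return density


# Hypothetical expression values, keyed by (gene, tissue).
reference = {
    ("GENE_A", "liver"): [4.1, 4.3, 3.9, 4.0, 9.8, 10.1, 10.0, 9.7],  # bimodal
    ("GENE_A", "heart"): [6.0, 6.2, 5.9, 6.1, 6.3, 5.8],              # unimodal
}

# One density estimate per gene and tissue; a fixed bandwidth is used here
# so that the two modes of the liver profile stay clearly separated.
densities = {key: gaussian_kde(vals, bandwidth=0.5) for key, vals in reference.items()}

liver = densities[("GENE_A", "liver")]
for x in (4.0, 7.0, 10.0):
    print(f"liver density at {x}: {liver(x):.4f}")
```

In this toy profile the estimated density is high near either mode and low between them, which is the behavior a single mean or median cannot represent.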
We feel that a simple categorization of gene expression into two or three categories (such as underexpression, average expression and overexpression) is insufficient to capture the true behavior of genes. AGEP assumes that the whole spectrum of expression values for a gene in a tissue reflects the true variation in vivo. Therefore, when we compare an expression value from an external sample to a reference database, we determine quantitatively how well that value fits the distribution in each reference tissue, instead of simply asking whether the gene is up- or downregulated in a direct comparison with a reference tissue, as such analyses are usually done.
One of the key features of the AGEP method is the tm-score. We believe that it is the best way to compare a single expression value to a host of values from any reference sample group, such as a single tissue. Unlike a single summary value (such as the mean or median), it is able to account for any type of expression distribution, and it takes into account the observed expression range of the gene in question. It can also accommodate missing values, which is not the case for many other methods. It is also relatively robust against annotation errors: mixing two tissue types together will create a bimodal expression profile for at least some of the genes, and AGEP can accept that as a feature of the (mixed) tissue class, whereas methods based on a single summary statistic would generate values that are not correct for either tissue type in the mix.
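The annotation-error argument can be made concrete with a toy example. The sketch below pools two hypothetical tissue types under one label and compares a mean-based distance with a density-based fit. The `relative_fit` score (density at the query divided by the peak density over the observed range) is a stand-in invented for this illustration; it is not the tm-score, and the data, bandwidth and function names are likewise assumptions.

```python
import math
import statistics


def kde(values, h):
    """Gaussian kernel density estimate with a fixed bandwidth h."""
    norm = 1.0 / (len(values) * h * math.sqrt(2 * math.pi))
    return lambda x: norm * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)


def relative_fit(x, density, values, grid_step=0.05):
    """Stand-in score (NOT the tm-score): density at x relative to the
    peak density found on a grid spanning the observed expression range."""
    lo, hi = min(values), max(values)
    steps = int(round((hi - lo) / grid_step))
    peak = max(density(lo + i * grid_step) for i in range(steps + 1))
    return density(x) / peak


# Hypothetical pool in which two tissue types were annotated as one class.
tissue_a = [4.0, 4.2, 3.8, 4.1, 3.9, 4.0]
tissue_b = [10.0, 9.8, 10.2, 10.1, 9.9, 10.0]
mixed = tissue_a + tissue_b

pool_mean = statistics.mean(mixed)   # lies between the modes, where no sample is
density = kde(mixed, h=0.5)

for query in (4.0, 7.0, 10.0):
    print(
        f"query={query:5.1f}  "
        f"distance to mean={abs(query - pool_mean):.2f}  "
        f"relative fit={relative_fit(query, density, mixed):.3f}"
    )
```

A query value at the pooled mean looks like a perfect match to a mean-based method although no reference sample is expressed at that level, while the density-based score rates it as a poor fit and rates values near either mode as good fits.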
The performance of AGEP in finding the correct tissue of origin for a set of samples was benchmarked against both nearest-neighbor and SVM classifiers, the latter being one of the most powerful classification engines available [27–29]. As AGEP reached a performance level at least comparable to that of SVM, we do not anticipate that comparison to other methods would change the conclusion that AGEP's absolute accuracy in tissue identification is comparable to that of other key methods and adequate for most purposes.
For tissue classification purposes, tm-scores need to be evaluated in terms of how well they differentiate each tissue from all the reference sample types. Transforming tm-scores to tissue specificity scores (ts-scores) provides the necessary evaluation. The ts-score may not necessarily be the optimal method for testing the classification of the query sample against one tissue type. That being said, the high classification accuracy achieved by AGEP demonstrates that the tm-score is a good basis for comparing the similarity of a single gene expression value to a reference pool.
Importantly, AGEP not only provides a metric of sample similarity, but also identifies the genes that are informative in comparison to all the reference tissues. This is important for understanding the biological basis of the transcriptomic similarities. That is, rather than just asking the question "What tissues does this gene expression profile resemble?", AGEP can also answer questions like "Which genes contribute to the similarity to a certain tissue?" or "What biological processes are different in the test sample as compared to the various tissues?", as evidenced by the presented case studies.
Previous methods for similar comparisons are typically based on an upfront selection of subsets of genes (gene sets or signatures) that are derived from the test samples and reference sets. Examples of conceptually similar approaches include the connectivity map [44, 45], molecular concept mapping, and the relevancy metric, which all provide the capability to link new experiments to existing ones. Selected gene sets are most informative and powerful for the purpose they were designed for, and they depend entirely on the identification and annotation of meaningful gene sets that may or may not be available for a particular study. Also, gene sets may not transfer well from one context to another, e.g. from one tissue to another. Other informative gene expression patterns may be missed when focusing on gene sets or molecular concepts. AGEP does not depend on a priori assumptions that some subsets of genes are more informative than others, and it was designed for the analysis of individual samples.
The AGEP method is widely applicable, but it is particularly powerful when a deep interpretation of microarray results is needed for samples that lack an optimal control tissue because of technical, medical or biological considerations. Examples include cell differentiation and stem cell research, where comparisons with multiple different cell and tissue types are needed.
When selecting the reference data, we omitted any tissue with fewer than six samples. Normal human tissue specimens are hard to obtain in large quantities, which is why we accepted such a low threshold. It is nevertheless less than optimal as a statistical lower limit, as individual samples have a huge impact on the shape of the kernel density when there are so few of them. As more data become available, we suggest raising the lower limit to at least 20 samples, so that each reference sample type represents the spectrum of likely expression levels.
The computational requirements for AGEP are rather heavy, as representing the expression distributions as density estimates requires considerable amounts of memory. With the current implementation, AGEP needs to be run on a server with more than 10 GB of memory, although this depends largely on the size of the reference database used.