ExpressionData - A public resource of high quality curated datasets representing gene expression across anatomy, development and experimental conditions
© Zimmermann et al.; licensee BioMed Central Ltd. 2014
Received: 21 May 2013
Accepted: 13 August 2014
Published: 31 August 2014
Reference datasets are often used to compare, interpret or validate experimental data and analytical methods. In the field of gene expression, several reference datasets have been published. Typically, they consist of individual baseline or spike-in experiments carried out in a single laboratory and representing a particular set of conditions.
Here, we describe a new type of standardized dataset that represents the spatial and temporal dimensions of gene expression. These datasets result from integrating expression data from a large number of globally normalized and quality-controlled public experiments. Expression data are aggregated by anatomical part or stage of development to yield a representative transcriptome for each category. For example, we created a genome-wide expression dataset representing the FDA tissue panel across 35 tissue types. The proposed datasets were created for human and several model organisms and are publicly available at http://www.expressiondata.org.
Baseline or reference data are important for the analysis and interpretation of experiments and for testing methods and algorithms. Two main types of baseline data exist: a) measurements that serve as negative controls within a given experiment, and b) experiments that were carried out for the purpose of creating a collection of default states. In the first type, data from control samples (typically unperturbed samples) are necessary to test the effect of individual factors in an experiment or to calibrate a technology or treatment. In the second type, baseline datasets typically aim at compiling a large collection of data points representative for development, anatomy, cancer or response to a particular condition. The main objective of the second type is to provide comparative value for improved interpretation or for verification. In the present work, we consider the second type of reference data.
The nature of what are considered to be appropriate baseline conditions depends on community standards and on experimental design, but frequently they target unperturbed, healthy samples from a “wild-type” genetic background, or they represent a collection of perturbed states as reference for one’s own data. Several distinct types of baseline gene expression data exist: tissue, cancer or development profiles (usually absolute expression values), perturbations and diseases (relative values) and time courses and dose responses (absolute or relative values), or a combination of these spatial, temporal and response profiles. Examples of baseline data sets that have been published include the GNF human tissue panel, The Cancer Genome Atlas, the rat liver and kidney HESI baseline dataset, the NOWAC Postgenome Study, and the Arabidopsis AtGenExpress datasets for abiotic stress, plant development, or hormonal and chemical responses. New technological platforms are also frequently assessed by generating data from control samples, resulting in baseline experimental datasets that can subsequently be used as a reference on these platforms. For example, several normalization methods for Affymetrix expression arrays have been benchmarked using the Latin square spike-in data from Affymetrix. Further examples of spike-in datasets are the Golden Spike, Platinum Spike and Agilent Spike experiments to assess single channel or dual channel microarrays, and a spike-in dataset used to map the mammalian transcriptome using high-throughput RNA sequencing.
Most publicly available datasets originate from a single experiment with few independent biological replicates, performed in a particular experimental setting. The expression profiles of these samples therefore represent gene expression in that particular context, but it is not a priori clear whether the results are reproducible in other contexts. Furthermore, it is not immediately apparent whether similar results have been found previously or whether they are novel. Answering these two questions requires comparable experiments. Comparing one’s own expression profiles with reference datasets composed of a variety of experimental conditions allows similarities to be interpreted globally or at the level of individual gene signatures. The composition and robustness of the reference datasets used will have a major impact on the outcome of such comparisons. It is therefore essential to create robust reference datasets containing representative expression values for individual biological contexts. Robustness and improved statistical power can be achieved through intensity-level integration of microarray data. The quality of these profiles also depends on the quality and granularity of the sample annotations. Here, we present a collection of reference datasets based on average expression values generated from many samples originating from a similar biological context. The annotations of each sample were manually verified and their profiles were compared to other samples having the same annotations. The rationale for this approach is that the expression of a set of genes in a specifically defined condition is reproducible and, therefore, similar datasets can be combined to create a representative profile. This concept, called meta-profiling, was introduced in GENEVESTIGATOR and has proven to be highly useful.
The approach works particularly well for tissue types and cancers, since these are the main determinants of the transcript population. It allows rich and robust datasets to be created from the bulk of publicly available research data for various applications, in particular for confirmation, classification or interpretation of one’s own experimental results.
Construction and content
The datasets made available at ExpressionData represent a carefully chosen subset of platforms and conditions from the complete GENEVESTIGATOR database. The criteria for selecting a particular condition were defined as follows:
Anatomy: each tissue type is represented by data from at least two independent experiments and at least 30 replicates;
Development: all expression data available for each category is aggregated into an average vector per category.
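The aggregation step described by these criteria can be illustrated with a minimal sketch. It assumes that "average vector" means a gene-wise arithmetic mean over already normalized samples and that "replicates" can be counted as samples; the function name and the input layout are hypothetical and are not part of the ExpressionData or GENEVESTIGATOR software.

```python
from collections import defaultdict
from statistics import fmean

MIN_EXPERIMENTS = 2  # anatomy criterion stated in the text
MIN_SAMPLES = 30     # anatomy criterion stated in the text (replicates counted as samples)


def aggregate_by_category(samples, min_experiments=MIN_EXPERIMENTS,
                          min_samples=MIN_SAMPLES):
    """Return {category: mean expression vector} for qualifying categories.

    `samples` is an iterable of (category, experiment_id, values) tuples,
    where `values` lists normalized expression values, one per gene, in the
    same gene order for every sample (hypothetical input layout).
    """
    grouped = defaultdict(list)
    experiments = defaultdict(set)
    for category, experiment_id, values in samples:
        grouped[category].append(values)
        experiments[category].add(experiment_id)

    profiles = {}
    for category, vectors in grouped.items():
        if len(experiments[category]) < min_experiments:
            continue  # too few independent experiments
        if len(vectors) < min_samples:
            continue  # too few samples
        # Average gene by gene across all samples of the category.
        profiles[category] = [fmean(gene_values) for gene_values in zip(*vectors)]
    return profiles
```

For developmental stages, where the text states no minimum, the same function could be called with `min_experiments=1` and `min_samples=1`.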
Datasets representing spatial expression
Knowledge of the spatial expression characteristics of genes is crucial for understanding their function and regulation. Representative vectors of expression in tissue types were computed from a very large number of samples, generated in at least two independent laboratories under a variety of conditions. Since all datasets are normalized so that data from multiple sources can be integrated, a large number of samples from different tissues can be compiled into a single dataset where each row represents one gene and each column represents one tissue.
To demonstrate the biological validity of tissue meta-profiles, we carried out a principal component analysis of a mouse tissue expression dataset which had been summarized from more than 3000 Affymetrix array datasets available in GENEVESTIGATOR. The results show a clear grouping of tissues that are functionally related (see Figure 2). The first principal component distinctly separates all central nervous system tissues from all other body parts. The second principal component groups all other tissues into clusters of anatomical parts that have a common origin or physiology. For example, a variety of muscle tissues form a distinct cluster that is located close to heart and heart ventricle tissues. These results confirm previous findings on comparing human and mouse tissues based on datasets that were normalized differently and in which tissue samples are represented individually. Differences between the individual vectors therefore primarily reflect fundamental biological processes that are associated with each tissue type. The anatomical datasets available at http://www.expressiondata.org have a carefully selected coverage of tissue types, each of them represented by a single vector of expression.
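As an illustration of how tissue meta-profiles can be projected onto a principal component, the sketch below scores each tissue on the first component using power iteration on the tissue-by-tissue Gram matrix. This is not the implementation used for Figure 2, which the text does not specify; gene-wise centering, the iteration count and the function name are assumptions made for the example.

```python
import math


def first_pc_scores(profiles, iterations=200):
    """Score each tissue on the first principal component.

    `profiles` maps tissue name -> expression vector (same gene order).
    The tissue-by-tissue Gram matrix is used because it stays small even
    when the number of genes is large.
    """
    tissues = sorted(profiles)
    n = len(tissues)
    n_genes = len(profiles[tissues[0]])
    # Center every gene across tissues.
    means = [sum(profiles[t][g] for t in tissues) / n for g in range(n_genes)]
    centered = [[profiles[t][g] - means[g] for g in range(n_genes)]
                for t in tissues]
    # Gram matrix of the centered tissue vectors.
    gram = [[sum(a * b for a, b in zip(centered[i], centered[j]))
             for j in range(n)] for i in range(n)]
    # Power iteration for the leading eigenvector.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # no variance between tissues
        vec = [x / norm for x in nxt]
    eigenvalue = sum(vec[i] * sum(gram[i][j] * vec[j] for j in range(n))
                     for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    # The sign of a principal component is arbitrary.
    return {t: scale * v for t, v in zip(tissues, vec)}
```

Tissues with similar profiles receive scores of the same sign and similar magnitude. For real datasets, an established linear algebra library would be the more practical choice.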
Datasets representing developmental expression
Meta-profile data for anatomy and development provide an excellent basis for genomic data interpretation. For many biological questions, however, it is desirable to look beyond the spatio-temporal aspects of gene expression. Many organisms undergo time-related regulation, in particular circadian regulation. The ExpressionData resource therefore contains further datasets of particular biological relevance. Two of them are presented here.
Datasets with biological oscillations
Datasets with time-courses
Time-course datasets are useful to identify trends and to measure the rapidity of response to a given perturbation or developmental process. This type of data is also interesting for the development of methods to identify such trends. Here, we show an example of a time-course dataset that, as opposed to the circadian dataset, was used to identify genes having no circadian response.
The response to cold stress is a highly conserved defense mechanism by which plants protect their viability. The cold stress response can be an attractive mechanism to modulate the expression and production of recombinant protein in plants without the use of chemical inducers, relying instead on relatively inexpensive and controllable stimuli.
The dataset considered in this section was generated by harvesting leaves of 6-week-old greenhouse-grown tobacco (Nicotiana tabacum) plants from the relatively cold-intolerant Flue-Cured variety K326. Leaves were quickly chilled for 10 minutes in a blast chiller at a temperature between 0 and 5°C and monitored to avoid frost. Then, the samples were incubated at the same temperature for 5 hours or 24 hours before being frozen in liquid nitrogen. Two harvesting times, one in the morning (7:30am) and one in the afternoon (1:00pm), were chosen to avoid circadian effects. Two sets of control leaves were harvested together with the cold-treated samples and frozen immediately in liquid nitrogen (Time 0). Three biological replicates were prepared for each time point. Unlike the previously described datasets focused on biological oscillations, the objective of this type of study is to find genes that are continuously induced by cold treatment over the 24-hour period, independently of the harvesting time of the seedlings (i.e. biological oscillations between morning and afternoon do not interfere with the gene induction). The raw data from this experiment are available in Gene Expression Omnibus (GEO) under accession number GSE44938.
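The selection logic implied by this design can be sketched as follows: a gene is retained only if its mean expression exceeds the Time 0 control at both 5 hours and 24 hours, for the morning and the afternoon harvest alike. The input layout, the function name and the two-fold (log2 = 1.0) cut-off are hypothetical and are not taken from the study; a real analysis would also apply a statistical test across the three replicates.

```python
from statistics import fmean

HARVESTS = ("morning", "afternoon")
TREATED_HOURS = (5, 24)


def continuously_induced(expression, min_log2_fold=1.0):
    """Return genes induced at every treated time point in every harvest.

    `expression[gene][(harvest, hours)]` is a list of replicate log2 values;
    hours == 0 denotes the untreated control of that harvest.
    `min_log2_fold` is a hypothetical cut-off, not a value from the study.
    """
    selected = []
    for gene, groups in expression.items():
        induced = True
        for harvest in HARVESTS:
            baseline = fmean(groups[(harvest, 0)])
            for hours in TREATED_HOURS:
                if fmean(groups[(harvest, hours)]) - baseline < min_log2_fold:
                    induced = False  # fails in this harvest or time point
        if induced:
            selected.append(gene)
    return sorted(selected)
```

Requiring induction in both harvests is what removes genes whose apparent response only reflects the difference between morning and afternoon expression.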
Microarray experiments such as this one may lead to the discovery of regulatory elements that are not circadian-sensitive and that can serve as powerful tools for the expression of commercially valuable recombinant proteins in plants. They may also lead to the identification of previously unknown genes that could be targeted for the selection of cold-tolerant varieties.
The datasets presented here are examples of curated and aggregated datasets of gene expression covering a broad range of biological conditions. We have created reference datasets that can be used for:
Interpreting results from other studies. The created expression compendia, such as those for anatomical parts, are representative of individual biological states and allow comparisons with expression data obtained in other experiments.
Finding novel gene regulatory modules and networks from aggregated datasets containing a high diversity of tissues and developmental stages.
Testing bioinformatics methods or algorithms with manually curated and biologically relevant data.
The results shown here and on the http://www.expressiondata.org website demonstrate the biological relevance of such datasets. In particular, the results in Figure 2 and the examples on the website show that the compiled data contain robust and representative estimates for individual biological states such as tissue types or stages of development. A high concordance was also found between the human and mouse tissue transcriptomes when aggregated expression data for each organism were compared across the same set of tissues. While the ExpressionData resource is expected to grow as new datasets are processed and made available, it does not aim to provide complete sets of data for every organism.
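One possible way to quantify such cross-species concordance is a Pearson correlation, computed for each orthologous gene pair across the tissues that both organisms share. The text does not state which measure was used, so the sketch below is only an illustration; the orthologue list, the data layout and the minimum of three shared tissues are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists (nan if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def tissue_concordance(human, mouse, orthologs, min_tissues=3):
    """Correlate aggregated profiles of orthologous genes across shared tissues.

    `human` and `mouse` map gene -> {tissue: aggregated expression value};
    `orthologs` is a list of (human_gene, mouse_gene) pairs (hypothetical input).
    """
    result = {}
    for human_gene, mouse_gene in orthologs:
        shared = sorted(set(human[human_gene]) & set(mouse[mouse_gene]))
        if len(shared) < min_tissues:
            continue  # too few common tissues for a meaningful correlation
        result[(human_gene, mouse_gene)] = pearson(
            [human[human_gene][t] for t in shared],
            [mouse[mouse_gene][t] for t in shared],
        )
    return result
```

Because the correlation is computed within each gene across tissues, differences in absolute signal scale between the two array platforms do not affect the result.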
ExpressionData is a new source of tested, high quality, manually curated, globally normalized and aggregated expression data ready for use in a variety of data interpretation and verification tasks. It is expected to simplify the search for robust and high quality datasets and to provide a set of reference data for comparison. Its content is open access and freely available for download at http://www.expressiondata.org.
We acknowledge support of our research from the European Union (EU Framework Program 6, AGRON-OMICS (LSHG-CT-2006-037704)), the Swiss Commission for Technology and Innovation (CTI, grants 9428.1 PFLS-LS and 12396.1 PFLS-LS), from ETH Zurich and from Philip Morris International.
- Su AI, Cooke MP, Ching KA, Hakak Y, Walker JR, Wiltshire T, Orth AP, Vega RG, Sapinoso LM, Moqrich A, Patapoutian A, Hampton GM, Schultz PG, Hogenesch JB: Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci USA. 2002, 99: 4465-4470. 10.1073/pnas.012025199.
- Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, Stuart JM, Chang K, Creighton CJ, Davis C, Donehower L, Drummond J, Wheeler D, Ally A, Balasundaram M, Birol I, Butterfield SN, Chu A, Chuah E, Chun HJ, Dhalla N, Guin R, Hirst M, Hirst C, Holt RA, Jones SJ, Lee D, Li HI: The cancer genome Atlas pan-cancer analysis project. Nat Genet. 2013, 45 (10): 1113-1120. 10.1038/ng.2764.
- Boedigheimer M, Wolfinger R, Bass M, Bushel P, Chou J, Cooper M, Corton JC, Fostel J, Hester S, Lee J, Liu F, Liu J, Qian HR, Quackenbush J, Pettit S, Thompson K: Sources of variation in baseline gene expression levels from toxicogenomics study control animals across multiple laboratories. BMC Genomics. 2008, 9: 285. 10.1186/1471-2164-9-285.
- Dumeaux V, Olsen KS, Nuel G, Paulssen RH, Børresen-Dale AL, Lund E: Deciphering normal blood gene expression variation - the NOWAC postgenome study. PLoS Genet. 2010, 6: e1000873. 10.1371/journal.pgen.1000873.
- Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, D’Angelo C, Bornberg-Bauer E, Kudla J, Harter K: The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant J. 2007, 50 (2): 347-363. 10.1111/j.1365-313X.2007.03052.x.
- Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Scholkopf B, Weigel D, Lohmann JU: A gene expression map of Arabidopsis thaliana development. Nat Genet. 2005, 37: 501-506. 10.1038/ng1543.
- Goda H, Sasaki E, Akiyama K, Maruyama-Nakashita A, Nakabayashi K, Li W, Ogawa M, Yamauchi Y, Preston J, Aoki K, Kiba T, Takatsuto S, Fujioka S, Asami T, Nakano T, Kato H, Mizuno T, Sakakibara H, Yamaguchi S, Nambara E, Kamiya Y, Takahashi H, Hirai MY, Sakurai T, Shinozaki K, Saito K, Yoshida S, Shimada Y: The AtGenExpress hormone and chemical treatment data set: experimental design, data evaluation, model data analysis and data access. Plant J. 2008, 55 (3): 526-542. 10.1111/j.1365-313X.2008.03510.x.
- Latin Square data for Expression Algorithm Assessment. [http://www.affymetrix.com/support/technical/sample_data/datasets.affx]
- Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol. 2005, 6: R16. 10.1186/gb-2005-6-2-r16.
- Zhu Q, Miecznikowski JC, Halfon MS: Preferred analysis methods for Affymetrix GeneChips. II. An expanded, balanced, wholly-defined spike-in dataset. BMC Bioinformatics. 2010, 11: 285. 10.1186/1471-2105-11-285.
- Zhu Q, Miecznikowski JC, Halfon MS: A wholly defined Agilent microarray spike-in dataset. Bioinformatics. 2011, 27: 1284-1289. 10.1093/bioinformatics/btr135.
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5: 621-628. 10.1038/nmeth.1226.
- Turnbull AK, Kitchen RR, Larionov AA, Renshaw L, Dixon JM, Sims AH: Direct integration of intensity-level data from Affymetrix and Illumina microarrays improves statistical power for robust reanalysis. BMC Med Genomics. 2012, 5: 35. 10.1186/1755-8794-5-35.
- Hruz T, Laule O, Szabo G, Wessendorp F, Bleuler S, Oertle L, Widmayer P, Gruissem W, Zimmermann P: Genevestigator v3: a reference expression database for the meta-analysis of transcriptomes. Adv Bioinformatics. 2008, 2008: 420747.
- Prasad A, Kumar SS, Dessimoz C, Bleuler S, Laule O, Hruz T, Gruissem W, Zimmermann P: Global regulatory architecture of human, mouse and rat tissue transcriptomes. BMC Genomics. 2013, 14: 716. 10.1186/1471-2164-14-716.
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Holko M, Ayanbule O, Yefanov A, Soboleva A: NCBI GEO: archive for functional genomics data sets–10 years on. Nucleic Acids Res. 2011, 39: D1005-1010. 10.1093/nar/gkq1184.
- Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Holloway E, Kurbatova N, Lukk M, Malone J, Mani R, Pilicheva E, Rustici G, Sharma A, Williams E, Adamusiak T, Brandizi M, Sklyar N, Brazma A: ArrayExpress update–an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res. 2011, 39: D1002-1004. 10.1093/nar/gkq1040.
- Zheng-Bradley X, Rung J, Parkinson H, Brazma A: Large scale comparison of global gene expression patterns in human and mouse. Genome Biol. 2010, 11: R124. 10.1186/gb-2010-11-12-r124.
- Yadav SK: Cold stress tolerance mechanisms in plants. A review. Agronomy Sustainable Dev. 2010, 30 (3): 515-527. 10.1051/agro/2009050.
- Martin F, Bovet L, Cordier A, Stanke M, Gunduz I, Peitsch MC, Ivanov NV: Design of a tobacco exon array with application to investigate the differential cadmium accumulation property in two tobacco varieties. BMC Genomics. 2012, 13: 674. 10.1186/1471-2164-13-674.
- Zhou M, Shen C, Wu L, Tang K, Lin J: CBF-dependent signaling pathway: a key responder to low temperature stress in plants. Crit Rev Biotechnol. 2011, 31 (2): 186-192. 10.3109/07388551.2010.505910.
- Wang Y, Hua J: A moderate decrease in temperature induces COR15a expression through the CBF signaling cascade and enhances freezing tolerance. Plant J. 2009, 60 (2): 340-349. 10.1111/j.1365-313X.2009.03959.x.
- Mao D, Chen C: Colinearity and Similar Expression Pattern of Rice DREB1s Reveal Their Functional Conservation in the Cold-Responsive Pathway. PloS one. 2012, 7 (10): e47275. 10.1371/journal.pone.0047275.
- Knox AK, Dhillon T, Cheng H, Tondelli A, Pecchioni N, Stockinger EJ: CBF gene copy number variation at frost resistance-2 is associated with levels of freezing tolerance in temperate-climate cereals. TAG Theor Appl Genet. 2010, 121: 21-35. 10.1007/s00122-010-1288-7.
- Fernandez-Caballero C, Rosales R, Romero I, Escribano MI, Merodio C, Sanchez-Ballesta MT: Unraveling the roles of CBF1, CBF4 and dehydrin 1 genes in the response of table grapes to high CO2 levels and low temperature. J Plant Physiol. 2012, 169 (7): 744-748. 10.1016/j.jplph.2011.12.018.
- Siddiqua M, Nassuth A: Vitis CBF1 and Vitis CBF4 differ in their effect on Arabidopsis abiotic stress tolerance, development and gene expression. Plant Cell Environ. 2011, 34 (8): 1345-1359. 10.1111/j.1365-3040.2011.02334.x.
- Fujimoto SY, Ohta M, Usui A, Shinshi H, Ohme-Takagi M: Arabidopsis ethylene-responsive element binding factors act as transcriptional activators or repressors of GCC box–mediated gene expression. Plant Cell Online. 2000, 12 (3): 393-404. 10.1105/tpc.12.3.393.
- Zhang G, Chen M, Chen X, Xu Z, Guan S, Li LC, Li A, Guo J, Mao L, Ma Y: Phylogeny, gene structures, and expression patterns of the ERF gene family in soybean (Glycine max L.). J Exp Botany. 2008, 59 (15): 4095-4107. 10.1093/jxb/ern248.
- Zhang Z, Huang R: Enhanced tolerance to freezing in tobacco and tomato overexpressing transcription factor TERF2/LeERF2 is modulated by ethylene biosynthesis. Plant Mol Biol. 2010, 73 (3): 241-249. 10.1007/s11103-010-9609-4.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.