Uncovering mechanisms of transcriptional regulations by systematic mining of cis regulatory elements with gene expression profiles
© Ma et al; licensee BioMed Central Ltd. 2008
Received: 11 January 2008
Accepted: 17 July 2008
Published: 17 July 2008
Contrary to the traditional biology approach, where the expression patterns of a handful of genes are studied at a time, microarray experiments enable biologists to study the expression patterns of many genes simultaneously from gene expression profile data and decipher the underlying hidden biological mechanism from the observed gene expression changes. While the statistical significance of the gene expression data can be deduced by various methods, the biological interpretation of the data presents a challenge.
A method, called CisTransMine, is proposed to help infer the underlying biological mechanisms for the observed gene expression changes in microarray experiments. Specifically, this method will predict potential cis-regulatory elements in promoter regions which could regulate gene expression changes. This approach builds on the MotifADE method published in 2004 and extends it with two modifications: up-regulated genes and down-regulated genes are tested separately and in addition, tests have been implemented to identify combinations of transcription factors that work synergistically. The method has been applied to a genome wide expression dataset intended to study myogenesis in a mouse C2C12 cell differentiation model. The results shown here both confirm the prior biological knowledge and facilitate the discovery of new biological insights.
The results validate that the CisTransMine approach is a robust method to uncover the hidden transcriptional regulatory mechanisms that can facilitate the discovery of mechanisms of transcriptional regulation.
High-throughput microarray experiments have modernized biological experiments by enabling measurements of expression levels for genes on the genome scale under different conditions. Hundreds or thousands of genes may be differentially expressed between conditions due to the effects of a variety of transcriptional factors or their co-factors. It is challenging to be able to interpret these changes in a biological context. Understanding the transcription regulation mechanisms between transcriptional factors and their target genes is one of the key ways to formulate hypotheses about the root causes of the observed changes.
Unveiling mechanisms of transcription regulation is an active bioinformatics research area. Different approaches have been proposed to discover mechanisms of transcription regulation. Bayesian network approaches have been applied  to integrate motif discovery in promoters with the analysis of gene expression data. Some approaches  split motifs and gene expression values of regulators to build a decision tree based on the combination of expression ratios of transcription factors and presence/absence of the motifs. Yet other approaches [3, 4] fit gene expression data to a linear model using weights depending on whether a transcriptional factor is an inducer or repressor. Mootha et. al.  uses a two-tailed non-parametric Mann-Whitney (Wilcoxon) rank sum test to determine significance of motifs in promoter regions. The MotifADE method  assumes that if up-regulated or down-regulated genes which contain certain transcriptional factor binding sites are co-ordinately regulated, changes in their expression levels could be explained by those transcriptional factors. On the other hand, if genes which contain the same transcriptional factor binding sites are not co-ordinately regulated, there may not be any association between genes and transcriptional factors. In particular, the MotifADE algorithm works in three steps: (1) rank genes based on differential expression between two conditions using the signal-to-noise ratio as the difference metric in descending order (the signal-to-noise ratio is used as opposed to the fold change value based on the expression level since the former also takes into account the standard deviation); (2) For each motif, identify the group of genes whose promoter regions contains the motif; and finally (3) apply the two-tailed non-parametric Mann-Whitney rank sum test to determine if these genes tend to be enriched toward the top or bottom of the ranked list (indicating association) or tend to be randomly distributed on the list (indicating no association).
In our hands, we have observed that two-tailed non-parametric Mann-Whitney rank sum tests used by MotifADE method cannot detect significances of transcriptional factors if they induce the transcription of some genes and repress the transcription of other genes at the same time (see discussion). We have therefore extended the MotifADE method to investigate up-regulated and down-regulated genes separately since a transcriptional factor may simultaneously enhance the transcription of certain genes and inhibit the transcription of other genes. We have also introduced a method to identify the synergistic effects between pairs of transcriptional factors. The CisTransMine method is applied to a mouse C2C12 differentiation dataset , where it implicates several known myogenic and cell cycle facts as well as a novel transcriptional factor binding site which regulates known target genes. These results demonstrate that the CisTransMine method is an important tool to discover unknown transcription regulation mechanisms, thus facilitating in extending biological knowledge.
Results for known transcriptional factors
Significant transcriptional factors in up-regulated genes from Day 1 to Day 2
Significant transcriptional factors in down-regulated genes from Day 1 to Day 2
Significant synergistic transcriptional factors in down-regulated genes from Day 1 to Day 2
Results for unknown transcriptional factors
Tested novel motifs with mutagenesis
Known nearby Transcriptional
factor binding sites
SREBP-1, MEF2, MEF3
This experiment demonstrated the potential of this method to successfully identify novel functional motifs. Such an approach may be extended to differential gene expression within a variety of disease-related settings and cell types, with potential relevance to disease pathway discovery.
In the post-genomics area, there is a sea of biological data including microarray experimental data. This provides an unprecedented opportunity and challenge to fully decipher the underlying biological system. One aspect of this analysis is to analyze significantly enriched pathways where coordinated but sometimes subtle expression changes are observed among genes . Though the pathway analysis provides a way to see "forests, not individual trees", it can not address the transcription regulation mechanisms which govern the observed gene expression level changes. Thus deciphering transcription regulation mechanisms help characterize the underlying biological process. Different approaches have been proposed to help decipher transcription regulation mechanisms including Bayesian networks, decision trees, and regression models. In this paper, the CisTransMine method has been implemented to identify transcriptional factors involved in biological processes through the analysis of microarray data.
Significant transcriptional factors identified by the two-tailed non-parametric
In summary, preliminary results identified the relevant transcriptional factors involved in a mouse C2C12 cell model of myogenesis, demonstrating the potential of this method to identify the transcriptional regulatory mechanisms in profiling experiments. We expect that the application of this method to other systems will yield similar results and lead to novel hypotheses regarding the roles of various transcription factors in specific biological systems.
The CisTransMine method was implemented in R, Perl, and C++ and is available upon request. The CisTransMine method was applied to a gene expression profiling experiment of mouse C2C12 skeletal muscle myoblast differentiation to myotubes. The dataset is available from NCBI GEO database .
Preparation for promoter sequences
The human, mouse and rat promoter sequences were extracted from the genome assembly as of January 2008. The location of the transcriptional start site was approximated by the first nucleotide in the RefSeq mRNA transcript sequence. For each gene, promoter sequences with respect to their transcripts were extracted according to coordinates of first exons for corresponding transcripts. For each transcript, the region from -2000 bp to +300 bp with respect to the transcriptional start site was extracted. A gene may have several different transcripts, therefore several promoters.
The promoter sequences were masked against repetitive sequences, e.g., LINEs and SINEs with the RepeatMasker program to avoid any Transfac version 11.4  matrix search hits in those repetitive regions. Then orthologous promoter sequences were aligned together with Wconsensus . The orthologous relationships were defined in the NCBI Homologene database as of March 2008. For those promoters with orthologous promoters in human, mouse and rat, a sliding window of 10 nucleotides was used and non-conserved regions were masked out where promoter sequence identities among orthologous promoter sequences had a length of less than 5 nucleotides within a 10 nucleotide window.
Annotation of promoter sequences
Human-curated transcriptional factor binding sites from the Transfac database were used to record each transcription factor and its regulated genes for human sequences. In addition, the GeneGo Metacore database version 4.6  was used to identify each transcriptional factor and its regulated genes. The Metacore database also reports whether the relationship is the activation or inhibition effect by the transcription regulation, e.g., the human P53 gene regulates 609 target genes by the transcription regulation: among these 609 genes, it transcriptionally activates 206 genes and inhibits 84 genes. Its nature of its interactions with the remaining 319 genes is not explicitly stated. In total, there are a total of 822 human transcriptional factors, 649 mouse transcriptional factors, and 386 rat transcriptional factors in our collection.
Extraction of unknown transcriptional factor binding sites
Promoter sequence regions which have been annotated as known transcriptional factor binding sites were masked out. The remaining regions contain potentially novel transcriptional factor binding sites. All possible non-degenerative conserved 8-mer and 9-mer motifs which have at least 5 identical nucleotides within a 10 nucleotide window among human, mouse and rat promoter sequences were enumerated. Their true significance would be evaluated in biological experiments.
Normalization of affymetrix genechip arrays
Affymetrix mouse 430 version 2 microarrays were used to measure gene expression values. Normalization in our analysis was carried out using the GC-RMA normalization method . Values were exponentiated (base 2) to return them to a linear scale and scaled to a 2% trimmed mean of 150. We removed probe sets which have average raw values among replicates less than 100 for both conditions.
Calculation of the moderated t statistic for each probe set
The traditional student t-test statistic is often used to assess the significance of individual probe sets between two conditions, e.g., treatment group versus control group. However, there are usually only a few replicates (usually three) within each group. Given such a small sample size, it is difficult to estimate the variance reliably. This makes the estimation of the t-statistic problematic. To address this problem, the moderated t-test  implemented in the Limma package within the Bioconductor package  is adopted to evaluate the significance of individual probe sets between the two groups. The moderated t-test assumes the same distribution for the error variance of all genes in order to estimate the variance of an individual gene with an empirical Bayes method, using posterior residual standard deviations instead of traditional standard deviations, to accommodate for the low number of replicates for each group . Up-regulated genes and down-regulated genes have positive and negative moderated t-values respectively. If a gene is represented by several probe sets, the moderated t-statistic with the highest absolute value is used to represent the moderated t- statistic for that gene.
Evaluation of the significance of a single motif
An approach using absolute values was implemented to solve this problem  where the absolute enrichment can identify important gene sets that may not be identified by two-tailed methods. The CisTransMine method is proposed to test up-regulated genes and down-regulated genes separately for statistical significance by using the one-tailed non-parametric Mann-Whitney test. For up-regulated (and down-regulated respectively) genes, the null hypothesis is that the mean of the ranks in the up-regulated (and down-regulated respectively) genes containing the motif is equal to the mean of ranks in the up-regulated (and down-regulated respectively) genes not containing the motif; the alternative hypothesis is that the mean of ranks in the up-regulated (and down-regulated respectively) genes containing the motif is greater than (less than respectively) the mean of ranks in the up-regulated (and down-regulated respectively) genes not containing the motif. Thus, significances for motifs in up-regulated genes and down-regulated genes are tested separately.
In eukaryotic genomes, a synergistic relationship is present when multiple transcriptional factors work in concert to regulate target genes, e.g., combinatorial activities of multiple transcriptional factors regulate the B cell lineage commitment and differentiation . In the CisTransMine method, synergistic relationships between two transcriptional factors are detected in a two-step process. First, the genes containing transcriptional factor A binding sites (TFA) and transcriptional factor B binding sites (TFB) in the promoter regions can be denoted by TFA ∩ TFB, which is a subset of genes containing both types of binding sites. All the genes containing transcriptional factor A binding sites but not transcriptional factor B binding sites can be denoted by TFA- TFB. All the genes containing transcriptional factor B binding sites but not transcriptional factor A binding sites can be denoted by TFB- TFA. For up-regulated (and down-regulated respectively) genes, the necessary conditions for the true synergy between two transcriptional factors to exist are that (1) one-tailed Mann Whitney rank sum test P-value between genes in the set of TFA ∩ TFB and the genes in the set of TFA- TFB is less than 0.05, (2) one-tailed Mann Whitney rank sum test P-value between genes in the set of TFA ∩ TFB and the genes in the set of TFB- TFA, is less than 0.05. If the necessary conditions are satisfied, the algorithm proceeds to the second step where the significance of the synergistic relationship between the two transcriptional factors is tested with the same method as that for the single motif with the one-tailed Mann-Whitney rank sum test.
Multiple testing correction
In order to reduce the false positive rate, multiple testing correction method must be applied to take into account that thousands of null hypotheses are tested at the same time. The multiple testing correction method we adopt is the False Discovery Rate (FDR) q-value . The FDR q-value is a measure of the rate of false discovery from the distribution of p-values. The FDR q-value method is chosen since it can balance between the specificity and the sensitivity without a priori p-value cutoff (see reference for details).
We thank Liam O'Connor and Richard Cai for reviewing the manuscript and Leah Martell for statistical advice.
- Segal E, Yelensky R, Koller D: Genome-wide Discovery of Transcriptional Modules from DNA Sequence and Gene Expression. Bioinformatics. 2003, 19 (Suppl 1): i273-282. 10.1093/bioinformatics/btg1038.View ArticlePubMedGoogle Scholar
- Middendorf M, Kundaje A, Wiggins C, Freund Y, Leslie C: Predicting Genetic Regulatory Response Using Classification. Bioinformatics. 2004, 20 (Suppl 1): i232-240. 10.1093/bioinformatics/bth923.View ArticlePubMedGoogle Scholar
- Chen KC, Wang TY, Tseng HH, Huang CYF, Kao CY: A stochastic differential equation model for quantifying transcriptional regulatory network in Saccharomyces cerevisiae. Bioinformatics. 2005, 21 (12): 2883-2890. 10.1093/bioinformatics/bti415.View ArticlePubMedGoogle Scholar
- Stephen Yeung MK, Tegnér J, Collins JJ: Reverse engineering gene networks using singular value decomposition and robust regression. Proc Natl Acad Sci USA. 2002, 99 (9): 6163-6168. 10.1073/pnas.092576199.View ArticleGoogle Scholar
- Mootha VK, Handschin C, Arlow D, Xie X, St Pierre J, Sihag S, Yang W, Altshuler D, Puigserver P, Patterson N, Willy PJ, Schulman IG, Heyman RA, Lander ES, Spiegelman BM: Erralpha and Gabpa/b specify PGC-1alpha-dependent oxidative phosphorylation gene expression that is altered in diabetic muscle. Proc Natl Acad Sci USA. 2004, 101 (17): 6570-6575. 10.1073/pnas.0401401101.View ArticlePubMedPubMed CentralGoogle Scholar
- Szustakowski JD, Lee J, Marrese CA, Kosinski PA, Nirmala NR, Kemp DM: Identification of novel pathway regulation during myogenic differentiation. Genomics. 2006, 87 (1): 129-138. 10.1016/j.ygeno.2005.08.009.View ArticlePubMedGoogle Scholar
- Lluís F, Ballestar E, Suelves M, Esteller M, Muñoz-Cánoves P: E47 phosphorylation by p38 MAPK promotes MyoD/E47 association and muscle-specific gene transcription. EMBO J. 2005, 24 (5): 974-84. 10.1038/sj.emboj.7600528.View ArticlePubMedPubMed CentralGoogle Scholar
- Li S, Czubryt MP, McAnally J, Bassel-Duby R, Richardson JA, Wiebel FF, Nordheim A, Olson EN: Requirement for serum response factor for skeletal muscle growth and maturation revealed by tissue-specific gene deletion in mice. Proc Natl Acad Sci USA. 2005, 102 (4): 1082-7. 10.1073/pnas.0409103102.View ArticlePubMedPubMed CentralGoogle Scholar
- Johnson PF: Molecular stop signs: regulation of cell-cycle arrest by C/EBP transcription factors. J Cell Sci. 2005, 118 (12): 2545-55. 10.1242/jcs.02459.View ArticlePubMedGoogle Scholar
- Silva JL, Giannocco G, Furuya DT, Lima GA, Moraes PA, Nachef S, Bordin S, Britto LR, Nunes MT, Machado UF: NF-kappaB, MEF2A, MEF2D and HIF1-a involvement on insulin- and contraction-induced regulation of GLUT4 gene expression in soleus muscle. Mol Cell Endocrinol. 2005, 240 (1–2): 82-93. 10.1016/j.mce.2005.05.006.View ArticlePubMedGoogle Scholar
- Tabach Y, Milyavsky M, Shats I, Brosh R, Zuk O, Yitzhaky A, Mantovani R, Domany E, Rotter V, Pilpel Y: The promoters of human cell cycle genes integrate signals from two tumor suppressive pathways during cellular transformation. Mol Syst Biol. 2005, 1: 0022-10.1038/msb4100030.View ArticlePubMedGoogle Scholar
- Wang IC, Chen YJ, Hughes D, Petrovic V, Major ML, Park HJ, Tan Y, Ackerson T, Costa RH: Forkhead box M1 regulates the transcriptional network of genes essential for mitotic progression and genes encoding the SCF (Skp2-Cks1) ubiquitin ligase. Mol Cell Biol. 2005, 25 (24): 10875-94. 10.1128/MCB.25.24.10875-10894.2005.View ArticlePubMedPubMed CentralGoogle Scholar
- Barre B, Perkins ND: A cell cycle regulatory network controlling NF-kappaB subunit activity and function. EMBO J. 2007, 26 (23): 4841-55. 10.1038/sj.emboj.7601899.View ArticlePubMedPubMed CentralGoogle Scholar
- Curtis RK, Oresic M, Vidal-Puig A: Pathways to the analysis of microarray data. Trends Biotechnol. 2005, 23 (8): 429-35. 10.1016/j.tibtech.2005.05.011.View ArticlePubMedGoogle Scholar
- Mouse C2C12 differentiation time course dataset. [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11415]
- Matys V, Fricke E, Geffers R, Gößling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Münch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003, 31 (1): 374-378. 10.1093/nar/gkg108.View ArticlePubMedPubMed CentralGoogle Scholar
- Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics. 1999, 15 (7–8): 563-77. 10.1093/bioinformatics/15.7.563.View ArticlePubMedGoogle Scholar
- Ekins S, Nikolsky Y, Bugrim A, Kirillov E, Nikolskaya T: Pathway mapping tools for analysis of high content data. Methods Mol Biol. 2007, 356: 319-50.PubMedGoogle Scholar
- Irizarry RA, Wu JZ, Jaffee HA: Comparison of Affymetrix GeneChip expression measures. Bioinformatics. 2007, 22 (7): 789-794. 10.1093/bioinformatics/btk046.View ArticleGoogle Scholar
- Smyth GK: Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Statistical Applications in Genetics and Molecular Biology. 2004, 3 (1): Article 3-10.2202/1544-6115.1027.View ArticleGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5 (10): R80-10.1186/gb-2004-5-10-r80.View ArticlePubMedPubMed CentralGoogle Scholar
- Saxena V, Orgill D, Kohane I: Absolute enrichment: gene set enrichment analysis for homeostatic systems. Nucleic Acids Res. 2006, 34 (22): e151-10.1093/nar/gkl766.View ArticlePubMedPubMed CentralGoogle Scholar
- Nutt SL, Kee BL: The transcriptional regulation of B cell lineage commitment. Immunity. 2007, 26 (6): 715-25. 10.1016/j.immuni.2007.05.010.View ArticlePubMedGoogle Scholar
- Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci USA. 2003, 100 (16): 9440-9445. 10.1073/pnas.1530509100.View ArticlePubMedPubMed CentralGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.