Preparation of promoter sequences
Human, mouse and rat promoter sequences were extracted from the genome assemblies available as of January 2008. The location of the transcriptional start site was approximated by the first nucleotide of the RefSeq mRNA transcript sequence, using the coordinates of the first exon of each transcript. For each transcript, the region from -2000 bp to +300 bp with respect to the transcriptional start site was extracted. Because a gene may have several different transcripts, it may also have several promoters.
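As an illustration only, the window extraction can be sketched as follows. The function names, the 1-based inclusive coordinate convention, and the choice to count the transcriptional start site as offset 0 are assumptions made for this sketch and are not taken from the original pipeline.

```python
# Hypothetical helpers; coordinates are 1-based and inclusive (assumption).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def promoter_window(tss, strand, upstream=2000, downstream=300):
    """Genomic (start, end) of the -upstream..+downstream window around the TSS."""
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        # On the minus strand, upstream lies at higher genomic coordinates.
        return tss - downstream, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def extract_promoter(chrom_seq, tss, strand, upstream=2000, downstream=300):
    """Slice the promoter from a chromosome string, oriented 5' to 3'."""
    start, end = promoter_window(tss, strand, upstream, downstream)
    start = max(start, 1)             # clamp to chromosome bounds
    end = min(end, len(chrom_seq))
    seq = chrom_seq[start - 1:end]    # convert to Python's 0-based slicing
    return reverse_complement(seq) if strand == "-" else seq
```

A gene with several transcripts would simply call `extract_promoter` once per transcript, yielding one promoter per distinct start site.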
The promoter sequences were masked against repetitive sequences, e.g., LINEs and SINEs, with the RepeatMasker program to avoid Transfac version 11.4 [16] matrix search hits in repetitive regions. Orthologous promoter sequences were then aligned with Wconsensus [17]. The orthologous relationships were taken from the NCBI Homologene database as of March 2008. For promoters with orthologous promoters in human, mouse and rat, a sliding window of 10 nucleotides was moved along the alignment, and regions were masked out as non-conserved where fewer than 5 nucleotides within the 10-nucleotide window were identical among the orthologous promoter sequences.
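The conservation masking can be sketched as below, under stated assumptions. The text does not say how a position covered by both a conserved and a non-conserved window is treated; this sketch keeps a position if at least one window covering it has enough identical columns. The function names and the use of `N` as the mask character are also assumptions.

```python
def conserved_positions(aligned, window=10, min_identical=5):
    """Flag alignment columns that lie in at least one conserved window.

    `aligned` is a list of equal-length aligned sequences (e.g. human,
    mouse, rat). A column counts as identical when every sequence has
    the same unambiguous base (A, C, G or T) at that column.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    identical = []
    for column in zip(*aligned):
        bases = {base.upper() for base in column}
        identical.append(len(bases) == 1 and bases <= set("ACGT"))
    keep = [False] * length
    for start in range(length - window + 1):
        if sum(identical[start:start + window]) >= min_identical:
            for i in range(start, start + window):
                keep[i] = True
    return keep


def mask_sequence(seq, keep, mask_char="N"):
    """Replace every position not flagged in `keep` with the mask character."""
    return "".join(base if flag else mask_char for base, flag in zip(seq, keep))
```

The opposite convention (mask a position if any window covering it fails the threshold) would be stricter; which of the two the original analysis used is not stated.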
Annotation of promoter sequences
Human-curated transcriptional factor binding sites from the Transfac database were used to record each transcription factor and its regulated genes for human sequences. In addition, the GeneGo Metacore database version 4.6 [18] was used to identify each transcriptional factor and its regulated genes. The Metacore database also reports whether the transcriptional regulation is an activation or an inhibition. For example, the human P53 gene transcriptionally regulates 609 target genes: it activates 206 of them and inhibits 84. The nature of its interactions with the remaining 319 genes is not explicitly stated. In total, our collection contains 822 human, 649 mouse, and 386 rat transcriptional factors.
Extraction of unknown transcriptional factor binding sites
Promoter regions annotated as known transcriptional factor binding sites were masked out. The remaining regions potentially contain novel transcriptional factor binding sites. All possible non-degenerate conserved 8-mer and 9-mer motifs with at least 5 identical nucleotides within a 10-nucleotide window among human, mouse and rat promoter sequences were enumerated. Their true significance remains to be evaluated in biological experiments.
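A minimal sketch of the enumeration step is given below. It assumes that known binding sites and non-conserved regions have already been replaced by a mask character (`N` here, an assumption), so that any 8-mer or 9-mer free of masked positions is a candidate. The function name is hypothetical.

```python
def enumerate_kmers(masked_seq, ks=(8, 9), mask_char="N"):
    """Collect the distinct non-degenerate k-mers that avoid masked positions."""
    seq = masked_seq.upper()
    alphabet = set("ACGT")
    found = set()
    for k in ks:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers touching a masked or ambiguous base.
            if mask_char not in kmer and set(kmer) <= alphabet:
                found.add(kmer)
    return found
```

Each candidate k-mer can then be mapped to the set of genes whose promoters contain it, which is the input needed for the significance tests described later.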
Normalization of Affymetrix GeneChip arrays
Affymetrix Mouse 430 version 2 microarrays were used to measure gene expression values. Normalization was carried out using the GC-RMA method [19]. Values were exponentiated (base 2) to return them to a linear scale and then scaled to a 2% trimmed mean of 150. Probe sets whose average raw value among replicates was less than 100 in both conditions were removed.
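The rescaling and filtering steps can be sketched as follows; GC-RMA itself is not reproduced. The sketch assumes that "2% trimmed mean" means discarding 2% of the values from each tail before averaging, which the text does not specify. All function names are hypothetical.

```python
def trimmed_mean(values, trim_fraction=0.02):
    """Mean after dropping `trim_fraction` of the values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


def scale_to_trimmed_mean(log2_values, target=150.0, trim_fraction=0.02):
    """Return linear-scale values rescaled so their trimmed mean equals `target`."""
    linear = [2.0 ** value for value in log2_values]
    factor = target / trimmed_mean(linear, trim_fraction)
    return [value * factor for value in linear]


def keep_probe_set(raw_a, raw_b, threshold=100.0):
    """Drop a probe set only when its mean raw value is low in BOTH conditions."""
    mean_a = sum(raw_a) / len(raw_a)
    mean_b = sum(raw_b) / len(raw_b)
    return not (mean_a < threshold and mean_b < threshold)
```

A probe set that is expressed in only one condition is therefore retained, which preserves genes that are switched on or off by the treatment.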
Calculation of the moderated t statistic for each probe set
The traditional Student's t-test statistic is often used to assess the significance of individual probe sets between two conditions, e.g., treatment group versus control group. However, there are usually only a few replicates (typically three) within each group. With such a small sample size, the variance is difficult to estimate reliably, which makes the t-statistic itself unreliable. To address this problem, the moderated t-test [20] implemented in the Limma package of the Bioconductor project [21] was adopted to evaluate the significance of individual probe sets between the two groups. The moderated t-test assumes a common distribution for the error variances of all genes and uses an empirical Bayes method to estimate the variance of each individual gene, replacing the traditional standard deviations with posterior residual standard deviations to accommodate the low number of replicates in each group [20]. Up-regulated genes and down-regulated genes have positive and negative moderated t-values, respectively. If a gene is represented by several probe sets, the moderated t-statistic with the highest absolute value is used to represent that gene.
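The collapse from probe sets to genes can be sketched as below. The moderated t-statistics are assumed to have been computed beforehand (for example with Limma); the function name and the two dictionary arguments are hypothetical.

```python
def gene_level_t(probe_t, probe_to_gene):
    """Keep, for each gene, the moderated t-statistic with the largest |t|.

    probe_t:       {probe_set_id: moderated t-statistic}
    probe_to_gene: {probe_set_id: gene identifier}
    Probe sets without a gene annotation are skipped.
    """
    best = {}
    for probe, t_value in probe_t.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue
        if gene not in best or abs(t_value) > abs(best[gene]):
            best[gene] = t_value   # the sign is preserved
    return best
```

Because the sign is preserved, the gene-level value still indicates whether the gene is up- or down-regulated.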
Evaluation of the significance of a single motif
The CisTransMine method extends the MotifADE framework to identify significant transcriptional factor binding sites enriched between two microarray conditions. MotifADE uses a two-tailed non-parametric Mann-Whitney rank sum U statistic to evaluate the significance of a motif. Specifically, for each motif, t-statistics for all the genes are divided into two groups: one group containing t-statistics for genes having the motif of interest in their promoter region and the other group for genes not having the motif in their promoter regions. The null hypothesis is that there is no difference between the means of the ranks of these two sets of t-statistics; the alternative hypothesis is that the means of the ranks of these two sets are not equal, i.e., genes containing the motif are either up-regulated or down-regulated (Figure 4).
When a transcriptional factor enhances the transcription of some genes and represses the transcription of others at the same time, the two-tailed Mann-Whitney test can fail to detect the significance of its motif. For a given motif, the test calculates the rank sum over all genes having that motif, regardless of whether they are up-regulated, down-regulated, or not regulated. If approximately equal numbers of up- and down-regulated genes carry a particular motif, the contribution of the up-regulated genes is more or less cancelled out by that of the down-regulated genes. As a result, the motif is computed to be statistically insignificant. For example, in Figure 5, Motif 1 and Motif 3 would have the same p-value under the two-tailed Mann-Whitney test: only the t-value 0.9 matters, because all other t-values of Motif 1 and Motif 3 are symmetric with respect to 0 and contribute the same to the rank sum as a t-value of 0 does. Yet Motif 1 is more significant than Motif 3, since several genes containing Motif 1 are more strongly down- or up-regulated than the most extreme genes containing Motif 3.
An approach using absolute values was implemented to solve this problem [22], in which absolute enrichment can identify important gene sets that two-tailed methods may miss. The CisTransMine method is proposed to test up-regulated genes and down-regulated genes separately for statistical significance, using the one-tailed non-parametric Mann-Whitney test. For up-regulated genes, the null hypothesis is that the mean of the ranks of the up-regulated genes containing the motif is equal to the mean of the ranks of the up-regulated genes not containing the motif; the alternative hypothesis is that the mean of the ranks of the genes containing the motif is greater. For down-regulated genes, the null hypothesis is the corresponding equality among the down-regulated genes, and the alternative hypothesis is that the mean of the ranks of the genes containing the motif is less than that of the genes not containing it. Thus, the significance of a motif is tested separately in up-regulated genes and in down-regulated genes.
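A minimal sketch of the separate one-tailed tests follows. It uses the large-sample normal approximation to the Mann-Whitney U statistic without tie or continuity corrections, which is a simplification and not necessarily what the original implementation used. Splitting genes by the sign of their t-statistic, and all function names, are assumptions of this sketch.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def mann_whitney_one_tailed(x, y, alternative="greater"):
    """Approximate one-tailed p-value that x ranks higher (or lower) than y."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    z = (u1 - n1 * n2 / 2.0) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    if alternative == "greater":
        return 0.5 * math.erfc(z / math.sqrt(2.0))
    if alternative == "less":
        return 0.5 * math.erfc(-z / math.sqrt(2.0))
    raise ValueError("alternative must be 'greater' or 'less'")


def motif_pvalues(gene_t, genes_with_motif):
    """Test a motif separately among up- and down-regulated genes."""
    result = {}
    for label, alternative in (("up", "greater"), ("down", "less")):
        if label == "up":
            subset = {g: t for g, t in gene_t.items() if t > 0}
        else:
            subset = {g: t for g, t in gene_t.items() if t < 0}
        inside = [t for g, t in subset.items() if g in genes_with_motif]
        outside = [t for g, t in subset.items() if g not in genes_with_motif]
        if inside and outside:
            result[label] = mann_whitney_one_tailed(inside, outside, alternative)
        else:
            result[label] = None   # test undefined when one group is empty
    return result
```

A motif whose genes are split evenly between strong up- and strong down-regulation can obtain a small p-value in both directions here, whereas a single two-tailed test over all genes would let the two effects cancel.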
Synergistic motifs
In eukaryotic genomes, a synergistic relationship is present when multiple transcriptional factors work in concert to regulate target genes; e.g., the combinatorial activities of multiple transcriptional factors regulate B cell lineage commitment and differentiation [23]. In the CisTransMine method, synergistic relationships between two transcriptional factors are detected in a two-step process. First, let TFA and TFB denote the sets of genes containing binding sites for transcriptional factor A and for transcriptional factor B, respectively, in their promoter regions. The genes containing both types of binding sites are denoted by TFA ∩ TFB. The genes containing transcriptional factor A binding sites but not transcriptional factor B binding sites are denoted by TFA − TFB, and the genes containing transcriptional factor B binding sites but not transcriptional factor A binding sites are denoted by TFB − TFA. For up-regulated genes (and, separately, for down-regulated genes), the necessary conditions for true synergy between the two transcriptional factors are that (1) the one-tailed Mann-Whitney rank sum test P-value between the genes in TFA ∩ TFB and the genes in TFA − TFB is less than 0.05, and (2) the one-tailed Mann-Whitney rank sum test P-value between the genes in TFA ∩ TFB and the genes in TFB − TFA is less than 0.05. If both conditions are satisfied, the algorithm proceeds to the second step, where the significance of the synergistic relationship between the two transcriptional factors is tested with the same one-tailed Mann-Whitney rank sum test used for a single motif.
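The first step, the check of the two necessary conditions, can be sketched with plain set operations. The one-tailed test is passed in as a callable (`pvalue_fn`, a hypothetical parameter) so that the sketch does not commit to a particular implementation or direction of the test. `gene_t` is assumed to hold only the up-regulated genes, or only the down-regulated genes, for the direction being examined.

```python
def partition_genes(genes_a, genes_b):
    """Return (both, only A, only B) gene sets for two binding-site gene sets."""
    return genes_a & genes_b, genes_a - genes_b, genes_b - genes_a


def passes_synergy_conditions(gene_t, genes_a, genes_b, pvalue_fn, alpha=0.05):
    """Check the two necessary conditions for synergy between factors A and B.

    gene_t:    {gene: t-statistic} for one direction of regulation
    pvalue_fn: callable (x, y) -> one-tailed p-value comparing x against y
    """
    both, only_a, only_b = partition_genes(genes_a, genes_b)
    t_both = [gene_t[g] for g in both if g in gene_t]
    t_only_a = [gene_t[g] for g in only_a if g in gene_t]
    t_only_b = [gene_t[g] for g in only_b if g in gene_t]
    if not (t_both and t_only_a and t_only_b):
        return False   # a comparison group is empty, so no test is possible
    return (pvalue_fn(t_both, t_only_a) < alpha
            and pvalue_fn(t_both, t_only_b) < alpha)
```

Pairs that pass would then go on to the second step, in which the genes carrying both binding sites are tested as if the pair were a single motif.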
Multiple testing correction
To reduce the false positive rate, a multiple testing correction must be applied to account for the thousands of null hypotheses tested at the same time. The multiple testing correction method we adopt is the False Discovery Rate (FDR) q-value [24]. The FDR q-value is a measure of the rate of false discovery estimated from the distribution of p-values. The q-value method was chosen because it balances specificity and sensitivity without an a priori p-value cutoff (see [24] for details).
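For orientation, a simple FDR adjustment can be sketched as below. This is the Benjamini-Hochberg step-up adjustment, used here as a stand-in; it is not claimed to be the q-value procedure of [24], which may differ, for instance in how the proportion of true null hypotheses is estimated. The function name is hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

Motifs would then be reported when their adjusted value falls below the chosen FDR level, rather than when the raw p-value falls below a fixed cutoff.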