Maxmean statistic and restandardization in GSA
In GSA the gene-level test statistics are first converted to z statistics using quantile functions, and the z values are then aggregated into a gene-set-level maxmean statistic. A restandardization procedure compares the maxmean statistic against permutations while also taking into account sets formed by random selections of genes. Given gene-level z statistics \(z_{i}\), i=1,…,n, let \(\mathcal {S}\) be the indices of the gene set and \(n_{\mathcal {S}}\) be the size of \(\mathcal {S}\). The maxmean statistic S is defined as,
$$ \begin{aligned} S_{+} & = \frac{1}{n_{\mathcal{S}}} \sum_{i \in \mathcal{S}} z_{i} I\{z_{i} > 0\} \, {,} \\ S_{-} & = -\frac{1}{n_{\mathcal{S}}}\sum_{i \in \mathcal{S}} z_{i} I\{z_{i} < 0\} \, {,} \\ \mathrm{S} & = \max(S_{+}, S_{-}) \, {.} \end{aligned} $$
(1)
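As a concrete illustration, the statistic in (1) can be computed as in the following minimal Python sketch (the function name is ours); note that both \(S_{+}\) and \(S_{-}\) divide by the full set size \(n_{\mathcal{S}}\), not by the number of positive or negative genes:

```python
import numpy as np

def maxmean(z):
    """Maxmean statistic of Eq. (1) for the z-values of one gene set."""
    z = np.asarray(z, dtype=float)
    # Both averages use the full set size n_S, not the count of positives/negatives.
    s_plus = z[z > 0].sum() / z.size
    s_minus = -z[z < 0].sum() / z.size
    return max(s_plus, s_minus)
```

For example, for z-values (2, 2, −1) the statistic is \(\max(4/3, 1/3) = 4/3\).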
Assessing the statistical significance of the maxmean statistic requires a null distribution. Two operations are commonly used to obtain such a null distribution: randomization and permutation. For randomization, the null is constructed by randomly sampling a large number of gene sets (row sampling). The drawback of randomization is that random sets do not preserve the correlation structure among genes that real sets have. For permutation, the null is estimated by permuting the columns (sample labels) so that genes remain in the same set, which preserves the correlation structure. The drawback of permutation is that it ignores the distribution of the maxmean statistics of the other gene sets. The restandardization procedure in GSA combines randomization and permutation. Given z statistics \(z_{i}\) (i=1,…,n) and a gene set \(\mathcal {S}\) with \(n_{\mathcal {S}}\) genes, the algorithm works as follows:
- 1) Randomization: calculate the mean and standard deviation of the maxmean statistics of randomized gene sets, denoted by \(\mu^{\dagger}\) and \(\sigma^{\dagger}\).
- 2) Standardization: compute the standardized maxmean \(S^{*} = (S - \mu^{\dagger})/\sigma^{\dagger}\).
- 3) Permutation: permute the sample labels B times and compute the standardized maxmean for each permuted data set as in 1) and 2), yielding \(S^{*b}\), b=1,…,B.
To assess how significantly a gene set is associated with the outcome, the p-value is calculated by comparing \(S^{*}\) against the permutations, \(p = \sum _{b=1}^{B} I\{ S^{*b} > S^{*}\}/B\). With restandardization, the null distribution for S is essentially the permuted maxmean statistics rescaled by \(\sigma^{\dagger}\) and relocated by \(\mu^{\dagger}\). Therefore the maxmean statistic is compared against the permutations while also accounting for the distribution of the randomized sets.
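The three steps and the p-value can be sketched as follows. This is a simplified illustration, not the GSA package implementation: the number of randomized sets (`n_rand`) and the convention that permuted z-vectors are supplied as `perm_z` (one vector per label permutation) are assumptions of the sketch.

```python
import numpy as np

def maxmean(z):
    z = np.asarray(z, dtype=float)
    return max(z[z > 0].sum() / z.size, -z[z < 0].sum() / z.size)

def restandardized_pvalue(z, set_idx, perm_z, n_rand=1000, seed=0):
    """Restandardized p-value: the set's standardized maxmean is compared
    against standardized maxmeans from label permutations, where each
    standardization uses moments from random (row-sampled) gene sets."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    n, n_s = z.size, len(set_idx)

    def standardize(zvec):
        # 1) Randomization: moments of the maxmean over random sets of size n_S
        rand = [maxmean(zvec[rng.choice(n, n_s, replace=False)])
                for _ in range(n_rand)]
        mu, sd = np.mean(rand), np.std(rand)
        # 2) Standardization: S* = (S - mu_dagger) / sigma_dagger
        return (maxmean(zvec[set_idx]) - mu) / sd

    s_star = standardize(z)
    # 3) Permutation: standardized maxmean for each permuted data set
    s_perm = [standardize(np.asarray(zb, dtype=float)) for zb in perm_z]
    return float(np.mean([sb > s_star for sb in s_perm]))
```

For an enriched set (large positive z-values relative to the background), the standardized observed statistic exceeds most permuted values and the p-value is small.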
Restandardization weighted by gene length
In the randomization step of GSA, all genes are sampled with equal probability. In particular, two sets with the same number of genes will have the same \(\mu^{\dagger}\) and \(\sigma^{\dagger}\) regardless of the lengths of the genes in the set. As the method was originally developed for microarrays, it does not account for the potential bias in RNA-seq shown in the introduction. To adjust for gene length bias we propose a weighted restandardization algorithm, which uses similarity of gene length as the sampling weight in the randomization step of the restandardization. The empirical cumulative distribution function (CDF) of gene length is:
$$ \hat{F}(x) = \sum\limits_{i=1}^{n} I\{l_{i} \le x\}/n \, {,} $$
(2)
where \(l_{i}\) is the number of base pairs of gene i and n is the total number of genes. Note that \(l_{i}\) can be retrieved from public databases, e.g. Ensembl (http://www.ensembl.org/) and RefSeq (http://www.ncbi.nlm.nih.gov/refseq/).
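A direct implementation of (2) is short (illustrative sketch; the function name is ours):

```python
import numpy as np

def length_cdf(lengths):
    """Return the empirical CDF F_hat of Eq. (2): F_hat(x) = #{l_i <= x} / n."""
    lengths = np.asarray(lengths)
    return lambda x: float(np.mean(lengths <= x))
```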
For a set \(\mathcal {S}\), the process of constructing a randomized set can be viewed as stepwise replacement of the genes in \(\mathcal {S}\) with genes drawn from all n genes. Instead of assigning equal probability to all genes, we let \(q_{ij}\) be the probability of replacing gene i in \(\mathcal {S}\) by gene j, weighted by \(1-\vert \hat {F}(l_{i})-\hat {F}(l_{j}) \vert \),
$$ \begin{aligned} w_{ij} &= 1-\vert\hat{F}(l_{i})-\hat{F}(l_{j})\vert \quad \forall \quad i \in \mathcal{S} \quad \text{and} \quad j = 1,\ldots,n \, {,} \\ q_{ij} & = \frac{w_{ij}}{\sum_{j=1}^{n} w_{ij}} \, {.} \end{aligned} $$
(3)
The probability \(q_{ij}\) is large when gene i and gene j have similar lengths and small when their lengths are very different. For a set \(\mathcal {S}\), let \(q_{\mathcal {S}j}\) be the probability of selecting gene j (j=1,2,…,n) into a randomized set. Since \(n_{\mathcal {S}} \ll n\) in practice, the following approximation can be made,
$$ q_{\mathcal{S} j} \approx \frac{\sum_{i \in \mathcal{S}} q_{ij}}{\sum_{j=1}^{n} \sum_{i \in \mathcal{S}} q_{ij}} \, {.} $$
(4)
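The weights in (3) and the approximation in (4) can be sketched as follows. The function name is ours, \(\hat{F}\) is evaluated via ranks on the sorted lengths, and for simplicity the randomized set is drawn without replacement with probabilities \(q_{\mathcal{S}j}\):

```python
import numpy as np

def weighted_set_probs(lengths, set_idx):
    """Selection probabilities q_Sj of Eqs. (3)-(4) for a gene set."""
    l = np.asarray(lengths, dtype=float)
    n = l.size
    # Empirical CDF evaluated at each gene's length: F_hat(l_i) = #{l_j <= l_i}/n
    F = np.searchsorted(np.sort(l), l, side="right") / n
    # Eq. (3): w_ij = 1 - |F(l_i) - F(l_j)| for i in S, j = 1..n; row-normalize
    w = 1.0 - np.abs(F[np.asarray(set_idx)][:, None] - F[None, :])
    q = w / w.sum(axis=1, keepdims=True)
    # Eq. (4): aggregate over the set and renormalize
    q_sj = q.sum(axis=0)
    return q_sj / q_sj.sum()

# A randomized set of the same size is then drawn with these probabilities:
rng = np.random.default_rng(0)
lengths = rng.integers(200, 20000, size=500)
p = weighted_set_probs(lengths, set_idx=[0, 1, 2, 3, 4])
rand_set = rng.choice(500, size=5, replace=False, p=p)
```

For a set of short genes, the resulting probabilities place more mass on other short genes than on long ones, as intended.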
Random sets are constructed from a multinomial distribution with probability \(q_{\mathcal {S}j}\) for gene j. To differentiate from \(\mu^{\dagger}\) and \(\sigma^{\dagger}\) in the previous section, we denote the mean and standard deviation of the weighted randomization by \(\mu _{w}^{\dagger }\) and \(\sigma _{w}^{\dagger }\). We use \(\mu _{w}^{\dagger }\) and \(\sigma _{w}^{\dagger }\) to standardize the maxmean statistic of \(\mathcal {S}\),
$$ S_{w}^{*} = \left(S - \mu_{w}^{\dagger}\right)/\sigma_{w}^{\dagger} \, {.} $$
(5)
Permutation maxmean statistics \(S_{w}^{*b}, b = 1, \ldots, B,\) are computed in the same fashion from permuted data sets. The p-value for \(\mathcal {S}\) is calculated by comparing \(S_{w}^{*}\) and \(S_{w}^{*b}\),
$$ p = \sum\limits_{b=1}^{B} I\left\{ S_{w}^{*b} > S_{w}^{*}\right\}/B \, {.} $$
(6)
Justification of weighted restandardization
In this section we present a model for RNA-seq testing that accommodates gene length bias. Under this bias, we show that our weighted method has more power to detect gene sets consisting of short genes.
We begin with the two-group model proposed in [24], where we assume there are n cases (genes) that are each either null or non-null with probabilities \(p_{0}\) and \(p_{1}\), and with z-values having density either \(f_{0}(z)\) or \(f_{1}(z)\). We assume \(f_{0}(z)\) is the standard normal density and \(f_{1}(z)\) is some non-null density. As mentioned in the previous section, RNA-seq tests are not based on traditional z-values, and hence we transform the p-values to z-values via \(z_{i} = \Phi^{-1}(p_{i})\), where Φ(·) is the standard normal CDF. We have,
$$ z_{i} \sim f(z) = p_{0} f_{0}(z) + p_{1} f_{1}(z) \, {,} $$
(7)
where f(z) is the mixture density composed of \(f_{0}(z)\) and \(f_{1}(z)\). The setting in (7) has proven very useful in microarray analysis, e.g. see [25–28]. To obtain more specific results we also assume that the \(z_{i}\)'s are independent.
We propose the following twist on this model for RNA-seq analysis. Assume the genes are ordered according to gene length; then, in light of RNA-seq, we propose the more general model,
$$ z_{i} \sim p_{0} f_{0}(z) + p_{1} f_{1i}(z) \, {.} $$
(8)
In short, we allow each gene to follow a different alternative distribution. This flexibility allows us to characterize a gene length bias.
To accommodate gene length in the model in (8), we assume \(f_{1i}\) has the following form,
$$ f_{1i}(z) = {rf}_{1i}^{-}(z) + (1-r)f_{1i}^{+}(z) \, {,} $$
(9)
where \(f_{1i}^{-}\) and \(f_{1i}^{+}\) are densities with means \(\mu _{1i}^{-} < 0\) and \(\mu _{1i}^{+} > 0\), respectively. To impose the gene length bias we will further require
$$ \mu_{1i}^{+} \geq \mu_{1j}^{+} \text{ and } \mu_{1i}^{-} \leq \mu_{1j}^{-} \quad \forall \quad i>j \, {.} $$
(10)
For our purposes it is not important to determine the parameters in (9) and (10); however, we need the assumption in (10) to justify our method. The specification in (10) captures a gene length bias in which a longer DE gene is, by nature of its length, more likely to have extreme z values than a shorter DE gene. For simplicity in the following calculations we also assume \(f_{1i}\) is symmetric, that is,
$$ f_{1i}(z) = f_{1i}(-z) \quad \forall \quad z \, {.} $$
(11)
The assumption in (11) is plausible since, a priori, it is reasonable to expect a DE gene to be equally likely to be overexpressed as underexpressed between two conditions. We proceed in the following manner: we first derive the asymptotic distribution of the maxmean statistic in (1); we then examine the restandardization procedure and show that our method yields a more accurate restandardization mean for gene sets.
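To make the model in (8)-(11) concrete, the following sketch simulates z-values in which non-null means grow with length rank. The specific choice \(f_{1i} = \frac{1}{2}N(-\mu_{i},1) + \frac{1}{2}N(\mu_{i},1)\) (i.e. r = 1/2) and the linear form of \(\mu_{i}\) are illustrative assumptions that satisfy (9)-(11), not part of the model itself:

```python
import numpy as np

def simulate_z(lengths, p1=0.1, base=1.0, slope=2.0, seed=None):
    """Simulate z-values from the length-biased mixture (8)-(11):
    null genes ~ N(0,1); non-null gene i ~ 0.5 N(-mu_i,1) + 0.5 N(+mu_i,1),
    with mu_i increasing in the length rank of gene i (r = 1/2 gives symmetry)."""
    rng = np.random.default_rng(seed)
    l = np.asarray(lengths, dtype=float)
    rank = np.argsort(np.argsort(l)) / (l.size - 1)  # 0 = shortest, 1 = longest
    mu = base + slope * rank                          # enforces the ordering in (10)
    nonnull = rng.random(l.size) < p1
    sign = rng.choice([-1.0, 1.0], size=l.size)       # symmetric alternative, Eq. (11)
    z = rng.normal(size=l.size)
    z[nonnull] += (sign * mu)[nonnull]
    return z, nonnull
```

Under this simulation, non-null long genes produce systematically larger |z| than non-null short genes, which is exactly the bias the weighted restandardization targets.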
Under (7), by the Lindeberg-Feller theorem [29], if the Lindeberg condition holds then \(S_{+}\) and \(S_{-}\) in (1) asymptotically follow a bivariate normal distribution for adequately large \(n_{\mathcal {S}}\),
$$ \begin{aligned} \left(\begin{array}{l} S_{+} \\ S_{-} \end{array}\right) \sim N_{2}\left\{ \left(\begin{array}{l} \mu_{+} \\ \mu_{-} \end{array}\right) {,} \left(\begin{array}{cc} \sigma_{+}^{2} & \text{Cov}(S_{+},S_{-}) \\ \text{Cov}(S_{+},S_{-}) & \sigma_{-}^{2} \end{array}\right) \right\} \, {,} \end{aligned} $$
(12)
where \(\mu_{+} = E(S_{+})\) and \(\sigma _{+}^{2} = V(S_{+})\), and similarly for \(S_{-}\). Note that under (8),
$$ \begin{aligned} \mu_{+} & = p_{0} \int_{0}^{\infty} z\phi(z)dz + \frac{p_{1}}{n_{\mathcal{S}}} \sum_{i \in \mathcal{S}} \int_{0}^{\infty} {zf}_{1i}(z) dz \\ & = p_{0}\sqrt{\frac{2}{\pi}} + \frac{p_{1}}{n_{\mathcal{S}}} \sum_{i \in \mathcal{S}} \int_{0}^{\infty} {zf}_{1i}(z) dz \, {,} \\ \end{aligned} $$
(13)
and, due to the symmetry of \(f_{1i}\), \(\mu_{-} = \mu_{+}\).
For completeness the other quantities in (12) are,
$$ \begin{aligned} \sigma_{+}^{2} = \sigma_{-}^{2} & = \left(\frac{1}{n_{\mathcal{S}}}\right)^{2} \sum_{i \in \mathcal{S}} V\left(z_{i}^{+}\right) \\ & = \left(\frac{1}{n_{\mathcal{S}}}\right)^{2} \sum_{i \in \mathcal{S}} \left[E \left(\left(z_{i}^{+}\right)^{2}\right) - \left(E \left(z_{i}^{+}\right)\right)^{2}\right] \\ & = \left(\frac{1}{n_{\mathcal{S}}}\right)^{2} \sum_{i \in \mathcal{S}} \left[\left({\vphantom{\sqrt{\frac{2}{\pi}}}} \int_{0}^{\infty} \frac{\sqrt{z}}{2}\left(p_{0}\phi (\sqrt{z}) + p_{1} f_{1i} (\sqrt{z}) \right)dz \right) \right. \\ & \quad \left. - \left(p_{0} \sqrt{\frac{2}{\pi}} + p_{1} \int_{0}^{\infty} {zf}_{1i}(z)dz \right)^{2}\right] \end{aligned} $$
(14)
$$ \begin{aligned} \text{Cov}(S_{+},S_{-}) &= E(S_{+}S_{-}) - E(S_{+})E(S_{-}) \\ &= -\left(\frac{1}{n_{\mathcal{S}}}\right)^{2} E \left[\left(\sum_{i \in \mathcal{S}} z_{i} I\left(z_{i}>0 \right)\right) \left(\sum_{i \in \mathcal{S}} z_{i} I\left(z_{i}<0 \right)\right)\right] - (\mu_{+})^{2}\\ &= -\left(\frac{1}{n_{\mathcal{S}}}\right)^{2} \left[\sum_{i \neq j \in \mathcal{S}} E\left(z_{i} z_{j} I\left(z_{i}>0 \right) I\left(z_{j}<0 \right)\right)\right] -\left(\mu_{+}\right)^{2}\\ &= -\left(\frac{1}{n_{\mathcal{S}}}\right)^{2}\left[\sum_{i \neq j \in \mathcal{S}} E\left(z_{i} I\left(z_{i}>0\right)\right) E\left(z_{j} I\left(z_{j}<0\right)\right)\right] -\left(\mu_{+}\right)^{2}\\ &= -\left(\frac{1}{n_{\mathcal{S}}}\right)^{2} \left[\sum_{i \neq j \in \mathcal{S}} \left(p_{0}\sqrt{2/\pi} +p_{1}\int_{0}^{\infty}{zf}_{1i}(z)dz\right)\right. \\ & \quad \times \left. \left(p_{0}\sqrt{2/\pi} +p_{1}\int_{-\infty}^{0}{zf}_{1j}(z)dz\right)\!{\vphantom{\sum_{i \neq j \in \mathcal{S}}}}\right] - \left(\mu_{+}\right)^{2} \end{aligned} $$
(15)
From the work in [30, 31] on order statistics, the pdf of S is derived as,
$$ \begin{aligned} f_{S}(s) & = \frac{1}{\sigma_{+}} \phi\left(\frac{-s+\mu_{+}}{\sigma_{+}} \right) \times \Phi\left(\frac{\eta \left(-s+\mu_{+} \right)}{\sigma_{+}\sqrt{1-\eta^{2}}} - \frac{-s+\mu_{-}}{\sigma_{-}\sqrt{1-\eta^{2}}}\right) \\ & \quad + \frac{1}{\sigma_{-}} \phi\left(\frac{-s+\mu_{-}}{\sigma_{-}}\right) \times \Phi\left(\frac{\eta(-s+\mu_{-})}{\sigma_{-}\sqrt{1-\eta^{2}}} - \frac{-s+\mu_{+}}{\sigma_{+}\sqrt{1-\eta^{2}}}\right) \end{aligned} $$
(16)
where η is the correlation of \(S_{+}\) and \(S_{-}\) and ϕ is the pdf of the standard normal distribution. The mean of S (computed via moment generating functions as in [32]) is
$$ E(S) = \mu_{+} \Phi\left(\frac{\mu_{+}-\mu_{-}}{\theta}\right) + \mu_{-} \Phi\left(\frac{\mu_{-}-\mu_{+}}{\theta}\right) + \theta \phi\left(\frac{\mu_{-}-\mu_{+}}{\theta}\right), $$
(17)
where \(\theta = \left (\sigma _{+}^{2} - 2\eta\sigma_{+}\sigma_{-} +\sigma _{-}^{2}\right)^{1/2}\) and \(\sigma _{+}^{2}\), \(\sigma _{-}^{2}\) and η are given in (14) and (15). Hence, we can simplify (17) such that
$$ \begin{aligned} E(S) = & \frac{1}{2} (\mu_{+}) + \frac{1}{2} (\mu_{-}) + \theta \phi(0),\\ = & \mu_{+} + 0.40 \theta \\ = & p_{0}\sqrt{\frac{2}{\pi}} + \frac{p_{1}}{n_{\mathcal{S}}} \sum_{i \in \mathcal{S}} \int_{0}^{\infty} {zf}_{1i}(z) dz +0.40\theta \, {,} \end{aligned} $$
(18)
where each quantity in (18) has been previously derived.
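As a sanity check on (17)-(18), the identity \(E(S) = \mu_{+} + \phi(0)\,\theta\) for \(\mu_{+} = \mu_{-}\) can be verified by Monte Carlo for a bivariate normal pair, taking θ as the standard deviation of \(S_{+} - S_{-}\), i.e. \(\theta^{2} = \sigma_{+}^{2} - 2\eta\sigma_{+}\sigma_{-} + \sigma_{-}^{2}\); the parameter values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, s_p, s_m, eta = 0.8, 0.3, 0.3, -0.4           # illustrative parameters
cov = eta * s_p * s_m
draws = rng.multivariate_normal([mu, mu], [[s_p**2, cov], [cov, s_m**2]],
                                size=1_000_000)
theta = np.sqrt(s_p**2 - 2.0 * cov + s_m**2)      # sd of S_+ - S_-
closed_form = mu + theta / np.sqrt(2.0 * np.pi)   # mu_+ + 0.40 * theta, Eq. (18)
mc_mean = draws.max(axis=1).mean()                # Monte Carlo estimate of E(S)
```

The Monte Carlo mean of the maximum agrees with the closed form to within sampling error.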
We would like S to be large for DE sets and small for null sets. Given (18) we see that, on average, S will be large for a set of long genes in the presence of a strong gene length bias, where \(\int _{0}^{\infty } {zf}_{1i}(z)dz\) increases with the length of gene i. This event does not represent true biological gene set enrichment, and thus detecting this set would be a false positive. On the other hand, S could also be large if \(p_{1}\) is large for the genes in \(\mathcal {S}\) relative to \(p_{1}\) for the genes not in \(\mathcal {S}\). This situation does reflect true biological enrichment, and therefore we would like our procedure to detect significance in this setting.
In light of the magnitude of S and the randomized distribution it is compared against, the problem is the following: consider a biologically enriched gene set \(\mathcal {S}\) consisting primarily of short genes in the presence of a strong gene length bias. Although \(p_{1}\) is large for \(\mathcal {S}\), the set will not be detected because \(1/n_{\mathcal {S}} \sum _{i \in \mathcal {S}} \int _{0}^{\infty } {zf}_{1i}(z) dz\) is relatively large for randomly assembled gene sets \(\mathcal {S}^{\prime }\). Likewise, for null sets with long genes, \(\int _{0}^{\infty } {zf}_{1i}(z)dz\) will be large relative to other random sets, and we would like a scheme with a low probability of calling such a null set significant. Our weighted randomization scheme is designed to increase (relative to the unweighted procedure) the probability of detecting the truly enriched gene set while decreasing the probability of detecting the unenriched gene sets.
Recall that the algorithm compares the maxmean statistic against a distribution of permuted maxmean statistics that are centered and scaled according to the mean (\(\mu^{\dagger}\)) and standard deviation (\(\sigma^{\dagger}\)) of maxmean statistics of sets chosen at random. Hence, in order for \(\mathcal {S}\) to be called enriched, S must be significantly different from both the permuted and the random maxmean values. In light of the scheme above, we would therefore like \(\mu^{\dagger}\) to be small for truly enriched sets and large for truly null sets. Heuristically, the main difference between the two methods is in the computation of \(\mu _{w}^{\dagger }\) and \(\mu^{\dagger}\). Under the weighted resampling scheme, with K resamplings, we have
$$ \mu_{w}^{\dagger} = \frac{1}{K} \sum\limits_{k = 1}^{K} S_{w}^{\prime k} \, {.} $$
(19)
Hence, the difference between \(\mu _{w}^{\dagger }\) and \(\mu^{\dagger}\) can be studied by examining \(E(S_{w}^{\prime k})\) against \(E(S^{\prime k})\) from the unweighted approach.
From (18) we have
$$ E(S_{w}) = \mu_{+w} + 0.40 \theta_{w} $$
(20)
while for the unweighted resampling we have,
$$ E(S) = \mu_{+} + 0.40 \theta $$
(21)
We expect the difference between the second-order effects \(\theta_{w}\) and θ to be minor relative to the difference between the first-order effects \(\mu_{+w}\) and \(\mu_{+}\).
From (13) with the weighted scenario we have
$$ \mu_{+w} = p_{0} \sqrt{\frac{2}{\pi}} + \frac{p_{1}}{n_{\mathcal{S}^{\prime}_{w}}} \sum_{i \in \mathcal{S}^{\prime}_{w}} \int_{0}^{\infty} {zf}_{1i}(z) dz{,} $$
(22)
where the weighted scenario is based on choosing genes in \(\mathcal {S}^{\prime }_{w}\). The unweighted scenario is equivalent except that the genes are drawn from \(\mathcal {S}^{\prime }\). By design, the genes in \(\mathcal {S}^{\prime }_{w}\) are chosen to have lengths similar to the genes in \(\mathcal {S}\), while the genes in \(\mathcal {S}^{\prime }\) are chosen with equal probability. Hence, if i<j then in the presence of a gene length bias as in (9) and (10), we have \(\int _{0}^{\infty } {zf}_{1i}(z) dz < \int _{0}^{\infty } {zf}_{1j}(z) dz\). Therefore, if the weighted process is biased toward choosing genes of smaller length relative to the unweighted process, we are likely to obtain \(\mu_{+w} < \mu_{+}\). Hence, on average \(S^{\prime }_{w}\) will be smaller than \(S^{\prime}\) in the unweighted scenario, and thus the p-values for such sets in the weighted approach will be smaller.
This feature produces the following effect: for enriched gene sets consisting primarily of short genes, the weighted restandardization procedure is more powerful than the unweighted procedure, although by the same mechanism the weighted method may also increase the type I error. On the surface this does not seem like an appealing trade-off. However, we show in several simulations that our method increases power while still keeping the type I error below the nominal level.