Gene set analysis controlling for length bias in RNA-seq experiments

Background In gene set analysis, the researchers are interested in determining the gene sets that are significantly correlated with an outcome, e.g. disease status or treatment. With the rapid development of high throughput sequencing technologies, Ribonucleic acid sequencing (RNA-seq) has become an important alternative to traditional expression arrays in gene expression studies. Challenges exist in adopting the existent algorithms to RNA-seq data given the intrinsic difference of the technologies and data. In RNA-seq experiments, the measure of gene expression is correlated with gene length. This inherent correlation may cause bias in gene set analysis. Results We develop SeqGSA, a new method for gene set analysis with length bias adjustment for RNA-seq data. It extends from the R package GSA designed for microarrays. Our method compares the gene set maxmean statistic against permutations, while also taking into account of the statistics of the other gene sets. To adjust for the gene length bias, we implement a flexible weighted sampling scheme in the restandardization step of our algorithm. We show our method improves the power of identifying significant gene sets that are affected by the length bias. We also show that our method maintains the type I error comparing with another representative method for gene set enrichment test. Conclusions SeqGSA is a promising tool for testing significant gene pathways with RNA-seq data while adjusting for inherent gene length effect. It enhances the power to detect gene sets affected by the bias and maintains type I error under various situations.

We use the lymph node carcinoma of the prostate (LNCaP) cells RNA-seq data set [4] as an example to show the gene length bias. The data set contains the RNA sequencing profiles of 3 androgen-treated samples and 4 control samples. For every gene we calculate the average count of the 7 samples. Figure 1 shows the smooth scatterplot of gene length vs average count on a log-log scale. A smooth spline is fit to the scatterplot, suggesting a strong positive correlation between counts and gene length. The dependence of counts and gene length introduces a bias in the test of DE genes such that discoveries will favor long genes over short genes. We rank the genes by their length and divide them into 10 groups, with each group containing 10% of the total genes. We use the exact test in the R package edgeR to obtain the p-value (unadjusted) for every gene. Figure 2 shows the percentage of DE genes (p-value < 0.05) in each group increases with gene length. Given the definition in (1), a set consisting of primarily long genes will have a greater maxmean statistic than an equally expressed set of short genes. As a result, the test will have bias that favors sets of long genes.
There are a number of well-established gene set analysis methods for expression arrays. Most of these tests can be roughly classified into two groups, over-representation analysis (ORA) and functional class scoring (FCS) [5]. In ORA, the genes are labeled as differentially expressed (DE) or null based on thresholding their test statistics or p-values. Then we test the gene sets for over-representing the DE genes. The most common tests Fig. 1 Correlation of count and gene length. The scatterplot of gene length vs average counts (log-log scale) for genes in the LNCaP data set [4]. Darker color indicates higher scatter density. The red curve is a smooth spline with 7 degrees of freedom, showing a positive correlation between count and gene length  [4] shows the probability of significant p-values (p < 0.05) increases as gene length. Group 1 is the genes of shortest length and Group 10 is the genes with longest length are based on hypergeometric, Chi-square or binomial distribution [5]. Despite the wide usage, ORA approaches have a few limitations. First, some information is discarded in ORA as it ignores the continuous measurement of the test statistics and treats the data as binary outcomes (DE or null). Second, the test for over-representation assumes the genes are independent of each other, which is unlikely to hold given the complex interactions between genes, especially for genes in the same pathway.
The FSC approaches overcome the aforementioned limitations of ORA. First the test statistics are computed for individual genes, e.g. correlation [6], ANOVA [7], Q-statistic [8], signal-to-noise ratio [9], t-statistic [7,10], z-statistic [11]. Then these gene-level statistics are aggregated into a pathway-level summary statistic e.g. Kolmogorov-Smirnov statistic [9,12], sum, mean or median [13], the Wilcoxon rank sum [14] and the maxmean statistic [15]. By keeping the gene-level statistics, FSC can detect weak but coordinated changes in gene pathways. In assessing the significance of the pathway statistic, the hypothesis test can be classified into two major categories, self-contained or competitive. A self-contained test permutes sample labels and compares the pathway with its permutations, while a competitive test randomizes genes for each pathway, and compares the pathway with others. The self-contained test maintains the correlation structure between genes, but ignores the other pathways. On the other hand, the competitive test takes into account the other pathways but ignores the gene correlation.
Several gene set/pathway analysis methods have been developed for RNA-seq data, e.g. GOSeq [16], GSVA [17], SeqGSEA [18], CAMERA [19]. GOSeq employs a revised hypergeometric test for DE gene enrichment with the sampling probability adjusted to gene length. GSVA estimates variation of pathway activity over a sample population and it is able to detect subtle changes of gene expression in the pathway. SeqGSEA adopts DSGseq [20] and DESeq [21] for gene level test and uses the GSEA method [9] for functional enrichment analysis. CAMERA [19] estimates the inter-gene correlation and accordingly adjusts the gene set test statistic.
We propose a new functional scoring method SeqGSA for gene set analysis with RNA-seq data. Our method adopts the maxmean statistic in R package GSA [15] as the gene-set-level statistic. Comparing with the more commonly used method GSEA [9], it is argued that the maxmean statistic used in GSA is more powerful than the Kolmogorov-Smirnov statistic in GSEA [15]. The fundamental idea of GSA is that it combines the features of both self-contained test and competitive test through a restandardization procedure. As the original GSA, our method maintains the correlation of genes in permutation tests while also taking into account the competition of other random gene sets. We implement a flexible weighted restandardization scheme to adjust for the gene length bias. It works with any gene-level DE test e.g. [21][22][23]. We show in the presence of a length bias our method improves the power of identifying significant gene sets and controls type I errors. We also compare our method with GOSeq [16] and CAMERA [19], two representative methods for gene set enrichment test. We show that the maxmean statistic is more appropriate in detecting small changes of expression. Our method also maintains the type I error more accurately when genes are correlated.

Maxmean statistic and restandardization in GSA
In GSA the gene-level test statistics are first converted to z statistics using quantile functions, and then the z values are aggregated into a gene-set-level maxmean statistic. A restandardization procedure compares the maxmean statistic against permutations while also taking into account sets formed by random selections of genes. Given gene-level z statistic z i , i = 1, . . . , n, let S be the indices of the gene set and n S be the size of S. The maxmean statistic S is defined as, Assessing statistical significance of the maxmean statistic requires a null distribution. There are two operations considered to obtain such null distribution, randomization or permutation. For randomization, the null is constructed by randomly sampling a large number of gene sets (row sampling). The problem of randomization is that random sets do not maintain the correlation structure among genes as in real sets. For permutation, the null is estimated by column (sample labels) permutations so that genes still remain in the same set. It maintains the correlation structure. The problem of permutation is that it ignores the distribution of the maxmean statistics of the other gene sets. The restandardization procedure in GSA combines randomization and permutation. Given z statistic z i (i = 1, . . . , n) and gene set S with n S genes, the algorithm works as follows, 1) Randomization: calculate the mean and standard deviation of the maxmean statistics of randomized gene sets, denoted by μ † and σ † . 2) Standardization: compute standardized maxmean S * , S * = (S − μ † )/σ † . 3) Permutation: permute sample labels B times and compute the standardized maxmean for each permuted data as in 1) and 2), To assess how significant a gene set is associated with the outcome, the p-value is calculated by comparing S * against the permutations, p = B b=1 I{S * b > S * }/B. With restandardization, the null distribution for S is essentially the permuted maxmean statistic rescaled by σ † and relocated by μ † . Therefore the maxmean statistic is compared against the permutations while also considering the distribution of the randomized sets.

Restandardization weighted by gene length
In the randomization step of GSA, all genes are sampled with equal probability. In particular, two sets with the same number of genes will have the same μ † and σ † regardless of the length of genes in the set. As the method was originally developed for microarrays, it does not consider the potential bias in RNA-seq, as shown in the introduction. To adjust for gene length bias we propose a weighted restandardization algorithm. The weighted algorithm uses the similarity of gene length as sampling weight in the randomization step of the restandardization. The empirical cumulative distribution function (CDF) of the gene length is: where l i is the number of base pairs of gene i and n is the total number of genes. Note l i for gene length can be retrieved from public databases e.g. Ensembl (http://www.ensembl. org/) and refSeq (http://www.ncbi.nlm.nih.gov/refseq/). For a set S, the process of constructing randomized set can be viewed as stepwise replacing genes in S with genes from all the genes. Instead of assigning equal probability to all genes, we let q ij be the probability of replacing gene i in S by gene j, which is weighted by 1 − |F(l i ) −F(l j )|, The probability q ij is large when gene i and gene j have similar length and it is small when their lengths are very different. For a set S, let q Sj be the probability of selecting gene j (j = 1, 2, . . . , n) into a randomized set. Since n S << n in practice, the following approximation can be made, Random sets are constructed from a multinomial distribution with probability q Sj for gene j. To differentiate from μ † and σ † in the previous section, we denote the mean and standard deviation of weighted randomization by μ † w and σ † w . We use μ † w and σ † w to standardize the maxmean statistic of S, Permutation maxmean statistics S * b w , b = 1, . . . , B, are computed in the same fashion from permuted data sets. The p-value for S is calculated by comparing S * w and S * b w ,

Justification of weighted restandardization
In this section we present a model for RNA-seq testing that accommodates a gene length bias. Under the gene length bias, we show our weighted method has more power to detect gene sets consisting of short length genes. We begin with the two group model proposed in [24] where we assume there are n cases (genes) that are each either null or non-null with probability p 0 and p 1 and with zvalues having density either f 0 (z) or f 1 (z). We assume f 0 (z) is the standard normal density and f 1 (z) is some no-null density. As mentioned in previous section, RNA-seq tests are not based on traditional z-values and hence we transform the p-values to z-values, via where f (z) is the mixture density composed of f 0 (z) and f 1 (z). The setting in (7) has proven very useful in microarray analysis, e.g. see [25][26][27][28]. To obtain more specific results we also assert the assumption of independence that z i 's are independent.
We propose the following twist on this model for RNA-seq analysis. Assume genes are ordered according to the gene length, then in light of RNA-seq, we propose the following more general model, In short, we allow each gene to follow a different alternative distribution. This flexibility allows us to characterize a gene length bias.
To accommodate the gene length into the model in (8), we assume f 1i has the following form, where f − 1i and f + 1i are densities with means μ − 1i < 0 and μ + 1i > 0, respectively. To impose the gene length bias we will further require For our purposes it is not important to determine the parameters in (9) and (10), however, we will need the assumption in (10) to justify our method. The specification in (10) allows us to handle a gene length bias where, by nature of a gene's length, it is more likely to have extreme z values than a DE gene of shorter length. For simplicity in the following calculations we also assume f 1i is symmetric, that is, The assumption in (11) is plausible, since a priori, it is reasonable to expect a DE gene to be equally likely to be overexpressed as it is to be underexpressed between two conditions. We proceed in the following manner, we derive the asymptotic distribution of the maxmean statistic in (1). We then examine the restandardization procedure and show that our method yields a more accurate restandardization mean for gene sets.
Under (7), by the Lindeberg-Feller theorem [29] if the Lindeberg condition holds, S + and S − in (1) asymptotically follow a bivariate normal distribution for adequately large n S , where μ + = E(S + ) and σ 2 + = V (S + ) and similarly for S − . Note that under (8) and due to the symmetry of f 1i , μ − = μ + . For completeness the other quantities in (12) are, Cov From the work in [30,31] on order statistics, the pdf of S is derived as, where η is the correlation of S + and S − , φ is the pdf for the standard normal distribution. The mean of S (computed via moment generating functions as in [32]) is where θ = σ 2 + − 2η + σ 2 − 1/2 and σ 2 + , σ 2 − and η are given in (14) and (15). Hence, we can simplify (17) such that where each quantity in (18) has been previously derived. We would like S to be large for DE sets and small for null sets. Given (18) we see that, on average, S will be large for a set of long genes in the presence of a strong gene length bias where ∞ 0 zf 1i (z)dz increases with the length of gene i. This event does not represent a true biological gene set enrichment and thus we would consider detecting this set a false positive. On the other hand, S could also be large if p 1 is large for the genes in S relative to p 1 for the genes not in S. This situation does reflect true biological enrichment and therefore we would like our procedure to detect significance in this setting.
So in light of the magnitude of S and its randomized distribution for comparison here is the problem: consider a biologically enriched gene set S consisting of primarily short genes in the presence of a strong gene length bias. While p 1 is large for S it will not be detected because 1/n S i∈S ∞ 0 zf 1i (z)dz is relatively large for randomly assembled gene sets S . Likewise, for null sets with long genes ∞ 0 zf 1i (z)dz will be large relative to other random sets and we would like a scheme with a low probability of calling this null set significant. Our weighted randomization scheme is designed to increase (relative to the unweighted procedure) the probability of detecting the truly enriched gene set while decreasing the probability of detecting the unenriched gene sets.
Recall the algorithm compares the maxmean statistic against a distribution of permuted maxmean statistics that are centered and scaled according to the mean (μ † ) and variance (σ † ) of maxmean statistics chosen at random. Hence in order for S to be enriched, S must be significantly different than permuted maxmean values and random maxmean values. Hence, in light of the scheme above, we would like μ † to be small for truly differentially enriched sets and large for truly null sets. Heuristically the main difference between the two methods is in the computation of μ † w and μ † . Under the weighted resampling scheme, with K resamplings, we have Hence, the difference between μ † w and μ † can be studied by examining E(S k w ) against E(S k ) from the unweighted approach.
From (18) we have where for the unweighted resampling we have, We expect the difference between the second order effects θ w and θ will be minor relative to the first order effects (μ +w , μ + ) between the two methods. From (13) with the weighted scenario we have where the weighted scenario is based on choosing genes in S w . The unweighted scenario is equivalent except the genes are from S . By design, the genes in S w are chosen to have a similar length to the genes in S, while the genes in S are chosen with equal probability. Hence if i < j then in the presence of a gene length bias in (9) and (10), we have Therefore, if the weighted process is biased to choose genes of smaller length relative to the unweighted process, we are likely to obtain μ +w < μ + . Hence, on average S w will be smaller relative to the unweighted scenario of S and thus the p-values in the weighted approach will be larger.
This feature produces the following effect: for enriched gene sets consisting primarily of short length genes, the weighted restandardization procedure will be more powerful than the unweighted procedure, while in the same way, the weighted method may also increase the type I error. On the surface this does not seem like an appealing solution. However, we will show in several simulations, our method increase the power while still maintaining the type I error under the controlled level.

Simulation
To assess our method, we compare the type I error and power of the two versions of SeqGSA (weighted restandardization and the unweighted restandardization). The unweighted version is similar to the original GSA, except that the t test is replaced with the exact negative binomial test of edgeR, as the latter is considered a more appropriate test for RNA-seq count data. In addition, we compare SeqGSA with two gene set test methods, GOSeq [16] and CAMERA [19]. GOSeq tests for higher proportion of DE genes in a set using a modified hypergeometric test, with adjustment to gene length bias. CAMERA compares the t statistics of genes inside the set and genes outside the set with adjustment to the intra-correlation of the gene sets. It is embedded in the widely used limma package [33]. We conduct two simulations. In simulation 1, the count data is generated from a Poisson model in which the mean parameter is associated with gene length. The gene sets are constructed without consideration to intra-correlation. In simulation 2, we use the data of an RNA-seq experiment. The gene sets are constructed such that genes in a set are correlated. In both simulations, we let the size of gene set n S = 50. For GOSeq, the enrichment test requires a threshold of the exact test p-values for DE genes. For our simulations, the threshold is determined by controlling false discovery rate at 0.1 with the Benjamini-Hochberg procedure [34].

Simulation 1: sets of uncorrelated genes
We randomly sample n = 1000 genes from the LNCaP data set [4] with gene length between 1000 and 3000 base pairs. Let l i (i = 1, 2, . . . , 1000)  We first compare the type I error. Let n de = 2 be the number of up-and down-regulated genes for all 20 sets. Thus all sets are considered as null. Table 1 summarizes the type I error (p-value < 0.05) in selected sets. Results shows the type I error is under control for all sets. In particular, under small DE effect (β = 0.15), comparing with the unweighted method, weighted SeqGSA reduces the type I error in sets of long genes (set 15 and set 20), but increases the error in sets of short genes (set 1). GOSeq has higher type I error for set 1 but it is very conservative for all other sets. Under large DE effect (β = 0.3), all methods are conservative in terms of type I error control and the difference is small. Similarly, CAMERA is very conservative under all settings.
Next we compare the power of identifying DE sets that are undermined by the length bias. We let n de be a greater number (n de = 6, 8, 10) in set 1, thus set 1 becomes the DE set. An ideal test should have small p-values for set 1 and non-significant p-values for the others. Table 2 shows that the weighted algorithm increases the power of detecting set 1. In particular, SeqGSA (both weighted and unweighted algorithms) is more powerful under small DE effect (β = 0.15), while GOSeq is more powerful under more significant change of gene expression (β = 0.3). On the other hand, CAMERA was unable to detect such DE effect, indicating the test statistic of CAMERA is insensitive to small changes of expression under independence.
In addition we compare the power under no length bias, i.e. genes are grouped randomly thus no length bias exists among the gene sets. In such case, the weighted algorithm does not have advantage over the unweighted version. There is a slight compromise in the power, but the difference is negligible.

Simulation 2: sets of correlated genes
In simulation 2, we use a public data set with simulated gene sets that contain length bias. The data is a subset of the human prostate cancer data [35] downloaded from the European Bioinformatics Institute (EBI) server (http://www.ebi.ac.uk). There are 11 normal and 12 tumor samples in the subset data. The reads were aligned to the human genome reference sequence (version b37 from http://www.1000genomes.org) by TopHat v2.0.8 [36] and Bowtie v2.1.0 [37]. Gene length is retrieved with the getlength function in GOSeq.
Genes with average count less than 1 or with unretrievable length are filtered out. The 23396 remaining genes are divided into four groups by length with each group containing 25% genes. We use edgeR to calculate the z values and label gene i as DE for |z i | > −1 (0.975), where −1 is the quantile function of standard normal distribution.
The standard deviations of the z values from group 1 to group 4 are 1.76, 2.02, 2.09, 2.08 respectively, showing the scale of z values of group 1 is significantly smaller than others. As a result, the proportion of DE genes (|z| > −1 (0.975)) of the group 1 is lower than the others (26.7% vs 31.1%, 33.6%, 32.9%). Gene sets from group 1 will be affected by the length bias.
Different from simulation 1, the sets in simulation 2 are generated with intracorrelation. We first compare the type I error under correlation. A null gene set is constructed as follows. First a null gene is sampled from the null genes in group 1 as the hub gene [38]. The simulated gene set contains n S genes that are sampled from group 1 that have high correlation with the null hub gene (correlation coefficient > 0.4). The correlation violates the independence assumption of GOSeq, resulting in an inflated type I error (0.105). On the other hand, the gene correlation is taken into account in the permutation tests of SeqGSA, thus the type I errors are under control for both the weighted algorithm (0.014) and the unweighted algorithm (0.011).
Due to inflated type I error of GOSeq for correlated gene sets, we only compare SeqGSA to CAMERA. A DE hub gene is sampled from DE genes in group 1. The simulated gene set contains a high proportion of DE genes (40 to 70%) that are sampled from DE genes in group 1 and have high correlation with the DE hub gene (correlation coefficient > 0.4). The other genes in the set are sampled from the null genes in group 1. The weighted method improves the power for identifying the simulated DE sets (Table 3). Both the SeqGSA methods have higher power than CAMERA, suggesting the maxmean statistic is more sensitive to subtle change of expression on the gene-set level.

Simulation 3: power on sets of long genes
As the weighted method samples more often from genes with similar length, we expect it should also result in a decrease of the power on sets with long genes. In simulation 3 we evaluate how significant the effect is. The same strategy of simulation 2 on was run on group 3 and 4 for the long genes. Results show although there is a slight decrease in the power for group 3 and 4 overall, the decrease is very minimal compared with the increase on the short gene sets in simulation 2 (Table 4). Therefore the weighted method should increase the overall power of all gene sets.

Application to RNA-seq data set
In this section we first assess the effectiveness of gene length adjustment of SeqGSA with true biological gene sets. Again we use the LNCaP data set [4]. The genes are mapped to the GO categories [39] via R package biomaRt [40] on ENSEMBL (www.ensembl.org) homo sapiens database. The mapped GO categories are filtered by their size. The 1585 categories with 20 to 200 mapped genes are used for the analysis. Small categories are filtered out to prevent outliers in gene length for small gene sets. Large categories are filtered out to avoid slow computation. These categories are grouped into 25 bins of approximately equal size by their median gene length. Figure 3 compares the correlation between the proportion of DE categories and the median length of the bin. A straight line is fit with the slope and its p-value shown on the figure. For the unweighted method the correlation is on the borderline of significance (p = 0.068) while for weighted method the correlation is not significant (p = 0.117).
Next we apply the methods to another public RNA-seq data set. The data set is from the Center for the Study of Human Polymorphisms (CEPH) HapMap samples [41]. It contains 17 female and 24 male human B-cell samples. Genes with average count less than 1 or with unretrievable gene length by the getlength function in GOSeq are filtered out. We map the genes to the 229 KEGG pathways via the getgo function in GOSeq. Only 2549 genes can be mapped to the KEGG pathways. We further filter out pathways with 5 or fewer mapped genes. We run SeqGSA and GOSeq to test DE between male and female group on the remaining 208 pathways. Controlling FDR at 0.1, the weighted SeqGSA identifies the hsa04810 pathway, regulation of actin cytoskeleton. The pathway contains 214 genes, 94 of which are mapped to the data set. It is associated with multiple genetic diseases  including sex-linked disorders as the non-syndromic X-linked mental retardation. Studies have shown the pathway is related to sex steroids regulation of cell morphology and tissue organization that may play an important role in gender-specific differences of brain dysfunction [42,43]. Figure 4 is the heatmap of the hsa04810 pathway for the 41 samples. The genes are ordered by their z-values. The significance of the pathway is driven by gene TMSB4Y (Ensemble gene ID ENSG00000154620) that lies on the forward strand of chromosome Y. Its homolog on chromosome X escapes X inactivation and encodes an actin sequestering protein (provided by RefSeq, Jul 2008). The TMSB4Y gene is shown in relation to multiple biological activities including actin polymerization and depolymerization in non-muscle cells [44], activation of natural killer cell cytotoxicity [45] and minor histocompatibility antigen encoding [46], etc. The unweighted procedure also identifies the same pathway. Figure 5 compares the standardized maxmean statistic for hsa04810 pathway and its permutation statistics of the two algorithms. The two versions of permutation statistics overlap, while the standardized maxmean S * w (red tick) by the weighted method falls further away from the permutation comparing with the unweighted S * (black tick), suggesting that our weighted method is more powerful in finding the pathway. On the other hand, GOSeq and CAMERA do not find any pathway at FDR = 0.1. This example demonstrates our method is easy to implement and gives favorable results in detection of small but coordinated change in gene sets.

Discussion
We propose SeqGSA, an extension of the GSA method for RNA-seq data. To calculate maxmean statistics from the RNA-seq count data, we replace the t test in the original GSA with a count based test, e.g. the negative binomial test in edgeR. The users can also substitute it with custom defined methods for the gene-level test. More importantly we propose a weighted restandardization approach to accommodate the gene length bias in RNA-seq data. Different from traditional expression arrays, gene length effect generally exists in RNA-seq data sets. To compute the null distribution of the maxmean statistic, Fig. 4 Heatmap of hsa04810 KEGG pathway. The heatmap of the 94 mapped genes in the hsa04810 KEGG pathway for the CEPH HapMap data [41]. The count data is log-transformed and row-scaled GSA employs a randomization strategy that randomly samples genes with equal weights. However, having equal sampling weights ignores the length effect in the data. To adjust for the bias, we use the similarity of gene length as the sampling weight instead, so that the randomized sets will be more likely to consist of genes with similar length. This reduces the bias and yields more accurate null distribution of the maxmean statistics. A DE set comprised mainly of short genes will be more likely called significant in our weighted approach. On the other hand, a null set comprised of long genes will be less significant so that type I error is reduced for these sets. We show by simulations that our method reduces the length bias in RNA-seq data. In addition, we show that the weighted and nonweighted approaches have very similar results when there is no length bias present in the data.
There have been several other methods proposed to remove the length bias in enrichment testing with RNA-seq data [2,3,16]. In those works, the enrichment test employs the modified hypergeometric test, Wilcoxon rank sum test or logistic regression. These methods fall in the category of over-representation analysis (ORA) that compares the set against randomized sets. A fundamental difference of our method is that it works under the permutation scheme while also taking into account gene set randomization. An  [41]. The standardardized maxmean S * w (red) by the weighted method falls on the right of the unweighted S * (black), while the permutation statistics of the two methods overlap advantage of the permutation test is that it considers the correlation in a gene set and the correlation structure is maintained during sample label permutations. We compare our method against GOSeq in two simulations. GOSeq also implements a procedure for length bias adjustment. In GOSeq, the adjustment occurs on the gene level. A spline is fit to the proportion of DE genes and gene length. This spline is used to correct the sampling probability of the hypergeometric test. In our method, we adjust the bias on the gene set level. We recalculate the randomization mean and standard deviation based on weighted sampling so that it compares against gene sets of similar length. In the simulations, we control the level of DE by the parameter β. The value of β is chosen to represent two settings, β = 0.15 for weak DE and β = 0.30 for moderate DE, each representing the strength of the two methods. In simulation 1, the gene sets are sampled independently. We show that our method is more powerful in finding sets of genes with weak changes. In particular, when expression of genes is weakly changed, the test statistics for many DE genes fall below the threshold of multiple test adjustment in GOSeq. As a result, the power of detecting set of weakly DE genes is undermined. On the other hand, our method aggregates the weak signals of individual genes and increases the power of detecting such DE sets. When there is no correlation among the simulated gene sets, the type I errors of all methods are under the controlled level. In simulation 2, we use the data of an RNA-seq experiment and compare the three methods on simulated gene sets. The gene sets are constructed such that the genes are strongly correlated. This simulates the fact that many of the pre-defined gene sets are identified by gene-gene correlation. The hypergeometric test of GOSeq is based on the assumption that genes are independent. As a result of violated assumption, the type I error of GOSeq is inflated. On the other hand, the permutation test in SeqGSA takes into account the correlation structure within a gene set and thus maintains the type I error more accurately. We also compare SeqGSA with CAM-ERA, a competitive gene set test also based on sample permutations with adjustment to gene-set intra-correlation. We show that the maxmean statistic in SeqGSA is more sensitive to subtle but synchronized changes in the gene sets, which has been shown as one of the advantages of the original GSA method [15]. In both simulations, CAMERA has limited power to detect small to moderate changes on the gene-set level.
There are some limitations to our length-weighted method. First there are other sources of bias existent in the RNA-seq data, such as "GC-content". It has been shown that genes with a large number of guanine (G) and cytosine (C) bases are preferentially read by sequencing machines and the effect may not be monotone [47]. A solution to such bias issues is to correct the bias at the gene level by modeling the number of reads or test statistics by gene length or GC content. However this method has its own problem since the true expression level is unknown. Without a comparison experiment, it is difficult to tell whether the difference in reads and test statistics results from true biological expression or biases. Second, the resampling step in SeqGSA weighted by the empirical CDF of gene length is a simple solution, but it does not completely remove the length bias. To improve the performance of the weighted restandardization, we may consider using tunable parameters to adjust the weight. Determining tuning parameters requires further exploration and assumptions. Third, our method improves the power for sets comprised of short genes, but it may slightly compromise the power to detect enriched sets comprised of long genes. This compromise is common for all methods of length bias adjustment. We feel this comprise is scientifically justifiable since the gain in power for small length gene pathways is much larger than the loss of power for large length gene pathways. The effect of this bias on the family-wise error rate or false discovery rate in multiple gene set testing needs further investigation and is part of our future work on this topic. Fourth, in this paper we focus on the exact binomial test in edgeR for gene-level test. However, there are many other tests that can be considered, e.g. see [21,23,[48][49][50][51]. Alternatively, log fold change can be considered as it is not affected by length. This motivates a question for future work on an optimal test statistic for gene set analysis. Last, the computation performance of the weighted algorithm is slightly slower than the unweighted algorithm as it estimate mean and standard deviation using different weights, but the speeds are comparable as the most computationally intense step, computing the permutation z values, is the same for both methods.

Conclusions
We develop a gene set analysis method for RNA-seq data affected by gene length bias. This novel approach is designed to enhance the power to detect DE gene sets comprised of mainly small length genes. Importantly, we justify our method and demonstrate that it controls the type I error comparing to a representative ORA method for RNA-seq. Also, we show that without the presence of a gene length bias, our approach still performs nearly the same as the original unweighted algorithm in GSA. We expect our gene set analysis method will be of great utility to researchers performing gene set analysis with RNA-seq datasets.