Gene set analysis controlling for length bias in RNA-seq experiments
- Xing Ren^{1},
- Qiang Hu^{2},
- Song Liu^{2},
- Jianmin Wang^{2} and
- Jeffrey C. Miecznikowski^{1}Email author
Received: 23 March 2016
Accepted: 11 January 2017
Published: 6 February 2017
Abstract
Background
In gene set analysis, the researchers are interested in determining the gene sets that are significantly correlated with an outcome, e.g. disease status or treatment. With the rapid development of high throughput sequencing technologies, Ribonucleic acid sequencing (RNA-seq) has become an important alternative to traditional expression arrays in gene expression studies. Challenges exist in adopting the existent algorithms to RNA-seq data given the intrinsic difference of the technologies and data. In RNA-seq experiments, the measure of gene expression is correlated with gene length. This inherent correlation may cause bias in gene set analysis.
Results
We develop SeqGSA, a new method for gene set analysis with length bias adjustment for RNA-seq data. It extends from the R package GSA designed for microarrays. Our method compares the gene set maxmean statistic against permutations, while also taking into account of the statistics of the other gene sets. To adjust for the gene length bias, we implement a flexible weighted sampling scheme in the restandardization step of our algorithm. We show our method improves the power of identifying significant gene sets that are affected by the length bias. We also show that our method maintains the type I error comparing with another representative method for gene set enrichment test.
Conclusions
SeqGSA is a promising tool for testing significant gene pathways with RNA-seq data while adjusting for inherent gene length effect. It enhances the power to detect gene sets affected by the bias and maintains type I error under various situations.
Keywords
RNA-seq Gene set analysis Gene length bias Maxmean statistic RestandardizationBackground
Ribonucleic acid sequencing (RNA-seq) is a revolutionary tool for gene expression profiling. It has become an important alternative to traditional expression arrays in varieties of studies. How to adopt the existent algorithms for expression arrays to RNA-seq data is a challenge in data analysis. In microarrays the gene expression is a continuous number, while in RNA-seq it is a non-negative integer indicating the number of reads of a gene. More importantly some inherent biases in RNA-seq experiments need to be accounted for. Given the protocol of RNA-seq, it is reasonable to expect that a longer gene will have more counts than an equally expressed short gene. The length effect will cause bias in gene set analysis [1–3].
There are a number of well-established gene set analysis methods for expression arrays. Most of these tests can be roughly classified into two groups, over-representation analysis (ORA) and functional class scoring (FCS) [5]. In ORA, the genes are labeled as differentially expressed (DE) or null based on thresholding their test statistics or p-values. Then we test the gene sets for over-representing the DE genes. The most common tests are based on hypergeometric, Chi-square or binomial distribution [5]. Despite the wide usage, ORA approaches have a few limitations. First, some information is discarded in ORA as it ignores the continuous measurement of the test statistics and treats the data as binary outcomes (DE or null). Second, the test for over-representation assumes the genes are independent of each other, which is unlikely to hold given the complex interactions between genes, especially for genes in the same pathway.
The FSC approaches overcome the aforementioned limitations of ORA. First the test statistics are computed for individual genes, e.g. correlation [6], ANOVA [7], Q-statistic [8], signal-to-noise ratio [9], t-statistic [7, 10], z-statistic [11]. Then these gene-level statistics are aggregated into a pathway-level summary statistic e.g. Kolmogorov-Smirnov statistic [9, 12], sum, mean or median [13], the Wilcoxon rank sum [14] and the maxmean statistic [15]. By keeping the gene-level statistics, FSC can detect weak but coordinated changes in gene pathways. In assessing the significance of the pathway statistic, the hypothesis test can be classified into two major categories, self-contained or competitive. A self-contained test permutes sample labels and compares the pathway with its permutations, while a competitive test randomizes genes for each pathway, and compares the pathway with others. The self-contained test maintains the correlation structure between genes, but ignores the other pathways. On the other hand, the competitive test takes into account the other pathways but ignores the gene correlation.
Several gene set/pathway analysis methods have been developed for RNA-seq data, e.g. GOSeq [16], GSVA [17], SeqGSEA [18], CAMERA [19]. GOSeq employs a revised hypergeometric test for DE gene enrichment with the sampling probability adjusted to gene length. GSVA estimates variation of pathway activity over a sample population and it is able to detect subtle changes of gene expression in the pathway. SeqGSEA adopts DSGseq [20] and DESeq [21] for gene level test and uses the GSEA method [9] for functional enrichment analysis. CAMERA [19] estimates the inter-gene correlation and accordingly adjusts the gene set test statistic.
We propose a new functional scoring method SeqGSA for gene set analysis with RNA-seq data. Our method adopts the maxmean statistic in R package GSA [15] as the gene-set-level statistic. Comparing with the more commonly used method GSEA [9], it is argued that the maxmean statistic used in GSA is more powerful than the Kolmogorov-Smirnov statistic in GSEA [15]. The fundamental idea of GSA is that it combines the features of both self-contained test and competitive test through a restandardization procedure. As the original GSA, our method maintains the correlation of genes in permutation tests while also taking into account the competition of other random gene sets. We implement a flexible weighted restandardization scheme to adjust for the gene length bias. It works with any gene-level DE test e.g. [21–23]. We show in the presence of a length bias our method improves the power of identifying significant gene sets and controls type I errors. We also compare our method with GOSeq [16] and CAMERA [19], two representative methods for gene set enrichment test. We show that the maxmean statistic is more appropriate in detecting small changes of expression. Our method also maintains the type I error more accurately when genes are correlated.
Methods
Maxmean statistic and restandardization in GSA
- 1)
Randomization: calculate the mean and standard deviation of the maxmean statistics of randomized gene sets, denoted by μ ^{ † } and σ ^{ † }.
- 2)
Standardization: compute standardized maxmean S ^{∗}, S ^{∗}=(S−μ ^{ † })/σ ^{ † }.
- 3)
Permutation: permute sample labels B times and compute the standardized maxmean for each permuted data as in 1) and 2), S ^{∗b },b=1,…,B.
To assess how significant a gene set is associated with the outcome, the p-value is calculated by comparing S ^{∗} against the permutations, \(p = \sum _{b=1}^{B} I\{ S^{*b} > S^{*}\}/B\). With restandardization, the null distribution for S is essentially the permuted maxmean statistic rescaled by σ ^{ † } and relocated by μ ^{ † }. Therefore the maxmean statistic is compared against the permutations while also considering the distribution of the randomized sets.
Restandardization weighted by gene length
where l _{ i } is the number of base pairs of gene i and n is the total number of genes. Note l _{ i } for gene length can be retrieved from public databases e.g. Ensembl (http://www.ensembl.org/) and refSeq (http://www.ncbi.nlm.nih.gov/refseq/).
Justification of weighted restandardization
In this section we present a model for RNA-seq testing that accommodates a gene length bias. Under the gene length bias, we show our weighted method has more power to detect gene sets consisting of short length genes.
where f(z) is the mixture density composed of f _{0}(z) and f _{1}(z). The setting in (7) has proven very useful in microarray analysis, e.g. see [25–28]. To obtain more specific results we also assert the assumption of independence that z _{ i }’s are independent.
In short, we allow each gene to follow a different alternative distribution. This flexibility allows us to characterize a gene length bias.
The assumption in (11) is plausible, since a priori, it is reasonable to expect a DE gene to be equally likely to be overexpressed as it is to be underexpressed between two conditions. We proceed in the following manner, we derive the asymptotic distribution of the maxmean statistic in (1). We then examine the restandardization procedure and show that our method yields a more accurate restandardization mean for gene sets.
and due to the symmetry of f _{1i }, μ _{−}=μ _{+}.
where each quantity in (18) has been previously derived.
We would like S to be large for DE sets and small for null sets. Given (18) we see that, on average, S will be large for a set of long genes in the presence of a strong gene length bias where \(\int _{0}^{\infty } {zf}_{1i}(z)dz\) increases with the length of gene i. This event does not represent a true biological gene set enrichment and thus we would consider detecting this set a false positive. On the other hand, S could also be large if p _{1} is large for the genes in \(\mathcal {S}\) relative to p _{1} for the genes not in \(\mathcal {S}\). This situation does reflect true biological enrichment and therefore we would like our procedure to detect significance in this setting.
So in light of the magnitude of S and its randomized distribution for comparison here is the problem: consider a biologically enriched gene set \(\mathcal {S}\) consisting of primarily short genes in the presence of a strong gene length bias. While p _{1} is large for \(\mathcal {S}\) it will not be detected because \(1/n_{\mathcal {S}} \sum _{i \in \mathcal {S}} \int _{0}^{\infty } {zf}_{1i}(z) dz\) is relatively large for randomly assembled gene sets \(\mathcal {S}^{\prime }\). Likewise, for null sets with long genes \(\int _{0}^{\infty } {zf}_{1i}(z)dz\) will be large relative to other random sets and we would like a scheme with a low probability of calling this null set significant. Our weighted randomization scheme is designed to increase (relative to the unweighted procedure) the probability of detecting the truly enriched gene set while decreasing the probability of detecting the unenriched gene sets.
Hence, the difference between \(\mu _{w}^{\dagger }\) and μ ^{ † } can be studied by examining \(E(S_{w}^{\prime k})\) against E(S ^{′k }) from the unweighted approach.
We expect the difference between the second order effects θ _{ w } and θ will be minor relative to the first order effects (μ _{+w }, μ _{+}) between the two methods.
where the weighted scenario is based on choosing genes in \(\mathcal {S}^{\prime }_{w}\). The unweighted scenario is equivalent except the genes are from \(\mathcal {S}^{\prime }\). By design, the genes in \(\mathcal {S}^{\prime }_{w}\) are chosen to have a similar length to the genes in \(\mathcal {S}\), while the genes in \(\mathcal {S}^{\prime }\) are chosen with equal probability. Hence if i<j then in the presence of a gene length bias in (9) and (10), we have \(\int _{0}^{\infty } {zf}_{1i}(z) dz < \int _{0}^{\infty } {zf}_{1j}(z) dz\). Therefore, if the weighted process is biased to choose genes of smaller length relative to the unweighted process, we are likely to obtain μ _{+w }<μ _{+}. Hence, on average \(S^{\prime }_{w}\) will be smaller relative to the unweighted scenario of S ^{′} and thus the p-values in the weighted approach will be larger.
This feature produces the following effect: for enriched gene sets consisting primarily of short length genes, the weighted restandardization procedure will be more powerful than the unweighted procedure, while in the same way, the weighted method may also increase the type I error. On the surface this does not seem like an appealing solution. However, we will show in several simulations, our method increase the power while still maintaining the type I error under the controlled level.
Results and discussion
Simulation
To assess our method, we compare the type I error and power of the two versions of SeqGSA (weighted restandardization and the unweighted restandardization). The unweighted version is similar to the original GSA, except that the t test is replaced with the exact negative binomial test of edgeR, as the latter is considered a more appropriate test for RNA-seq count data. In addition, we compare SeqGSA with two gene set test methods, GOSeq [16] and CAMERA [19]. GOSeq tests for higher proportion of DE genes in a set using a modified hypergeometric test, with adjustment to gene length bias. CAMERA compares the t statistics of genes inside the set and genes outside the set with adjustment to the intra-correlation of the gene sets. It is embedded in the widely used limma package [33]. We conduct two simulations. In simulation 1, the count data is generated from a Poisson model in which the mean parameter is associated with gene length. The gene sets are constructed without consideration to intra-correlation. In simulation 2, we use the data of an RNA-seq experiment. The gene sets are constructed such that genes in a set are correlated. In both simulations, we let the size of gene set \(n_{\mathcal {S}} = 50\). For GOSeq, the enrichment test requires a threshold of the exact test p-values for DE genes. For our simulations, the threshold is determined by controlling false discovery rate at 0.1 with the Benjamini-Hochberg procedure [34].
Simulation 1: sets of uncorrelated genes
We randomly sample n=1000 genes from the LNCaP data set [4] with gene length between 1000 and 3000 base pairs. Let l _{ i }(i=1,2,…,1000) be the length of gene i. The 1000 genes are grouped into 20 non-overlapping sets according to gene length, with 50 genes in every set. Thus set 1 contains the shortest 50 genes, set 2 contains the next shortest 50, and so on. This would generate length bias in gene set analysis. For each set, we let some number of genes be up-regulated and down-regulated. Let μ _{ i }=l _{ i }/10. In control samples, count x _{ ij }∼P o i s s o n(μ _{ i }). In treatment samples, x _{ ij }∼P o i s s o n((1+β)μ _{ i }) for up-regulated genes, x _{ ij }∼P o i s s o n((1−β)μ _{ i }) for down-regulated genes and x _{ ij }∼P o i s s o n(μ _{ i }) for null genes, where β(β=0.15,0.3) indicates the effect of DE at the gene level.
Type I error of simulation 1
β=0.15 | β=0.3 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
Set 1 | Set 5 | Set 10 | Set 15 | Set 20 | Set 1 | Set 5 | Set 10 | Set 15 | Set 20 | |
Weighted | 0.007 | 0.007 | 0.006 | 0.009 | 0.017 | 0 | 0 | 0 | 0.001 | 0.001 |
Unweighted | 0.002 | 0.007 | 0.008 | 0.014 | 0.022 | 0 | 0 | 0 | 0 | 0.003 |
GOSeq | 0.019 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CAMERA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Power of simulation 1
With length bias | No length bias | ||||||||
---|---|---|---|---|---|---|---|---|---|
β=0.15 | β=0.3 | β=0.15 | |||||||
n _{ de }= | 6 | 8 | 10 | 6 | 8 | 10 | 6 | 8 | 10 |
Weighted | 0.36 | 0.48 | 0.77 | 0.81 | 0.95 | 0.98 | 0.61 | 0.93 | 0.96 |
Unweighted | 0.29 | 0.45 | 0.75 | 0.77 | 0.90 | 0.96 | 0.61 | 0.94 | 0.97 |
GOSeq | 0.23 | 0.38 | 0.56 | 0.97 | 1.0 | 1.0 | 0.96 | 1.0 | 1.0 |
CAMERA | 0.01 | 0 | 0 | 0 | 0 | 0 | 0.02 | 0.01 | 0.02 |
In addition we compare the power under no length bias, i.e. genes are grouped randomly thus no length bias exists among the gene sets. In such case, the weighted algorithm does not have advantage over the unweighted version. There is a slight compromise in the power, but the difference is negligible.
Simulation 2: sets of correlated genes
In simulation 2, we use a public data set with simulated gene sets that contain length bias. The data is a subset of the human prostate cancer data [35] downloaded from the European Bioinformatics Institute (EBI) server (http://www.ebi.ac.uk). There are 11 normal and 12 tumor samples in the subset data. The reads were aligned to the human genome reference sequence (version b37 from http://www.1000genomes.org) by TopHat v2.0.8 [36] and Bowtie v2.1.0 [37]. Gene length is retrieved with the getlength function in GOSeq.
Genes with average count less than 1 or with unretrievable length are filtered out. The 23396 remaining genes are divided into four groups by length with each group containing 25% genes. We use edgeR to calculate the z values and label gene i as DE for |z _{ i }|>Φ ^{−1}(0.975), where Φ ^{−1} is the quantile function of standard normal distribution. The standard deviations of the z values from group 1 to group 4 are 1.76, 2.02, 2.09, 2.08 respectively, showing the scale of z values of group 1 is significantly smaller than others. As a result, the proportion of DE genes (|z|>Φ ^{−1}(0.975)) of the group 1 is lower than the others (26.7% vs 31.1%, 33.6%, 32.9%). Gene sets from group 1 will be affected by the length bias.
Different from simulation 1, the sets in simulation 2 are generated with intra-correlation. We first compare the type I error under correlation. A null gene set is constructed as follows. First a null gene is sampled from the null genes in group 1 as the hub gene [38]. The simulated gene set contains \(n_{\mathcal {S}}\) genes that are sampled from group 1 that have high correlation with the null hub gene (correlation coefficient >0.4). The correlation violates the independence assumption of GOSeq, resulting in an inflated type I error (0.105). On the other hand, the gene correlation is taken into account in the permutation tests of SeqGSA, thus the type I errors are under control for both the weighted algorithm (0.014) and the unweighted algorithm (0.011).
Power of simulation 2
Proportion of DE genes | 0.4 | 0.5 | 0.6 | 0.7 |
---|---|---|---|---|
Weighted | 0.22 | 0.53 | 0.71 | 0.87 |
Unweighted | 0.16 | 0.41 | 0.67 | 0.73 |
CAMERA | 0.09 | 0.19 | 0.26 | 0.39 |
Simulation 3: power on sets of long genes
Power of simulation 3
Proportion of DE genes | 0.4 | 0.5 | 0.6 | 0.7 |
---|---|---|---|---|
Weighted | 0.45 | 0.75 | 0.90 | 0.95 |
Unweighted | 0.49 | 0.76 | 0.88 | 0.95 |
Application to RNA-seq data set
Discussion
We propose SeqGSA, an extension of the GSA method for RNA-seq data. To calculate maxmean statistics from the RNA-seq count data, we replace the t test in the original GSA with a count based test, e.g. the negative binomial test in edgeR. The users can also substitute it with custom defined methods for the gene-level test. More importantly we propose a weighted restandardization approach to accommodate the gene length bias in RNA-seq data. Different from traditional expression arrays, gene length effect generally exists in RNA-seq data sets. To compute the null distribution of the maxmean statistic, GSA employs a randomization strategy that randomly samples genes with equal weights. However, having equal sampling weights ignores the length effect in the data. To adjust for the bias, we use the similarity of gene length as the sampling weight instead, so that the randomized sets will be more likely to consist of genes with similar length. This reduces the bias and yields more accurate null distribution of the maxmean statistics. A DE set comprised mainly of short genes will be more likely called significant in our weighted approach. On the other hand, a null set comprised of long genes will be less significant so that type I error is reduced for these sets. We show by simulations that our method reduces the length bias in RNA-seq data. In addition, we show that the weighted and non-weighted approaches have very similar results when there is no length bias present in the data.
There have been several other methods proposed to remove the length bias in enrichment testing with RNA-seq data [2, 3, 16]. In those works, the enrichment test employs the modified hypergeometric test, Wilcoxon rank sum test or logistic regression. These methods fall in the category of over-representation analysis (ORA) that compares the set against randomized sets. A fundamental difference of our method is that it works under the permutation scheme while also taking into account gene set randomization. An advantage of the permutation test is that it considers the correlation in a gene set and the correlation structure is maintained during sample label permutations. We compare our method against GOSeq in two simulations. GOSeq also implements a procedure for length bias adjustment. In GOSeq, the adjustment occurs on the gene level. A spline is fit to the proportion of DE genes and gene length. This spline is used to correct the sampling probability of the hypergeometric test. In our method, we adjust the bias on the gene set level. We recalculate the randomization mean and standard deviation based on weighted sampling so that it compares against gene sets of similar length. In the simulations, we control the level of DE by the parameter β. The value of β is chosen to represent two settings, β=0.15 for weak DE and β=0.30 for moderate DE, each representing the strength of the two methods. In simulation 1, the gene sets are sampled independently. We show that our method is more powerful in finding sets of genes with weak changes. In particular, when expression of genes is weakly changed, the test statistics for many DE genes fall below the threshold of multiple test adjustment in GOSeq. As a result, the power of detecting set of weakly DE genes is undermined. On the other hand, our method aggregates the weak signals of individual genes and increases the power of detecting such DE sets. When there is no correlation among the simulated gene sets, the type I errors of all methods are under the controlled level. In simulation 2, we use the data of an RNA-seq experiment and compare the three methods on simulated gene sets. The gene sets are constructed such that the genes are strongly correlated. This simulates the fact that many of the pre-defined gene sets are identified by gene-gene correlation. The hypergeometric test of GOSeq is based on the assumption that genes are independent. As a result of violated assumption, the type I error of GOSeq is inflated. On the other hand, the permutation test in SeqGSA takes into account the correlation structure within a gene set and thus maintains the type I error more accurately. We also compare SeqGSA with CAMERA, a competitive gene set test also based on sample permutations with adjustment to gene-set intra-correlation. We show that the maxmean statistic in SeqGSA is more sensitive to subtle but synchronized changes in the gene sets, which has been shown as one of the advantages of the original GSA method [15]. In both simulations, CAMERA has limited power to detect small to moderate changes on the gene-set level.
There are some limitations to our length-weighted method. First there are other sources of bias existent in the RNA-seq data, such as “GC-content”. It has been shown that genes with a large number of guanine (G) and cytosine (C) bases are preferentially read by sequencing machines and the effect may not be monotone [47]. A solution to such bias issues is to correct the bias at the gene level by modeling the number of reads or test statistics by gene length or GC content. However this method has its own problem since the true expression level is unknown. Without a comparison experiment, it is difficult to tell whether the difference in reads and test statistics results from true biological expression or biases. Second, the resampling step in SeqGSA weighted by the empirical CDF of gene length is a simple solution, but it does not completely remove the length bias. To improve the performance of the weighted restandardization, we may consider using tunable parameters to adjust the weight. Determining tuning parameters requires further exploration and assumptions. Third, our method improves the power for sets comprised of short genes, but it may slightly compromise the power to detect enriched sets comprised of long genes. This compromise is common for all methods of length bias adjustment. We feel this comprise is scientifically justifiable since the gain in power for small length gene pathways is much larger than the loss of power for large length gene pathways. The effect of this bias on the family-wise error rate or false discovery rate in multiple gene set testing needs further investigation and is part of our future work on this topic. Fourth, in this paper we focus on the exact binomial test in edgeR for gene-level test. However, there are many other tests that can be considered, e.g. see [21, 23, 48–51]. Alternatively, log fold change can be considered as it is not affected by length. This motivates a question for future work on an optimal test statistic for gene set analysis. Last, the computation performance of the weighted algorithm is slightly slower than the unweighted algorithm as it estimate mean and standard deviation using different weights, but the speeds are comparable as the most computationally intense step, computing the permutation z values, is the same for both methods.
Conclusions
We develop a gene set analysis method for RNA-seq data affected by gene length bias. This novel approach is designed to enhance the power to detect DE gene sets comprised of mainly small length genes. Importantly, we justify our method and demonstrate that it controls the type I error comparing to a representative ORA method for RNA-seq. Also, we show that without the presence of a gene length bias, our approach still performs nearly the same as the original unweighted algorithm in GSA. We expect our gene set analysis method will be of great utility to researchers performing gene set analysis with RNA-seq datasets.
Declarations
Acknowledgements
None.
Funding
This work was supported by Roswell Park Cancer Institute and National Cancer Institute (NCI) grant #P30 CA016056.
Availability of data and materials
SeqGSA can be accessed at https://github.com/xingrenub/SeqGSA.
Authors’ contributions
XR, SL, JW, JM designed the experiments and simulations. QH performed the sequence alignment and bioinformatics. XR performed the statistical analysis and wrote the manuscript. JM conceived the study and helped to draft the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009; 4(1):14.View ArticlePubMedPubMed CentralGoogle Scholar
- Gao L, Fang Z, Zhang K, Zhi D, Cui X. Length bias correction for RNA-seq data in gene set analyses. Bioinformatics. 2011; 27(5):662–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Mi G, Di Y, Emerson S, Cumbie JS, Chang JH. Length Bias Correction in Gene Ontology Enrichment Analysis Using Logistic Regression. PloS one. 2012; 7(10):e46128.View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Lovci MT, Kwon YS, Rosenfeld MG, Fu XD, Yeo GW. Determination of tag density required for digital transcriptome analysis: application to an androgen-sensitive prostate cancer model. Proc Natl Acad Sci. 2008; 105(51):20179–84.View ArticlePubMedPubMed CentralGoogle Scholar
- Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012; 8(2):e1002375.View ArticlePubMedPubMed CentralGoogle Scholar
- Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E. Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochem Res. 2004; 29(6):1213–22.View ArticlePubMedGoogle Scholar
- Al-Shahrour F, Díaz-Uriarte R, Dopazo J. Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information. Bioinformatics. 2005; 21(13):2988–93.View ArticlePubMedGoogle Scholar
- Goeman JJ, Van De Geer SA, De Kort F, Van Houwelingen HC. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004; 20(1):93–9.View ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al.Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005; 102(43):15545–50.View ArticlePubMedPubMed CentralGoogle Scholar
- Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ. Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci USA. 2005; 102(38):13544–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Kim SY, Volsky DJ. PAGE: parametric analysis of gene set enrichment. BMC Bioinforma. 2005; 6(1):144.View ArticleGoogle Scholar
- Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, et al.PGC-1 α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003; 34(3):267–73.View ArticlePubMedGoogle Scholar
- Jiang Z, Gentleman R. Extensions to gene set enrichment. Bioinformatics. 2007; 23(3):306–13.View ArticlePubMedGoogle Scholar
- Barry WT, Nobel AB, Wright FA. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics. 2005; 21(9):1943–9.View ArticlePubMedGoogle Scholar
- Efron B, Tibshirani R. On testing the significance of sets of genes. Ann Appl Stat. 2007; 1(1):107–29.View ArticleGoogle Scholar
- Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol. 2010; 11(2):1–12.View ArticleGoogle Scholar
- Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinforma. 2013; 14(1):7.View ArticleGoogle Scholar
- Wang X, Cairns MJ. SeqGSEA: a Bioconductor package for gene set enrichment analysis of RNA-Seq data integrating differential expression and splicing. Bioinformatics. 2014; 30(12):1777–9.View ArticlePubMedGoogle Scholar
- Wu D, Smyth GK. Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Res. 2012; 40(17):e133.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang W, Qin Z, Feng Z, Wang X, Zhang X. Identifying differentially spliced genes from two groups of RNA-seq samples. Gene. 2013; 518(1):164–70.View ArticlePubMedGoogle Scholar
- Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11(10):R106.View ArticlePubMedPubMed CentralGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1):139–40.View ArticlePubMedGoogle Scholar
- Li J, Witten DM, Johnstone IM, Tibshirani R. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2012; 13(3):523–38.View ArticlePubMedGoogle Scholar
- Efron B. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction. vol. 1. Cambridge: Cambridge University Press University Printing House; 2010.View ArticleGoogle Scholar
- Pounds S, Morris SW. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics. 2003; 19(10):1236–42.View ArticlePubMedGoogle Scholar
- Efron B, et al.Microarrays, empirical Bayes and the two-groups model. Stat Sci. 2008; 23(1):1–22.View ArticleGoogle Scholar
- Gold DL, Miecznikowski JC, Liu S. Error control variability in pathway-based microarray analysis. Bioinformatics. 2009; 25(17):2216–21.View ArticlePubMedPubMed CentralGoogle Scholar
- Miecznikowski JC, Gaile DP. A novel characterization of the generalized family wise error rate using empirical null distributions. Stat Appl Genet Mol Biol. 2014; 13(3):299–322.PubMedGoogle Scholar
- Billingsley P. Probability and measure. Hoboken: John Wiley & Sons; 2008.Google Scholar
- Basu A, Ghosh J. Identifiability of the multinormal and other distributions under competing risks model. J Multivar Anal. 1978; 8(3):413–29.View ArticleGoogle Scholar
- Ker AP. On the maximum of bivariate normal random variables. Extremes. 2001; 4(2):185–90.View ArticleGoogle Scholar
- Cain M. The moment-generating function of the minimum of bivariate normal random variables. Am Stat. 1994; 48(2):124–5.Google Scholar
- Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al.limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43(7):gkv007.View ArticleGoogle Scholar
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995; 57(1):289–300.Google Scholar
- Ren S, Peng Z, Mao JH, Yu Y, Yin C, Gao X, et al.RNA-seq analysis of prostate cancer in the Chinese population identifies recurrent gene fusions, cancer-associated long noncoding RNAs and aberrant alternative splicings. Cell Res. 2012; 22(5):806–21.View ArticlePubMedPubMed CentralGoogle Scholar
- Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009; 25(9):1105–11.View ArticlePubMedPubMed CentralGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. 9. 2012; 4:357–9.Google Scholar
- Langfelder P, Mischel PS, Horvath S. When is hub gene selection better than standard meta-analysis?PloS ONE. 2013; 8(4):e61505.View ArticlePubMedPubMed CentralGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al.Gene Ontology: tool for the unification of biology. Nat Genet. 2000; 25(1):25–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, et al.BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics. 2005; 21(16):3439–40.View ArticlePubMedGoogle Scholar
- Cheung VG, Nayak RR, Wang IX, Elwyn S, Cousins SM, Morley M, et al.Polymorphic cis- and trans-regulation of human gene expression. PLoS Biol. 2010; 8(9):e1000480.View ArticlePubMedPubMed CentralGoogle Scholar
- Sanchez A, Flamini M, Polak K, Palla G, Spina S, Mannella P, et al.Actin cytoskeleton remodelling by sex steroids in neurones. J Neuroendocrinol. 2012; 24(1):195–201.View ArticlePubMedGoogle Scholar
- Giretti MS, Simoncini T. Rapid regulatory actions of sex steroids on cell movement through the actin cytoskeleton. Steroids. 2008; 73(9):895–900.View ArticlePubMedGoogle Scholar
- Li X, Zimmerman A, Copeland NG, Gilbert DJ, Jenkins NA, Yin HL. The mouse thymosin β4 gene: structure, promoter identification, and chromosome localization. Genomics. 1996; 32(3):388–94.View ArticlePubMedGoogle Scholar
- Lee Hr, Yoon SY, Kang HB, Park S, Kim KE, Cho YH, et al.Thymosin beta 4 enhances NK cell cytotoxicity mediated by ICAM-1. Immunol Lett. 2009; 123(1):72–6.View ArticlePubMedGoogle Scholar
- Torikai H, Akatsuka Y, Miyazaki M, Warren EH, Oba T, Tsujimura K, et al.A novel HLA-A* 3303-restricted minor histocompatibility antigen encoded by an unconventional open reading frame of human TMSB4Y gene. J Immunol. 2004; 173(11):7046–54.View ArticlePubMedGoogle Scholar
- Hansen KD, Irizarry RA, Wu Z. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012; 13(2):204–16.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010; 26(1):136–8.View ArticlePubMedGoogle Scholar
- Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinforma. 2010; 11(1):422.View ArticleGoogle Scholar
- Zhou YH, Xia K, Wright FA. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics. 2011; 27(19):2672–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Di Y, Schafer DW, Cumbie JS, Chang JH. The NBP negative binomial model for assessing differential gene expression from RNA-seq. Stat Appl Genet Mol Biol. 2011; 10(1).Google Scholar