This article has Open Peer Review reports available.
Testing multiple hypotheses through IMP weighted FDR based on a genetic functional network with application to a new zebrafish transcriptome study
- Jiang Gui†1, 4Email author,
- Casey S. Greene†2,
- Con Sullivan3, 5,
- Walter Taylor2,
- Jason H. Moore6 and
- Carol Kim3, 5
© Gui et al. 2015
Received: 3 December 2014
Accepted: 8 June 2015
Published: 17 June 2015
In genome-wide studies, hundreds of thousands of hypothesis tests are performed simultaneously. Bonferroni correction and False Discovery Rate (FDR) can effectively control type I error but often yield a high false negative rate. We aim to develop a more powerful method to detect differentially expressed genes. We present a Weighted False Discovery Rate (WFDR) method that incorporate biological knowledge from genetic networks. We first identify weights using Integrative Multi-species Prediction (IMP) and then apply the weights in WFDR to identify differentially expressed genes through an IMP-WFDR algorithm. We performed a gene expression experiment to identify zebrafish genes that change expression in the presence of arsenic during a systemic Pseudomonas aeruginosa infection. Zebrafish were exposed to arsenic at 10 parts per billion and/or infected with P. aeruginosa. Appropriate controls were included. We then applied IMP-WFDR during the analysis of differentially expressed genes. We compared the mRNA expression for each group and found over 200 differentially expressed genes and several enriched pathways including defense response pathways, arsenic response pathways, and the Notch signaling pathway.
With the rapid development of novel high-throughput deep sequencing technology, the study of functional transcriptomes has changed dramatically. Compared to microarray-based measurements, RNA-sequencing (RNA-seq) technology can profile RNA transcript abundance within greater depth and accuracy. It can effectively detect alternative splicing variants and novel transcripts and does not require an assembled genome sequence. Study  compared high-throughput sequencing data in Illumina and Affymetrix platform and showed that sequencing data are highly replicable, with much smaller technical variation when compared to microarray data. In many cases, it is sufficient to sequence each mRNA sample only once.
RNA-seq analysis raises significant challenges when tens of thousands of transcripts are tested at the same time in order to find those that are differentially expressed under given conditions. Multiple testing correction is required to adequately control type I error, and frequently Bonferroni correction  is used to adjust for multiple testing. The Bonferroni correction is highly stringent, and this correction controls the error rate by adjusting the significance threshold to maintain the specified p-value threshold across all tests. This strict correction is less appropriate for genomic data analysis, where the cost of a false positive is relatively low, the number of observations is relatively low, and the number of tests is high. Alternative criteria that offer a lower false negative rate (i.e. missed findings) are more appropriate in this setting. The false discovery rate (FDR) was proposed by Benjamini and Hochberg  for such settings. This test is more lenient than attempts to control the family wise error rate (FWER), and it allows adequate control of false positives during genomic data analysis and has become widely used [4–7].
While FDR provides an improvement for Type II when compared to FWER-based methods, the FDR method still suffers from reduced power when there are a large number of tests. The incorporation of weights representing the strength of existing evidence was developed by Genovese et al.  and provides a means to further improve power. Genovese et al. proved that weight assignments consistent with the rule that the sum of the weights equaling the total number of tests resulted in effective control of the FDR at the nominal level (i.e. with WFDR at a nominal level of 0.05, 5 % of the discoveries would be false positives just as with the unweighted FDR). In the field of biology, researchers have access to substantial information about biology from both published experiments and existing biological data. Leveraging this knowledge to improve the power of genomic tests would provide significant advantages; and FDR weighting methods that combine biological information and knowledge with gene expression measurements to more effectively identify differentially expressed genes represent an area of unmet need.
The Integrative Multi-species Prediction [9, 10] webserver has collected over 2,000 microarray datasets from the NCBI Gene Expression Omnibus (GEO), large collection of biochemical experiments from the Database of Protein and Genetic Interactions (BioGRID), tissue-specificity and phenotype characterizations from The Zebrafish Model Organism Database (ZFIN) and other sources for seven organisms. IMP integrates these data into functional relationship networks using naïve Bayesian classifiers. Functional networks represent an efficient and comprehensive summary of existing publicly available data. Specifically, in each functional network, each node represents a gene and each edge between two genes is the posterior probability that the two genes are involved in the same process or pathway . IMP is available through a web server (http://imp.princeton.edu/). IMP provides gene-gene functional relationship networks from multiple organisms including zebrafish. The IMP networks have been used to identify novel genes involved in the development of left-right asymmetry in zebrafish embryos, which were validated through knockdown experiments . While these networks have demonstrated utility for guiding targeted experiments, methods that allow researchers to use these networks to effectively analyze new large-scale experiments would address a currently unmet need.
In this paper, we develop and evaluate a novel IMP-WFDR method that uses state-of-the-art genetic network information to infer appropriate weights for WFDR and adjusts the p-values from simultaneous tests to correct for multiple testing issue.
The genetic disorder cystic fibrosis (CF) results from a loss of function variant in the cystic fibrosis transmembrane conductance regulator (CFTR) gene. CF individuals have difficulty with airway clearance, and most will develop chronic P. aeruginosa  infections. While the underlying cause of CF is genetic, the environment is expected to play a role in the many factors of the disease. For example, arsenic has recently been shown to alter CFTR function [13, 14]. We developed a zebrafish model for CF and found that both cftr and arsenic individually affect innate immunity . Here we identify gene expression changes associated with the innate immune response in the context of arsenic exposure. We performed RNA sequencing of control zebrafish as well as those exposed to arsenic at levels seen in the environment (2 ppb and 10 ppb) and those experiencing a chronic P. aeruginosa infection. We apply our newly developed IMP-WFDR method to detect genes that were differentially expressed upon exposure to arsenic and/or infection with P. aeruginosa. The IMP-WFDR method integrates large data compendia through functional networks and focuses these data on the analysis of a new genome-scale experiment to better identify relevant candidates.
Genovese et al.  proposed a simple weighted FDR procedure in which non-negative weights Wi are assigned to each p-value such that ∑ i = 1 m Wi = m. Here m refer th the number of total genes. The BH procedure is applied directly to Qi = Pi/Wi, where Pi denotes the unweighted p-value. They have proven that weighted FDR controls FDR at the nominal level.
Identify a small set of key genes that relate to a biological process of interest.
Input the set of gene names into IMP (https:// imp.princeton.edu) and identify networks that linked to those key genes.
For each gene i in the network, calculate the total relationship confidence i, TRCi, which is the likelihood of connecting with other genes in the network. It can be calculated by summing up the confident scores of the edges linking to gene i.
Set Wi ∝ TRCi and apply weighted FDR to adjust for multiple testing.
Zebrafish experiment and RNA-seq dataset
Weight assignment for zebrafish experiment
To calculate weights for each gene using the IMP-WFDR procedure, we first identified 14 important genes that are associated with arsenic exposure (fn1, notch1a, notch1b, pik3r1, akt2, ass1, nfkb2, foxo5) and immune response (il1b, tnfa, il8, mmp9, irak3, ifnphi1), based on prior biological knowledge.. We queried IMP for these genes and obtained the subnetwork of all genes connected to these 14 starting genes. Out of a total of 27,834 genes, IMP was able to identify 18,803 genes in the subnetwork. The distribution of the genes’ connectivity score is right skewed and 75 % of genes’ TRC are under 0.01 with only 32 genes have a TRC greater than 1. The TRC for the 14 query genes is ranging from 0.16 to 419 with a mean of 72 and standard deviation of 113. Gene ‘fn1’ received a TRC of 419 which is also the largest of all genes. This is not surprising because the query genes are served as hubs for the IMP network. To stabilize the variance, we tried “log + 1” transformation for all TRCs and divided the mean value to generate the weight for Zebrafish experiment.
We developed the novel IMP-weighted FDR (IMP-WFDR) procedure to assign FDR weights. We applied this method to an RNA-seq experiment that tests an organism’s response to exposure to pathogens and arsenic, and found that IMP-WFDR effectively identified important players.
RNA-seq data analysis results
Compare IMP-WFDR and FDR method in zebrafish experiment
Arsenic 0 vs 10 ppb for PBS samples at 6H
Arsenic 0 vs 10 ppb for P. aeruginosa at 6H
PBS vs P. aeruginosa for Arsenic=0 ppb at 12H
PBS vs P. aeruginosa for Arsenic=10 ppb at 6H
GSEA results for overlapping genes
Notch signaling pathway
19.4 % (7/36)
0.8 % (49/6131)
notch1b, jag2 gro2, dlb, dla, dld, jag1a
regulation of defense response
5.6 % (2/36)
0.0 % (3/6131)
regulation of immune effector process
5.6 % (2/36)
0.1 % (4/6131)
response to arsenic containing substance
5.6 % (2/36)
0.1 % (5/6131)
The differential expression of the ifnphi1 gene recapitulates findings that that mRNA expression of the antiviral cytokine interferon (ifn) was differentially expressed in arsenic-exposed zebrafish before and after viral infection in a previous study . It is also important to note that the unweighted FDR procedure fails to identify these differentially expressed genes in any of the 4 comparisons. However, IMP-WFDR identifies the pair in 3 out of the 4 comparisons. This indicates that our proposed method can identify more biologically relevant genes than an unweighted FDR method, which ignores existing biological data and knowledge.
Conclusion and discussion
We have developed a novel IMP-WFDR method that derives weights from a state-of-the-art data integration algorithm and incorporates them in WFDR to more effectively account for multiple test given the context of available biological data and knowledge. IMP has previously been used to guide morpholino assays in zebrafish  and our IMP-WFDR procedure greatly extends IMP’s application by improving the analysis of genomic experiments through such functional networks. We demonstrate through both a simulation study and analysis of RNA-seq data that our proposed method identifies more differentially expressed genes than unweighted FDR or Bonferroni correction, while maintaining appropriate control of the false discovery rate. We used RPKM to obtain p-values as our primary goal was to study IMP Weighted FDR, but in practice edgeR  or DEseq  can be used to calculate p-values for RNA-Seq reads.
The advantage of IMP-WFDR is that it can effectively incorporate expert biological knowledge from published experiments and existing biological data as weights and control false discovery rate to adjust for multiple testing. Roeder and Wasserman proposed a two-valued weighting scheme  to up weight all p-values in a pre-specified priority list and down-weight the rest. The drawback of this approach is that it demotes novel findings since all the unknown genes are down-weighted and have less of a chance to be detected than in the un-weighted case. Rather than up- weight only genes in the priority list, IMP-WFDR obtains its weight based on the functional subgenetic network defined by the priority genes. Thus, the choice of weight is much less dependent on the given priority list. From the Table 1, we see that there were over 200 genes identified by the IMP-WFDR approach while the input list contains only 14 genes. The sensitivity analysis results also support this conclusion. Despite the advantages of IMP-WFDR, it is important to recognize disadvantages of this and other weighted FDR approaches. The weights used by IMP-WFDR are the total relationship confidence, which is ad hoc. Future work will focus on identifying an optimal transformation of the network information into gene specific weights. We will also consider utilizing pathway information, which is contained alongside the IMP networks in the IMP webserver, to identify an optimal weighting scheme.
In conclusion IMP-WFDR is a powerful analytic tool that embraces state-of-the-art genetic network information in identifying differentially expressed genes in high-dimensional settings. We expect that IMP-WFDR and other tools that allow biologists to analyze high-throughut data in the context of biological relationships will play a key role in identifying genes differentially regulated in key models of human physiology.
This project was supported by grants from the National Institutes of Health: 5P20RR024475-02, 8P20GM103534-02, GM103534, LM009012 and LM010098. The RNA sample processing was carried out at Geisel School of Medicine in the Genomics Shared Resource, which was established by equipment grants from the NIH and NSF and is supported in part by a Cancer Center Core Grant (CA 23108) from the National Cancer Institute.
- Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18(9):1509–17. PMCID: PMC2527709.View ArticlePubMedPubMed CentralGoogle Scholar
- Bonferroni CE. Il calcolo delle assicurazioni su gruppi di teste. Tipografia del Senato; 1935.Google Scholar
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Met. 1995;57(1):289–300.Google Scholar
- Smyth GK. Limma: Linear models for microarray data. In: Bioinformatics and computational biology solutions using R and Bioconductor. Springer; 2005. p. 397–420.Google Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010;11(10):R106.View ArticlePubMedPubMed CentralGoogle Scholar
- Gui J, Tosteson TD, Borsuk M. Weighted multiple testing procedures for genomic studies. BioData Min. 2012;5(1):4. PMCID: PMC3458887.View ArticlePubMedPubMed CentralGoogle Scholar
- Genovese CR, Roeder K, Wasserman L. False discovery control with p-value weighting. Biometrika. 2006;93(3):509–24.View ArticleGoogle Scholar
- Wong AK, Park CY, Greene CS, Bongo LA, Guan Y, Troyanskaya OG. IMP: A multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res. 2012;40(Web Server issue):W484–90. PMCID: PMC3394282.View ArticlePubMedPubMed CentralGoogle Scholar
- Park CY, Wong AK, Greene CS, Rowland J, Guan Y, Bongo LA, et al. Functional knowledge transfer for high-accuracy prediction of under-studied biological processes. PLoS computational biology. 2013;9(3), e1002957.View ArticlePubMedPubMed CentralGoogle Scholar
- Huttenhower C, Haley EM, Hibbs MA, Dumeaux V, Barrett DR, Coller HA, et al. Exploring the human genome with functional maps. Genome Res. 2009;19(6):1093–106. PMCID: PMC2694471.View ArticlePubMedPubMed CentralGoogle Scholar
- Young RL, Malcolm KC, Kret JE, Caceres SM, Poch KR, Nichols DP, et al. Neutrophil extracellular trap (NET)-mediated killing of pseudomonas aeruginosa: Evidence of acquired resistance within the CF airway, independent of CFTR. PLoS One. 2011;6(9), e23637.View ArticlePubMedPubMed CentralGoogle Scholar
- Bomberger JM, Coutermarsh BA, Barnaby RL, Stanton BA. Arsenic promotes ubiquitinylation and lysosomal degradation of cystic fibrosis transmembrane conductance regulator (CFTR) chloride channels in human airway epithelial cells. J Biol Chem. 2012;287(21):17130–9. PMCID: PMC3366821.View ArticlePubMedPubMed CentralGoogle Scholar
- Shaw JR, Bomberger JM, VanderHeide J, LaCasse T, Stanton S, Coutermarsh B, et al. Arsenic inhibits SGK1 activation of CFTR cl< sup>−</sup> channels in the gill of killifish,< i> fundulus heteroclitus</i> Aquatic toxicology. 2010;98(2):157–64.View ArticlePubMedPubMed CentralGoogle Scholar
- Nayak AS, Lage CR, Kim CH. Effects of low concentrations of arsenic on the innate immune system of the zebrafish (danio rerio). Toxicol Sci. 2007;98(1):118–24.View ArticlePubMedGoogle Scholar
- Hibbs MA, Myers CL, Huttenhower C, Hess DC, Li K, Caudy AA, et al. Directing experimental biology: A case study in mitochondrial biogenesis. PLoS computational biology. 2009;5(3), e1000322.View ArticlePubMedPubMed CentralGoogle Scholar
- Guan Y, Myers CL, Lu R, Lemischka IR, Bult CJ, Troyanskaya OG. A genomewide functional network for the laboratory mouse. PLoS computational biology. 2008;4(9), e1000165.View ArticlePubMedPubMed CentralGoogle Scholar
- Lage CR, Nayak A, Kim CH. Arsenic ecotoxicology and innate immunity. Integr Comp Biol. 2006;46(6):1040–54.View ArticlePubMedGoogle Scholar
- Hermann AC, Kim CH. Effects of arsenic on zebrafish innate immune system. Marine Biotechnology. 2005;7(5):494–505.View ArticlePubMedGoogle Scholar
- Phennicie RT, Sullivan MJ, Singer JT, Yoder JA, Kim CH. Specific resistance to pseudomonas aeruginosa infection in zebrafish is mediated by the cystic fibrosis transmembrane conductance regulator. Infect Immun. 2010;78(11):4542–50. PMCID: PMC2976322.View ArticlePubMedPubMed CentralGoogle Scholar
- Hamming OJ, Lutfalla G, Levraud JP, Hartmann R. Crystal structure of zebrafish interferons I and II reveals conservation of type I interferon structure in vertebrates. J Virol. 2011;85(16):8181–7. PMCID: PMC3147990.View ArticlePubMedPubMed CentralGoogle Scholar
- Bradley LM, Douglass MF, Chatterjee D, Akira S, Baaten BJ. Matrix metalloprotease 9 mediates neutrophil migration into the airways in response to influenza virus-induced toll-like receptor signaling. PLoS pathogens. 2012;8(4), e1002641.View ArticlePubMedPubMed CentralGoogle Scholar
- Anders S, McCarthy DJ, Chen Y, Okoniewski M, Smyth GK, Huber W, et al. Count-based differential expression analysis of RNA sequencing data using R and bioconductor. Nature protocols. 2013;8(9):1765–86.View ArticlePubMedGoogle Scholar
- Roeder K, Wasserman L. Genome-wide significance levels and weighted hypothesis testing. Stat Sci. 2009;24(4):398–413. PMCID: PMC2920568.View ArticlePubMedPubMed CentralGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.