Variant Set Enrichment: an R package to identify disease-associated functional genomic regions
© The Author(s). 2017
Received: 8 October 2016
Accepted: 14 February 2017
Published: 22 February 2017
Genetic predispositions to diseases populate the noncoding regions of the human genome. Delineating their functional basis can inform on the mechanisms contributing to disease development. However, this remains a challenge due to the poor characterization of the noncoding genome. Here, we propose an R package that can pinpoint which genomic features are etiologically important based on the genetic predispositions.
Variant Set Enrichment (VSE) is an R package to calculate the enrichment of a set of disease-associated variants across functionally annotated genomic regions, consequently highlighting the mechanisms important in the etiology of the disease studied.
VSE is implemented as an R package and can easily be implemented in any system with R.
KeywordsGWAS Disease Noncoding region Enrichment Regulatory region AVS
Over 80% of genetic predisposition, namely risk-loci populated by Single Nucleotide Polymorphisms (SNPs), to human diseases identified by Genome Wide Association Studies (GWAS) map to noncoding DNA [1–3]. In other words, most disease-associated SNPs do not directly alter coding sequences. Over the last decade, the functional annotation of the coding and noncoding genome across a wide collection of cell and tissue types benefited from the integration of maps of transcriptional activity from both coding and noncoding transcripts, such as miRNA and long-noncoding RNAs (lncRNAs) as well as chromatin-protein binding profiles, inclusive of transcription factors and epigenetics modifications, and open chromatin. This functional annotation provides a unique opportunity to delineate the functional basis of genetic predispositions to disease.
Here, we present a computational method, named Variant Set Enrichment (VSE) that computes the enrichment/depletion of the set of genetic predisposition for a disease of interest over functional genomic annotations. We previously used a VSE-based approach to identify the enrichment of Breast Cancer (BCa) genetic predispositions at enhancers bound by FOXA1 and ESR1 in breast cancer cells . VSE relies on the set of genetic predispositions and functional annotations; this renders VSE applicable to the study of any genetically inherited disease for which these data are available.
A genetic predisposition (risk-locus) identified by GWAS corresponds to a SNP found on the GWAS array (termed as “tagSNP”) and all SNPs missing from the array but known to be in Linkage Disequilibrium (LD) with the tagSNP (termed as “ldSNP”) . The sum of all genetic predispositions to a particular disease, ie: all the tagSNPs and their ldSNPs constitute the Associated Variant Set (AVS) for that disease.
The identity of risk-loci is user defined, as the cut-off for LD determination is subjected to study preferences. Occasionally, two or more risk-loci for a particular disease can overlap with one another by a common ldSNP. If the common ldSNP overlaps with a functional genomic annotation of interest, the enrichment score calculated by VSE can be inflated because each risk-locus inclusive of this ldSNP would be counted independently. To correct for this possibility, VSE computes a network of all SNPs in which each SNP is represented as a node and the pairwise LD as an edge. Each cluster in the network represents a disjointed locus, as such, a ldSNP is present only in one locus (Additional file 1: Figure S1). VSE then computes the enrichment score of the AVS for each functional genomic annotation of interest in three sequential steps. In the first step, VSE tallies the number of independent risk-loci that overlaps with the functional genomic annotations. Overlapping of a risk-locus is defined as at least one member SNP found within the functional genomic annotation of interest.
This preliminary tallying of AVS may indicate which genomic annotations are functionally related to risk-associated variants, but the overlapping can be affected by size and structure of the AVS. To correct for these biases, VSE, in the second step, computes a null distribution of the overlap tallies that is based on random permutation of AVS. The null AVS is computed by randomly sampling SNPs from a comprehensive pool of tagSNPs present on the GWAS arrays (Illumina Human OmniExpress) and clustering them with their ldSNPs imputed from the 1000 Genome Project Phase III data. When calculating the set of null AVS, VSE makes sure that each set is built in the way that it has identical total number of null loci as the total number of risk-loci in the AVS; and each null locus is matched in size to the corresponding query locus. We defined each null AVS as Matched Random Variant Sets (MRVS).
In the third step, VSE tallies the overlapping of MRVS with the functional genomic annotations of interest. This provides the null distribution to calculate for the enrichment/depletion of the AVS across different functional genomic annotations. To make the enrichment analysis comparable across all functional genomic annotations of interest, MRVS tally is centered at the median and scaled to the standard deviations of the null distribution. The enrichment score is then defined as the number of standard deviations that the overlapping tally deviates from the null overlapping tally median. VSE calculates an exact P-value for significance of the enrichment/depletion by fitting a density function to the null distribution derived from the MRVS. The level of significance is corrected for multiple testing using Bonferroni method. The deviation of the null distribution from the normality is tested using Kolmogorov-Smirnov test; and if the distribution deviates, the Box-Cox power transformation is applied on to the null to approach normality.
A set of genetic variants that are strongly associated with a particular disease holds clues about the underlying the mechanism of the development of the disease. VSE provides an easy approach to delineate such information by pinpointing the genomic features that are most affected by the genetic predispositions of that particular disease. In our preliminary analysis, we demonstrate that the genetic variants associated with prostate cancer and breast cancer are significantly over-represented in regulatory regions, while the variants associated with the lung cancer are enriched in coding regions. VSE can be easily implemented in R in any platform. A usage vignette is available in the VSE webspage in CRAN repository and also found in the Additional file 2.
Associated variant set
DNaseI hypersensitivity site
Genomewide Association Study
Matched random variant sets
Single nucleotide polymorphisms
Variant set enrichment
We thank Kinjal Desai for helping with setting up the VSE. We acknowledge the ENCODE consortium and the ENCODE production laboratories that generated the data sets provided by the ENCODE Data Coordination Center used in the manuscript.
This work was supported by PMCF (to H.H.H. and M.L.), CFI and ORF (CFI32372 to H.H.H.), NSERC (498706 to H.H.H.), WICC and CCS (703800 to H.H.H.), CCSRI (702922 to M.L.), PCC (RS2016-02 to H.H.H; RS2014-04 to M.L.), CIHR (142246 to H.H.H.), NCI at the NIH (R01CA155004 to M.L.; LM009012 and LM010098 to J.H.M.). M.L. holds a young investigator award from the OICR and a new investigator salary award from the CIHR.
Availability of data and materials
VSE is freely available as an R package in the CRAN repository [https://cran.r-project.org/web/packages/VSE/index.html].
MA developed the R package and drafted the manuscript. RCS developed the enrichment algorithm. HG and JHM tested the package in multiple platforms. HHH and ML supervised the study and contributed significantly to finalizing the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Cowper-Sal·lari R, et al. Breast cancer risk–associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression. Nat Genet. 2012;44:1191–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Maurano MT, et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science. 2012;337:1190–5.View ArticlePubMedPubMed CentralGoogle Scholar
- Schaub MA, et al. Linking disease associations with regulatory information in the human genome. Genome Res. 2012;22:1748–59.View ArticlePubMedPubMed CentralGoogle Scholar
- Hindorff LA, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106:9362–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Welter D, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42:D1001–6.View ArticlePubMedGoogle Scholar
- The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65.View ArticlePubMed CentralGoogle Scholar
- ENCODE. Project Consortium An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.View ArticleGoogle Scholar
- Hazelett DJ, et al. Comprehensive Functional Annotation of 77 Prostate Cancer Risk Loci. PLoS Genet. 2014;10:e1004102.View ArticlePubMedPubMed CentralGoogle Scholar
- Taberlay PC, et al. Reconfiguration of nucleosome-depleted regions at distal regulatory elements accompanies DNA methylation of enhancers and insulators in cancer. Genome Res. 2014;24:1421–32.View ArticlePubMedPubMed CentralGoogle Scholar