- Software article
A biologically informed method for detecting rare variant associations
- Carrie Colleen Buchanan Moore,
- Anna Okula Basile,
- John Robert Wallace,
- Alex Thomas Frase and
- Marylyn DeRiggi Ritchie
© The Author(s). 2016
- Received: 4 January 2016
- Accepted: 18 June 2016
- Published: 30 August 2016
BioBin is a bioinformatics software package developed to automate the process of binning rare variants into groups for statistical association analysis using a biological knowledge-driven framework. BioBin collapses variants into biological features such as genes, pathways, evolutionarily conserved regions (ECRs), protein families, regulatory regions, and others based on user-designated parameters. BioBin provides the infrastructure to create complex and interesting hypotheses in an automated fashion, thereby circumventing the necessity for advanced and time-consuming scripting.
Purpose of the study
In this manuscript, we describe the software package for BioBin, along with type I error and power simulations to demonstrate the strengths and various customizable features and analysis options of this variant binning tool.
Simulation testing highlights the utility of BioBin as a fast, comprehensive and expandable tool for the biologically-inspired binning and analysis of low-frequency variants in sequence data.
Conclusions and potential implications
The BioBin software package has the capability to transform and streamline the analysis pipelines for researchers analyzing rare variants. This automated bioinformatics tool minimizes the manual effort of creating genomic regions for binning such that time can be spent on the much more interesting task of statistical analyses. This software package is open source and freely available from http://ritchielab.com/software/biobin-download
- Minor Allele Frequency
- Weighting Scheme
- Rare Variant Analysis
- Allele Frequency Threshold
- Genome Project Phase
Recent advances in sequencing technology and drastic decreases in cost have facilitated the generation of a prolific amount of sequence data. This has presented an opportunity for the investigation of low frequency and rare sequence variants beyond traditional genome-wide association (GWA) based approaches. Rare variants have recently been implicated in multifactorial conditions ranging from neurodegenerative diseases like Alzheimer’s and Parkinson’s disease, to metabolic disorders, such as obesity, and various cancers, including both prostate and lung cancer [1–6]. Elucidating the influence of rare variants on common diseases may expand our understanding of the heritability of complex traits, and it may eventually provide information that is useful to clinical patient care through the implementation of personalized, preventive practices.
Even with increased data availability, progress toward understanding rare genomic variation and its association with common human disease lags behind technological sequencing advances. Scientists are hindered in exploiting these advances because strategies for analyzing these data are underdeveloped. The growing disparity between rapidly advancing data collection and slowly developing data analysis methods mandates a more concerted research effort to develop the necessary analytical tools for successful interpretation of genetic and biological data. Tools designed specifically for rare and low-frequency variant analysis require special considerations, as these variants are individually uncommon and often statistically underpowered for detecting phenotypic association [7, 8]. Also, the large sample size requirements may be prohibitive. To increase the composite allele frequency and analyze smaller sample sizes, collapsing or binning methods are commonly utilized. Collapsing methods aggregate variants into a single genetic variable, which can then be used for subsequent statistical analysis, thereby reducing the number of degrees of freedom and also improving power in the analysis.
Many previous strategies developed for rare variants have focused on the statistical analysis of a pre-defined region rather than how to best group variants in an informative manner. Agnostic or un-informed binning approaches can often lead to a decrease in power when there are variants with different directions of effect or too many neutral variants that mitigate the signal. The most successful collapsing method groups variants likely to have an impact on the function of a specific gene or genomic unit and compares the variant distribution or composite genetic score distribution across the trait of interest.
BioBin [10–12] is a novel bioinformatics tool developed for the multi-level binning of rare variants using a biological knowledge-driven framework. BioBin collapses variants into user-designated biological features such as genes, pathways, evolutionarily conserved regions (ECRs), protein families, regulatory regions, and others. Further, BioBin provides the infrastructure to create complex and interesting hypotheses in an automated fashion, thereby circumventing the necessity for advanced and time-consuming scripting. Simulation testing highlights the utility of BioBin as a fast, comprehensive and expandable tool for the biological binning and analysis of low-frequency variants in sequence data. While multiple biological applications of BioBin have previously been described [10–13], this manuscript concentrates on the software features, specifications and various analysis options within the BioBin package. We focus on presenting a comprehensive description of the capabilities of BioBin to provide a resource for users to tailor binning analyses to their specific hypotheses. Additionally, we demonstrate the utility of this software through type I error and power simulations. The BioBin software package has the capability to transform and streamline analysis pipelines for researchers analyzing rare variants in DNA sequencing data. This automated bioinformatics tool minimizes the manual task of curating biologically-relevant regions for binning, such that efforts can instead be spent on subsequent statistical analyses. This software package is open source and freely available from http://ritchielab.com/software/biobin-download.
BioBin resource requirements
BioBin is a stand-alone command line application written in C++ that relies on a locally built Library of Knowledge Integration (LOKI) database to create knowledge-based bins. Source distributions are available for Mac and Linux operating systems and require minimal prerequisites to compile. The BioBin distribution includes tools that allow the user to create and update the LOKI database by downloading information directly from source websites. BioBin is open-source and publicly available for download on the Ritchie lab website (https://ritchielab.com/software/biobin-download).
BioBin software features
Library of Knowledge Integration (LOKI)
LOKI is implemented in SQLite, a relational database management system, which does not require a dedicated database server. A system initially building LOKI should have approximately 100 GB of disk storage available for the LOKI database file, the LOKI source data, and space for Python installer scripts. An updater script will automatically process and combine information from the various sources into a single database file (some of the temporary files are removed during this process). Once the build is complete, the LOKI database file required to run BioBin will be under 25 GB. The script to build LOKI is open source, publicly available on the Ritchie lab website, and is included with the BioBin software. Users with knowledge of relational databases can customize their LOKI database by including or excluding sources, providing additional sources, and updating source information as frequently as needed.
Multi-level binning and filtering
In addition to binning variants based on knowledge, BioBin also provides an option to bin variants that do not associate with any available knowledge. These are known as inter-region bins, or if generated between gene features, intergenic bins. After feature selection using LOKI and/or external custom files, inter-region bins can be created using a configurable width parameter (in kb). These bins catch variants that do not fit into biologically defined feature types (see intergenic bin labels in Fig. 4). For example, if one were testing low frequency burden differences between two groups across genes, all variants in genes would be collapsed into respective gene bins, and variants outside of gene boundaries would be binned based on genomic location in intergenic regions.
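The gene versus inter-region assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `genes` table, the function name `assign_bins`, the bin label format, and the choice to anchor fixed-width windows at the start of the chromosome are all hypothetical and are not taken from the BioBin source.

```python
def assign_bins(chrom, pos, genes, width_kb=50):
    """Return the bin label(s) for a variant at 1-based position `pos`.

    genes: list of (name, chrom, start, end) tuples (hypothetical region table).
    A variant inside one or more gene regions is assigned to those gene bins;
    any other variant falls into a fixed-width inter-region bin.
    """
    hits = [name for name, c, start, end in genes
            if c == chrom and start <= pos <= end]
    if hits:
        return hits
    # Assumption: inter-region windows are counted from the chromosome start.
    width = width_kb * 1000
    index = (pos - 1) // width
    return [f"intergenic:chr{chrom}:{index}"]


genes = [("GENE_A", "1", 10_000, 20_000)]
print(assign_bins("1", 15_000, genes))  # variant inside GENE_A
print(assign_bins("1", 60_000, genes))  # variant outside any gene
```

Changing `width_kb` changes how many intergenic variants are collapsed together, which is the role the configurable width parameter plays in the description above.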
Locus selection and models
The framework of a BioBin analysis is to determine biological features upon which data will be binned, such as genes, pathways or intergenic regions, and execute bin generation using LOKI. For locus binning, BioBin follows an allele frequency threshold approach using the non-major allele frequency (NMAF). NMAF is defined as 1 minus the frequency of the most common allele, and at biallelic markers, NMAF and minor allele frequency (MAF) are interchangeable. BioBin allows variants below a user-specified NMAF in the case or the control group to be binned, thereby facilitating the aggregation of both potential risk and protective variants. In order to alleviate increased Type I error, BioBin also gives an option to use the minimum of the NMAF in either case or control group as the value to test against the given NMAF threshold.
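As a small worked illustration of the NMAF definition, the following Python sketch computes NMAF from a list of observed alleles and applies a threshold to the minimum of the case and control values. The function names and the 0.05 threshold are assumptions chosen for the example; they do not correspond to BioBin options.

```python
from collections import Counter


def nmaf(alleles):
    """Non-major allele frequency: 1 minus the frequency of the most common allele."""
    counts = Counter(alleles)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no alleles observed")
    major = max(counts.values())
    return (total - major) / total


def rare_in_either_group(case_alleles, control_alleles, threshold=0.05):
    """True if the smaller of the case and control NMAF is below the threshold."""
    return min(nmaf(case_alleles), nmaf(control_alleles)) < threshold


# Triallelic locus: NMAF pools all non-major alleles (7 + 3 out of 100).
print(nmaf(["A"] * 90 + ["C"] * 7 + ["G"] * 3))

cases = ["A"] * 98 + ["T"] * 2      # NMAF 0.02
controls = ["A"] * 90 + ["T"] * 10  # NMAF 0.10
print(rare_in_either_group(cases, controls))
```

At a biallelic locus the same function returns the ordinary MAF, which is why the two quantities are interchangeable there.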
BioBin provides multiple disease model options for determining individual contribution in a bin. This includes additive, dominant, or recessive encoding allowing the user to test specific hypotheses using these inheritance patterns. The default option utilizes additive encoding, where each allele adds to an individual bin score.
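The three encodings can be illustrated with a short Python sketch. It uses the conventional definitions of additive, dominant and recessive coding on a count of non-major alleles (0, 1 or 2) per locus; the function names are hypothetical, and we assume rather than assert that BioBin's internal encoding matches these definitions exactly.

```python
def locus_contribution(allele_count, model="additive"):
    """Contribution of one locus given the number of non-major alleles (0, 1 or 2)."""
    if model == "additive":
        return allele_count                    # each allele adds to the score
    if model == "dominant":
        return 1 if allele_count >= 1 else 0   # one copy is sufficient
    if model == "recessive":
        return 1 if allele_count == 2 else 0   # two copies are required
    raise ValueError(f"unknown model: {model}")


def bin_score(genotypes, model="additive", weights=None):
    """Sum the (optionally weighted) locus contributions for one individual."""
    if weights is None:
        weights = [1] * len(genotypes)
    return sum(w * locus_contribution(g, model)
               for g, w in zip(genotypes, weights))


genotypes = [0, 1, 2, 1]  # non-major allele counts at four binned loci
for model in ("additive", "dominant", "recessive"):
    print(model, bin_score(genotypes, model))
```

For this individual the additive score counts all four alleles, the dominant score counts the three loci carrying at least one allele, and the recessive score counts only the single homozygous locus.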
The power of BioBin becomes apparent in the flexibility provided to the user, which makes the software applicable in a number of low frequency variant analysis pipelines. In addition to the predefined biologically-informed binning strategies, BioBin allows for customized knowledge, adjustable multi-level feature types, filtering strategies and individual variant weighting.
LOKI contains diverse knowledge from many databases, which together provide variant details, region annotations, and group relationships. To accommodate a wide variety of analyses, the user can choose to include or exclude any source in LOKI. Additionally a user can expand on the predefined knowledge contained within this biorepository as LOKI specification and code are open source allowing the addition of desired database sources. For instance, users may specify additional knowledge through the use of plain text files that can define regions, group or variant weights, and roles. Examples of these input files are provided in the BioBin manual (https://ritchielab.com/software/biobin-download). As part of the customization available, BioBin also accepts custom role files, which contain single variant or region annotations. This file can be used to exclude or specifically include variants based on the results returned from an annotation tool such as Polyphen, SIFT, or SNPEff [24–26].
To adjust statistical power in a rare variant analysis, BioBin provides the option of weighting loci according to the weighted sum statistic proposed by Madsen and Browning [27], in which the weight of a variant is inversely related to its MAF. Multiple weighting schemes are provided which use different populations to calculate these locus weights. For instance, in control weighting, the weight is calculated based only on the control population. This weighting represents an exact implementation of Madsen and Browning weighting [27]. Because determining allele rarity solely on the control population has been shown to potentially inflate type I error [28, 29], BioBin implements other weight models, giving the user a means to utilize variant weighting while controlling this error. In the maximum model, the weight is the maximum calculated for the case and control populations, while the minimum model uses the minimum weight in these populations. Overall weighting calculates the weight using the entire overall population, regardless of case or control status. The overall weighting scheme is nearly equivalent to the Madsen and Browning weighting implementation in SKAT [30, 31]. These methods will be equivalent in the circumstance where there are no cases, or where the case or control population is completely missing for a given locus. Finally, BioBin can also incorporate custom weights based on the user’s prior knowledge.
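The four population-based schemes can be compared with a brief Python sketch. It uses the weight from the Madsen and Browning weighted sum statistic, 1/sqrt(n q (1 - q)) with the smoothed frequency q = (m + 1)/(2n' + 2), where m is the allele count in the population used to estimate the frequency, n' is the size of that population, and n is the total sample size. How BioBin scales the weight for the non-control schemes is an assumption here, and all function and parameter names are hypothetical.

```python
import math


def mb_weight(allele_count, n_pop, n_total):
    """Weight 1 / sqrt(n_total * q * (1 - q)), with q estimated in one population."""
    q = (allele_count + 1) / (2 * n_pop + 2)  # smoothed allele frequency
    return 1.0 / math.sqrt(n_total * q * (1.0 - q))


def locus_weight(case_count, n_cases, control_count, n_controls, scheme="overall"):
    """Locus weight under the control, maximum, minimum or overall scheme."""
    n_total = n_cases + n_controls
    w_case = mb_weight(case_count, n_cases, n_total)
    w_control = mb_weight(control_count, n_controls, n_total)
    if scheme == "control":
        return w_control
    if scheme == "maximum":
        return max(w_case, w_control)
    if scheme == "minimum":
        return min(w_case, w_control)
    if scheme == "overall":
        return mb_weight(case_count + control_count, n_total, n_total)
    raise ValueError(f"unknown scheme: {scheme}")


# A variant seen 9 times in 500 cases and once in 500 controls.
for scheme in ("control", "maximum", "minimum", "overall"):
    print(scheme, round(locus_weight(9, 500, 1, 500, scheme), 4))
```

Because this variant is rarer in controls than in cases, the control and maximum schemes give it the largest weight, the minimum scheme the smallest, and the overall scheme a value in between. This shows how a variant enriched in cases is up-weighted most strongly when its rarity is judged on controls alone.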
Table 1 Simulation parameters. Parameters for the type I error analysis and the power analysis simulations performed using SeqSIMLA2
Parameter | Type I error analysis | Power analysis
Bin size assessed | Gene-sized bin: 25 kb (50 ± 10 variants); XL gene-sized bin: 100 kb (200 ± 10 variants); Pathway-sized bin: 2–50 gene-sized bins (100–2500 ± 10 variants) | Gene-sized bin: 25 kb (50 ± 10 variants)
Number of simulations | 1000 | 1000
Sample size | 500 cases, 500 controls | 500 cases, 500 controls
Number of causal variants | | 10
Odds ratio (OR) | 1 | 1.25, 1.5, 1.75, 2, 2.5, 3, 4, 5
Weighting | Control only weighting | Control only weighting
Type I error analysis
Parameters for the type I error simulation analysis are listed in the left pane of Table 1. Type I error was assessed by performing three different tests, each varying in the size of the biological bin, as we attempted to simulate datasets that roughly correspond to gene-level and pathway-level analyses. The choice of size for gene-based simulations is largely debated, and we decided to test three different bin sizes to accommodate various binning analyses, and to explore the relationship between bin size and type I error. These tests include a 25 kb gene-sized bin (referred to as average gene) composed of 50 variants (standard deviation = 5), a large 100 kb gene-sized bin (referred to as XL gene throughout this work) composed of 200 variants (standard deviation = 5), and a pathway bin composed of 2–50 gene-sized bins, or 100–2500 variants (standard deviation = 5). We chose 50 variants to represent an average sized gene bin by consulting the autosomal variant site statistics reported by the 1000 Genomes Project [14, 33] and calculating a rough estimate for the number of possible variants expected in 25 kb, an approximation for median gene size. For each simulation, the specific number of variants was randomly determined. For example, each pathway dataset simulation could contain anywhere from 100 to 2500 variants. Type I error was estimated with 1000 null dataset simulations for each bin size using an odds ratio (OR) of 1, and assessing significance with an α of 0.05 for both regression and Wilcoxon.
To assess the statistical power of each weighting method, power analyses were performed with 1000 simulations of an average sized 25 kb gene bin, containing 50 variants (standard deviation = 5), as described in the right pane of Table 1. For each simulation, 10 causal variants or disease sites were randomly selected in the binned locus. Eight independent simulation tests were performed for each weighting scheme in which the OR of the causal variants was varied as 1.25, 1.5, 1.75, 2, 2.5, 3, 4, and 5. Power was assessed for each of these OR analyses with logistic regression and Wilcoxon using a significance criterion of 0.05.
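Both type I error and power are estimated in the same way: as the fraction of simulated datasets in which the test rejects at the chosen α. The Python sketch below illustrates that calculation with a toy generator and a Wilcoxon rank sum test using the normal approximation. It is not the SeqSIMLA2 procedure used in this work: the generator draws unweighted additive bin scores from independent alleles, and the alternative simply raises the allele frequency in cases instead of applying an odds ratio to causal variants. All names and parameter values are assumptions made for the example.

```python
import math
import random


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank sum p-value (normal approximation, tie-corrected)."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0, 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    mean = n1 * (n + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return 1.0
    z = (rank_sum_x - mean) / math.sqrt(var)
    return math.erfc(abs(z) / math.sqrt(2))


def draw_scores(n, n_variants, maf, rng):
    """Additive bin scores: number of rare alleles over 2 * n_variants draws."""
    return [sum(rng.random() < maf for _ in range(2 * n_variants)) for _ in range(n)]


def rejection_rate(n_sims, case_maf, control_maf, n_per_group=100,
                   n_variants=20, alpha=0.05, seed=1):
    """Fraction of simulated datasets with p < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        cases = draw_scores(n_per_group, n_variants, case_maf, rng)
        controls = draw_scores(n_per_group, n_variants, control_maf, rng)
        if rank_sum_p(cases, controls) < alpha:
            rejections += 1
    return rejections / n_sims


print("type I error estimate:", rejection_rate(200, 0.01, 0.01))
print("power estimate:", rejection_rate(100, 0.03, 0.01))
```

With equal allele frequencies in both groups the rejection rate estimates the type I error and should sit near α; with a higher frequency in cases it estimates power under that toy alternative.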
BioBin is an innovative variant collapsing method that provides a flexible infrastructure for biologically informed variant binning adaptive to individual user needs. In this work, we evaluated four weighting schemes provided within BioBin: control, minimum, maximum and overall weighting, in addition to the no locus weighting option. These weighting methods were examined using two standard burden tests: regression and the Wilcoxon rank sum. While multiple studies have performed exhaustive comparisons of statistical tests for rare variant analyses [35–37], the focus of BioBin is to build versatile and biologically relevant bins rather than to implement a particular statistical analysis. BioBin can provide the necessary files for a user to implement his or her statistical test of choice; this provides the user with freedom to choose the statistical test that is most appropriate for his/her hypothesis. We chose to specifically focus on regression and the Wilcoxon rank sum test as these are very commonly used methods in rare variant analyses [27, 38–41].
Type I error analysis
No correlation between significance and bin size (except with control weighting)
Type I error results. The Type I error simulation results displayed per BioBin weighting scheme tested, biological bin size assessed, and statistical analysis test
Correlation of bin size and significance. Using the control weighting, the larger bins result in a higher chance of a false positive finding, showing a correlation between bin size and p-value. All other weighting strategies have false positive rates independent of bin size
In the present simulations, the no loci weighting option in BioBin presents as statistically more powerful than both overall and minimum weighting. We believe this to be a result of the specific simulation parameters chosen for this analysis, and would likely be altered by the number of binned loci, the allele frequencies of the variants, the direction of the variant effect, and the sample size. Additional simulations were performed in an attempt to demonstrate the influence of chosen parameters on our simulation analyses. We performed comparable power analyses to those noted above, but restricted the selection of variants to only those having a MAF below 5%, thereby causing all selected disease sites to be binned, and increased the number of causal variants to 20 (standard deviation = 5). The results of this analysis show that simulations without loci weighting had the lowest power across all tested ORs (1.25, 1.75, 2.5, 4 and 5) when compared with all other weighting methods. These results suggest that weighting approaches may have a larger influence on power when the selected disease sites are rare, since different results were observed when disease sites with probabilities inversely proportional to the MAF are chosen. Overall, the power results are heavily influenced by simulation methodology, and future work will aim at performing a thorough sweep of simulation parameters and weighting methods in BioBin.
We have performed a preliminary study on incorporating select burden and dispersion-based statistical tests as well as multiple phenotype analysis capabilities into the framework of BioBin. Future work will include comprehensive testing of burden and dispersion methods as well as dissemination of an updated BioBin software package, BioBin 2.2.0, with these additional features.
Overall, BioBin is a powerful and versatile tool for the knowledge-guided biological binning and analysis of low frequency variants in sequence data. BioBin uses a diverse repository of data from a multitude of public sources, and thereby circumvents the necessity of manually curating biologically important data for variant collapsing. BioBin provides users with a flexible and customizable framework to analyze sequence data and uncover novel associations with complex traits.
NIH grants LM010040, HL065962 and the Pennsylvania Department of Health Tobacco CURE Funds were used in the design, analysis, interpretation of data, and in writing the manuscript.
Availability of data and materials
Project name: BioBin
Project home page: http://ritchielab.com/software/biobin-download
Operating system: Linux
Programming language: C++
Other requirements: Boost Libraries for C++, version 1.42 or later; SQLite, version 3.5.4 or later; Python, version 2.7; suds for Python, version 0.4 or later; apsw for Python; Please see manual for most up to date requirements: http://ritchielab.com/files/RL_software/biobin-manual-2.2.pdf
License: GPL, version 3
Any restrictions to use by non-academics: GPL, version 3
Simulation data: Available upon request. The custom script used for generation of reference sequence files is included in the supplemental material.
Programming for BioBin was performed by JRW; writing of the code for LOKI was performed by ATF. CBM and MDR have made substantial contributions to conception and design of this software. Simulation analyses were performed by AOB. CBM, AOB, JRW, ATF and MDR participated in drafting and revising the manuscript and have given final approval of the version to be published.
The authors declare that they have no competing interests.
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Cruchaga C, Chakraverty S, Mayo K, Vallania FLM, Mitra RD, Faber K, et al. Rare variants in APP, PSEN1 and PSEN2 increase risk for AD in late-onset Alzheimer’s disease families. PLoS ONE. 2012;7:e31039.
- Cruchaga C, Karch CM, Jin SC, Benitez BA, Cai Y, Guerreiro R, et al. Rare coding variants in the phospholipase D3 gene confer risk for Alzheimer’s disease. Nature. 2014;505:550–4.
- Schulte EC, Fukumori A, Mollenhauer B, Hor H, Arzberger T, Perneczky R, et al. Rare variants in β-Amyloid precursor protein (APP) and Parkinson’s disease. Eur J Hum Genet. 2015;23:1328–33.
- Ramachandrappa S, Raimondo A, Cali AMG, Keogh JM, Henning E, Saeed S, et al. Rare variants in single-minded 1 (SIM1) are associated with severe obesity. J Clin Invest. 2013;123:3042–50.
- Bronzetti E, Artico M, Forte F, Pagliarella G, Felici LM, D’Ambrosio A, et al. A possible role of BDNF in prostate cancer detection. Oncol Rep. 2008;19:969–74.
- Wang Y, McKay JD, Rafnar T, Wang Z, Timofeeva MN, Broderick P, et al. Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer. Nat Genet. 2014;46:736–41.
- Witte JS. Rare genetic variants and treatment response: sample size and analysis issues. Stat Med. 2012;31:3041–50.
- Stitziel NO, Kiezun A, Sunyaev S. Computational and statistical approaches to analyzing variants identified by exome sequencing. Genome Biol. 2011;12:227.
- Do R, Kathiresan S, Abecasis GR. Exome sequencing and complex disease: practical aspects of rare variant association studies. Hum Mol Genet. 2012;21:R1–9.
- Moore CB, Wallace JR, Frase AT, Pendergrass SA, Ritchie MD. BioBin: a bioinformatics tool for automating the binning of rare variants using publicly available biological knowledge. BMC Med Genomics. 2013;6:S6.
- Moore CB, Wallace JR, Wolfe DJ, Frase AT, Pendergrass SA, Weiss KM, et al. Low frequency variants, collapsed based on biological knowledge, uncover complexity of population stratification in 1000 genomes project data. PLoS Genet. 2013;9:e1003959.
- Basile AO, Wallace JR, Peissig P, McCarty CA, Brilliant M, Ritchie MD. Knowledge driven binning and phewas analysis in marshfield personalized medicine research project using Biobin. Pac Symp Biocomput. 2016;21:249–60.
- Kim D, Li R, Dudek SM, Wallace JR, Ritchie MD. Binning somatic mutations based on biological knowledge for predicting survival: an application in renal cell carcinoma. Pac Symp Biocomput. 2015;96–107.
- The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65.
- Rasmussen-Torvik LJ, Stallings SC, Gordon AS, Almoguera B, Basford MA, Bielinski SJ, et al. Design and anticipated outcomes of the eMERGE-PGx project: a multicenter pilot for preemptive pharmacogenomics in electronic health record systems. Clin Pharmacol Ther. 2014;96:482–9.
- Pendergrass SA, Frase A, Wallace J, Wolfe D, Katiyar N, Moore C, et al. Genomic analyses with biofilter 2.0: knowledge driven filtering, annotation, and model development. BioData Min. 2013;6:25.
- NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2013;41:D8–20.
- Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012;40:D109–14.
- Milacic M, Haw R, Rothfels K, Wu G, Croft D, Hermjakob H, et al. Annotating cancer variants and anti-cancer therapeutics in reactome. Cancers. 2012;4:1180–211.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.
- Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic Acids Res. 2014;42:D222–30.
- Kandasamy K, Mohan S, Raju R, Keerthikumar S, Kumar GSS, Venugopal AK, et al. NetPath: a public resource of curated signal transduction pathways. Genome Biol. 2010;11:R3.
- Bush WS, Dudek SM, Ritchie MD. Biofilter: A Knowledge-Integration System for the Multi-Locus Analysis of Genome-Wide Association Studies. Pac Symp Biocomput. 2009;368–79.
- Adzhubei I, Jordan DM, Sunyaev SR. Predicting Functional Effect of Human Missense Mutations Using PolyPhen-2. In: Haines JL, Korf BR, Morton CC, Seidman CE, Seidman JG, Smith DR, editors. Curr. Protoc. Hum. Genet. [Internet]. Hoboken: John Wiley & Sons, Inc; 2013. p. 7.20.1–7.20.41. [cited 2015 Oct 27]. Available from: http://doi.wiley.com/10.1002/0471142905.hg0720s76
- Ng PC, Henikoff S. Predicting deleterious amino acid substitutions. Genome Res. 2001;11:863–74.
- Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6:80–92.
- Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5:e1000384.
- Lemire M. Defining rare variants by their frequencies in controls may increase type I error. Nat Genet. 2011;43:391–2.
- Pearson RD. Bias due to selection of rare variants using frequency in controls. Nat Genet. 2011;43:392–3. author reply 394–5.
- Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence Kernel association test. Am J Hum Genet. 2011;89:82–93.
- Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostat Oxf Engl. 2012;13:762–75.
- Chung R-H, Tsai W-Y, Hsieh C-H, Hung K-Y, Hsiung CA, Hauser ER. SeqSIMLA2: simulating correlated quantitative traits accounting for shared environmental effects in user-specified pedigree structure. Genet Epidemiol. 2015;39:20–4.
- The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74.
- Fuchs G, Voichek Y, Benjamin S, Gilad S, Amit I, Oren M. 4sUDRB-seq: measuring genomewide transcriptional elongation rates and initiation frequencies within cells. Genome Biol. 2014;15:R69.
- Dering C, König IR, Ramsey LB, Relling MV, Yang W, Ziegler A. A comprehensive evaluation of collapsing methods using simulated and real data: excellent annotation of functionality and large sample sizes required. Front Genet [Internet]. 2014;5:323. [cited 2015 Jul 13]. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4164031/
- Bacanu S-A, Nelson MR, Whittaker JC. Comparison of statistical tests for association between rare variants and binary traits. PLoS ONE. 2012;7:e42530.
- Clarke GM, Rivas MA, Morris AP. A flexible approach for the analysis of rare variants allowing for a mixture of effects on binary or quantitative traits. PLoS Genet. 2013;9:e1003694.
- Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83:311–21.
- Asimit JL, Day-Williams AG, Morris AP, Zeggini E. ARIEL and AMELIA: testing for an accumulation of rare variants using next-generation sequencing data. Hum Hered. 2012;73:84–94.
- Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010;34:188–93.
- Lee S, Abecasis GR, Boehnke M, Lin X. Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet. 2014;95:5–23.