- Research
- Open access
The goldmine of GWAS summary statistics: a systematic review of methods and tools
BioData Mining volume 17, Article number: 31 (2024)
Abstract
Genome-wide association studies (GWAS) have revolutionized our understanding of the genetic architecture of complex traits and diseases. GWAS summary statistics have become essential tools for various genetic analyses, including meta-analysis, fine-mapping, and risk prediction. However, the increasing number of GWAS summary statistics and the diversity of software tools available for their analysis can make it challenging for researchers to select the most appropriate tools for their specific needs. This systematic review aims to provide a comprehensive overview of the currently available software tools and databases for GWAS summary statistics analysis. We conducted a comprehensive literature search to identify relevant software tools and databases. We categorized the tools and databases by their functionality, including data management, quality control, single-trait analysis, and multiple-trait analysis. We also compared the tools and databases based on their features, limitations, and user-friendliness. Our review identified a total of 305 functioning software tools and databases dedicated to GWAS summary statistics, each with unique strengths and limitations. We provide descriptions of the key features of each tool and database, including their input/output formats, data types, and computational requirements. We also discuss the overall usability and applicability of each tool for different research scenarios. This comprehensive review will serve as a valuable resource for researchers who are interested in using GWAS summary statistics to investigate the genetic basis of complex traits and diseases. By providing a detailed overview of the available tools and databases, we aim to facilitate informed tool selection and maximize the effectiveness of GWAS summary statistics analysis.
Background
Genome-wide association studies (GWAS) enable the simultaneous testing of thousands of genetic variants, usually SNPs, across the genome in order to find variants associated with a trait or a disease [1]. The GWAS methodology, so far, has generated many robust associations for various traits and diseases and has revolutionized our understanding of the genetic architecture of complex traits. With increasing sample sizes, new sequencing technologies and the accumulation of large biobanks it is expected that our ability to investigate the effects of human genetic variation in complex traits will increase in the near future [2]. In the first years of the development of the field, efforts were oriented towards the statistical aspects of the analysis [3], which involved thousands of SNPs simultaneously, including the methodology for multiple testing and quality control. This task was successful and enabled the discovery of associations replicated in subsequent studies, and in several cases, validated experimentally and functionally using a wide variety of methods [4]. However, it was soon clear that most variants discovered via GWAS have small overall effects on disease susceptibility [5]. Thus, it became evident that integrating data from multiple sources and developing reliable bioinformatics tools was a necessary step in order to address the complexity of the underlying genetic basis of common human diseases [5].
Soon after the publication of the first GWAS it also became evident that, at least theoretically, individuals could be identified in such cohorts even if only the summary statistics are available [6]. This led to strict access controls on the sharing of individual patients’ data (IPD) from GWAS. Subsequent works found that privacy attacks are possible in theory but unsuccessful and unconvincing in practice. For instance, even sharing 1,000 SNPs for datasets with more than 500 individuals generally leads to a low power of the “attack” [7]. A more thorough investigation is given in [8]. In practice, however, not all studies share their data, at least when it comes to the studies published in the first decade of GWAS. It has been estimated that the proportion is only 13%, having increased from 3% in 2010 to 23% in 2017 [9]. Yet researchers who share their summary data have been shown to receive on average 81.8% more citations, an effect that is probably related, at least partially, to the usability of the data in downstream analyses [10]. Summary statistics not only offer additional protection of privacy, but also offer significant advantages in computational cost in downstream analyses, since this cost does not scale with the number of participants in the study [11]. Thus, it is no surprise that in recent years a large variety of methods have been developed to perform a so-called post-GWAS analysis using the summary results of a single study, or of several studies, and in most cases integrating data from other sources [11]. The majority of these methods use the summary data in the form of per-allele SNP effect sizes (log odds ratios or betas) along with their standard errors, or equivalently the z-scores (per-allele effect sizes divided by their standard errors).
These methods seek to go a step beyond the simple analysis, or re-analysis, of a study, and aim to improve our understanding of the functional role of the identified variants [12]. The most important factors that played a significant role in the development of such methods, in this so-called post-GWAS era, are the linkage disequilibrium (LD) information from a population reference panel such as HapMap or the 1000 Genomes Project, the gene expression variation in the form of eQTL, and the integration of functional information on biological pathways [13,14,15].
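As a minimal illustration of the z-score mentioned above (per-allele effect size divided by its standard error), the following Python sketch computes the z-score and the corresponding two-sided p-value under a standard normal approximation. The function name and the example values are ours and purely illustrative; they are not taken from any of the reviewed tools.

```python
import math
from statistics import NormalDist


def z_and_p(effect, std_err):
    """Return the z-score and two-sided p-value for a per-allele effect.

    `effect` is a beta or a log odds ratio; `std_err` is its standard error.
    """
    if std_err <= 0:
        raise ValueError("standard error must be positive")
    z = effect / std_err
    # Two-sided p-value from the standard normal distribution.
    p = 2.0 * NormalDist().cdf(-abs(z))
    return z, p


# Hypothetical SNP: odds ratio 1.20, standard error 0.05 on the log scale.
z, p = z_and_p(math.log(1.20), 0.05)
print(f"z = {z:.3f}, p = {p:.2e}")
```

For very large |z| the p-value underflows to zero in double precision, so tools that need to rank extreme associations usually work with z-scores or log p-values instead.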
The methods developed so far cover a broad range of different types of analysis, either in the study of a single trait or in the combined analysis of multiple traits. For a single trait, we may have methods for meta-analysis [16, 17], methods for inferring heritability [18, 19], gene-based tests [20], methods for Gene Set (or Pathway) Analysis [21], or methods for fine-mapping causal variants [22]. Regarding the analysis of multiple traits there is also a variety of methods [23], ranging from those that estimate the genetic correlation between traits [24], the joint analysis of multiple traits [25], or the methods that try to estimate causality between traits such as Mendelian Randomization [26], transcriptome-wide association studies [27], or colocalization [28]. Of course, the data standards [29] used to facilitate these analyses and the databases that the results are stored in, are also of great importance for the community.
In order to provide a comprehensive overview of the currently available software tools and databases for GWAS summary statistics we performed a systematic review following the PRISMA guidelines [30]. We conducted a comprehensive search of the literature to identify relevant software tools and databases. We categorized the tools and databases by their functionality, in categories related to data, single-trait analysis, and multiple-trait analysis, along with their sub-categories mentioned in the previous paragraph. We also compared the tools and databases based on their features, limitations, and user-friendliness. Our review identified a wide range of software tools and databases for GWAS summary statistics analysis, each with unique strengths and limitations. We provide descriptions of the key features of each tool and database, including their input/output formats, data types, and computational requirements. We also discuss the overall usability and applicability of each tool for different research scenarios. This comprehensive review will serve as a valuable resource for researchers who are interested in using GWAS summary statistics to investigate the genetic basis of complex traits and diseases. By providing a detailed overview of the available tools and databases, we aim to facilitate informed tool selection and maximize the effectiveness of using GWAS summary statistics.
The systematic review
In order to collect all the available published papers, we performed a systematic review of the literature following the PRISMA guidelines [30]. The search was performed in PubMed (https://pubmed.ncbi.nlm.nih.gov) with the following query: ("Summary Statistics" OR "Summary Data" OR "Summary Association Statistics" OR "Summary Association Data") AND (GWAS OR genomewide OR genome-wide). First the abstracts and then the full articles were scrutinized in order to collect the necessary information. Methods, software tools and databases suitable for the analysis of GWAS summary data were eligible for inclusion. Methods papers that do not report software, and software whose pages are no longer available, were excluded. Additional searches were performed in the reference lists of the identified articles in order to identify studies that were missed. In many cases multiple articles regarding a single tool were found, so we kept only one. We decided to include reports deposited in preprint servers like medRxiv/bioRxiv, but some of these papers were eventually published in peer-reviewed journals, so in such cases we retained only the latter reference. Tools regarding Polygenic Risk Scores (PRSs) and visualization were excluded. For all included tools we recorded the URL, the PMID, and the main functionality or functionalities, along with comments regarding the main methodological features. The initial search identified 2942 articles (22/12/2023).
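For readers who wish to repeat the search programmatically, the query above can be sent to the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and performs no network access. The review states only that the search was run in PubMed, so the use of E-utilities, the function name and the `retmax` value are our assumptions.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# The search string exactly as reported in the text.
QUERY = (
    '("Summary Statistics" OR "Summary Data" OR '
    '"Summary Association Statistics" OR "Summary Association Data") '
    "AND (GWAS OR genomewide OR genome-wide)"
)

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, retmax=100):
    """Build an ESearch URL for PubMed (hypothetical helper, no request is sent)."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


url = build_search_url(QUERY)
print(url)
```

The number of hits returned today will differ from the 2942 articles reported above, since PubMed is continuously updated.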
In total we identified 305 tools and databases (Fig. 1). We classified them in three broad categories: data, tools for single traits and tools for multiple traits, along with the various sub-categories. The total breakdown is given in Table 1. Several tools may perform different tasks and thus they can be considered for more than one category; we therefore assigned them to the one most closely related to the primary goal of the analysis they claim to perform. Other tools do not exactly fit the general description of a category, but we nevertheless assigned them to the most similar one. The largest sub-category consists of the tools for pleiotropy analysis, whereas the smallest one is related to reconstruction of genotypes and effect sizes. Most tools are written in R (56.4%) with the largest proportion being in the multiple traits category, followed by Python (12.5%) and C/C++ (8.2%) (Fig. 2). Apart from the publicly available databases only a handful of tools are offered as webservers (6.95%). Most of the tools were published after 2015 (Fig. 3). Nearly 60% of the tools and databases were published in: Bioinformatics, American Journal of Human Genetics, Nature Genetics, Nature Communications, Nucleic Acids Research and PLoS Genetics (Fig. 4). In the following sections we proceed with the detailed description of the various tools identified, classified in the different categories and sub-categories. The complete list of identified tools along with the relevant information is given in Supplementary Table 1.
The data
We first present the tools dedicated to the data themselves. We include here tools for quality control of GWAS summary statistics, tools for imputation and genotype reconstruction, as well as the publicly available databases of summary results.
Standards and quality control
The need for sharing and re-using GWAS summary statistics has been an issue for the community in recent years. It is generally accepted that the minimum (“mandatory”) information contained in GWAS summary statistics should include: the chromosome and the base-pair location, the p-value of the association, the risk allele and the other allele, the risk allele frequency, and an estimate of the effect size (odds ratio or beta) along with its standard error [29]. Other important fields, termed “encouraged”, include the sample size, the variant ID, the rsID, the confidence interval of the effect size and so on. Such specifications were considered for the GWAS-SSF format [31], which was developed to meet the requirements set by the community. GWAS-SSF consists of a tab-separated data file with well-defined fields and an accompanying metadata file. Most repositories and programs use some variant of the GWAS-SSF. However, such tabular formats in several cases lead to ambiguity or incomplete storage of information, and in other cases lack essential metadata. This leads to poor performance and an increased risk of errors in downstream analyses. To address these issues, an adaptation of the well-known variant call format (VCF) [32], called GWAS-VCF, was developed to store GWAS summary statistics, along with software tools to apply it in downstream analyses [33]. A VCF file contains a header with metadata and a main body containing variant-level (one locus per row with one or more alternative alleles/variants) and sample-level (one sample per column) information. GWAS-VCF adapts this layout to include GWAS-specific metadata and uses the sample column to store variant-trait association data. The GWAS-VCF is the standard used by the MRC-IEU OpenGWAS database [34] and it comes with appropriate tools to map GWAS summary statistics to VCF with on-the-fly harmonization (https://github.com/mrcieu/gwas2vcf).
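The list of mandatory fields above lends itself to a simple automated check. The following Python sketch reads the header of a tab-separated summary statistics file and reports which mandatory fields are absent. The column names are hypothetical placeholders chosen by us for illustration; a real file should be checked against the field names of the format specification it claims to follow.

```python
import csv
import io

# Hypothetical column names for the mandatory fields listed in the text.
REQUIRED_COLUMNS = [
    "chromosome",
    "base_pair_location",
    "effect_allele",
    "other_allele",
    "effect_allele_frequency",
    "standard_error",
    "p_value",
]
# The effect size may be reported as a beta or as an odds ratio.
EFFECT_COLUMNS = ("beta", "odds_ratio")


def missing_mandatory_fields(handle):
    """Return the mandatory fields absent from a tab-separated file header."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if not any(name in header for name in EFFECT_COLUMNS):
        missing.append(" or ".join(EFFECT_COLUMNS))
    return missing


# Toy file lacking the standard error column.
example = io.StringIO(
    "chromosome\tbase_pair_location\teffect_allele\tother_allele\t"
    "effect_allele_frequency\tbeta\tp_value\n"
    "1\t12345\tA\tG\t0.31\t0.02\t0.04\n"
)
print(missing_mandatory_fields(example))
```

On the toy input the function reports `standard_error` as the only missing field. A check of this kind covers only the presence of columns; the dedicated quality control tools described next also validate the values themselves.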
Despite these efforts, not all available data are in line with the standards, especially when dealing with data from older studies. Thus, there is a need for additional tools to harmonize the data, as well as to identify and correct errors. Tools belonging to the former class were developed early and focused mainly on harmonizing data in preparation for a meta-analysis. These include QCGWAS [35], GWAtoolbox [36] and EasyQC [37]. GEAR [38] is particularly interesting in that it incorporates ideas from population genetics which allow verification of the genetic origin and geographic location of each cohort and identification of significant sample overlap. More recent tools like MungeSumstats [39] and GWASlab [40] perform standardization and quality control, handling the most common formats; SumStatsRehab [41] can be used for data validation, restoration of missing data, and correction of errors or formatting; and GWASinspector [42] provides extensive QC reports and performs harmonization, being compatible with recent reference panels and handling insertion/deletion and multi-allelic variants. The latter class of methods additionally leverages information from the LD among SNPs. One such tool is GQS [43], which identifies suspicious regions and prevents erroneous interpretations by comparing the significance of the association for each SNP to its LD value for the reported index SNP. Similar functionalities are offered by DENTIST [44], which uses LD to detect and eliminate errors and disagreements between GWAS data and the LD reference panel. EXTminus23andMe [45] evaluates the quality of summary statistics after data removal and the suitability of the downsampled summary statistics for typical follow-up genetic analyses.
Databases
The publicly available biological databases played and continue to play a central role in bioinformatics and in biological research in general [46,47,48]. The same is the case for databases related to human research [49] and in particular those involved in GWAS [50]. The databases we identified can be roughly divided in two categories: databases that contain summary statistics from GWAS and databases that contain important secondary analyses on those data with some of the methods that we will describe in later sections.
Regarding the databases of the first category, NCBI’s dbGaP [51] was developed to contain the results of studies investigating the interaction of genotype and phenotype, which include GWAS. One of dbGaP’s primary objectives was to house individual-level GWAS data, but the database also contains summary data. Summary statistics are generally available to the public, whereas access to IPD requires varying levels of authorization. The NHGRI-EBI GWAS Catalog [52], which was established in 2008, has for years been considered the central repository of GWAS summary statistics. It is a high-quality curated collection of all published GWAS and, as of 2023-12-20, contains 6,680 publications, 566,798 top associations and 66,825 full summary statistics (Fig. 5). The database played an important role in the community efforts leading to the development of the GWAS-SSF format. GWAScentral [53], previously known as the Human Genome Variation (HGV) database of Genotype-to-Phenotype information, contains over 72.5 million P-values for over 5,000 studies, with over 7.4 million unique genetic markers involved in more than 1,700 unique phenotypes. The database contains data from several sources (including NHGRI-EBI GWAS Catalog, OpenGWAS, Japanese GWASdb, dbGaP, WTCCC and so on). The MRC-IEU OpenGWAS [34] is a more recent addition and contains 346 million genetic associations from 50,037 GWAS summary datasets. It contains complete data from various consortia and the UK Biobank and comes with many tools for harmonizing the data and storing them in the GWAS-VCF format. At the time of writing there are 4,126 binary traits, 725 metabolites, 3,371 proteins, 3,143 brain imaging phenotypes, and 3,217 other continuous phenotypes. In addition to the complete GWAS summary data, it also contains independent top hits for every dataset, totaling 116,918 independent signals, with 7,109 datasets having at least one hit.
GeneATLAS [54] and GBE [55] contain associations from the UK Biobank cohort. GeneATLAS currently contains data for 452,264 individuals, 778 traits and 30 million variants, whereas GBE contains summary statistics from over 750,000 individuals combining data from the UK Biobank, the Million Veteran Program and Biobank Japan. GTEx [56] and QTLbase [57] are the primary resources for xQTL data. The GTEx project has been expanded over time, and currently contains data on genetic associations for gene expression and splicing in 838 individuals in 49 tissues. QTLbase, similarly, contains genome-wide QTL summary statistics for many molecular traits across 95 tissue/cell types and multiple conditions. It contains tens of millions of significant genotype-molecular trait associations under different conditions. Other resources of this category, related to various large consortia (GIANT, WTCCC, PGC etc.) as well as other biobanks (FinnGen etc.), can be found in Supplementary Table 2.
The second category contains databases of important secondary analyses performed on GWAS summary statistics with some of the methods that we describe in detail in later sections, such as gene-based tests, heritability analysis, TWAS, colocalization and so on. TSEA-DB [58] and PCGA [59] use information from gene expression in various tissues to perform tissue or cell-type enrichment analysis of the GWAS association statistics. webTWAS [60] and COLOCdb [61] also use information on eQTL, but in a different fashion. webTWAS currently contains data for over 1,389 full GWAS for which it calculates the causal genes using single-tissue expression imputation (using MetaXcan and FUSION), or cross-tissue expression imputation (using UTMOST). COLOCdb, on the other hand, offers the most comprehensive colocalization analysis, integrating publicly available GWASs with different types of xQTL and different algorithms (COLOC, SMR). GWAS ATLAS [62] contains results of 4,756 GWAS from 473 unique studies across 3,302 unique traits accompanied by useful information obtained from downstream analysis. Each study is accompanied by MAGMA results (see also “gene-based tests”), SNP heritability estimation and genetic correlations with other traits in the database. GWASROCS [63], on the other hand, contains a large and comprehensive set of SNP-derived AUROCs and heritabilities. It currently includes 579 simulated populations (corresponding to 219 traits) and SNP data (odds ratio, risk allele frequency, and p-values) for 2,886 unique SNPs. Phenome-wide association studies (PheWAS) invert the idea of a GWAS by searching for phenotypes associated with specific variants across the range of thousands of human phenotypes, or the “phenome” [64,65,66]. Thus, it is expected that a PheWAS will need large databases of GWAS results. PhenoScanner [67] is the most complete such database, with publicly available results from over 65 billion associations and more than 150 million unique genetic variants.
Similar functionalities are offered also by OpenGWAS, GWAS ATLAS and PheWAS Catalog [68]. Lastly, we need to mention LD Hub [69], a centralized database of publicly available GWAS results for 173 diseases/traits which offers a web interface that automates the LD score regression (LDSC) analysis pipeline (see also “Genetic correlation”).
Imputation and genotype reconstruction
Although some of the methods for quality control mentioned previously can correct errors and alter the data, the methods used for imputation go one step further. As expected, imputation methods were developed initially for individual-level data, to handle studies genotyped on different platforms [70,71,72]. Such methods can infer missing genotypes using LD information from reference samples genotyped using denser arrays or sequencing. Genotype imputation increases the coverage of SNPs and thus can be used to increase statistical power, increase the accuracy of fine-mapping and harmonize the data in order to facilitate meta-analysis [70]. Several factors can influence the imputation accuracy: the sample size, the suitability of the reference panel for the particular sample, the genotyping chip and the allele frequency [71]. In general, however, these methods are time-consuming since they process individuals one at a time, and thus methods that impute the summary statistics directly were developed. These methods utilize only the information provided in the sample regarding the studied population (p-value, z-score or odds-ratio/beta) and require additional information regarding the LD structure. Nearly all methods perform a kind of multiple regression, assuming a multivariate normal distribution for the test statistics and using the theoretical result that the correlation of such test statistics equals the correlation of the corresponding variables [73], that is, the genotype correlation, which is available through the reference panel. Such methods include FAPI [74], ImpG [75], RAISS [76], DIST [77] and SSimp [78], with most of the differences lying in the choice of the reference panel and the exact details of the mathematical methods used to handle matrix inversions in the multivariate normal. DISSCO [79] uses a similar framework but allows for covariates.
Such methods may perform poorly in cases where the sample has a different LD structure compared to the reference panel. Thus, extensions such as DISTMIX [80] and ARDISS [81] were developed to handle mixed ethnicity cohorts, improving the imputation performance. Adapt-Mix [82] estimates the correlation structure in both admixed and non-admixed individuals using simulated and real data and allows the use of this matrix with other imputation methods. Other methods such as LS-meta [83] and LSimputing [84] offer additional advantages; LS-meta imputes both genetic and environmental components using information from additional omics-trait association summary data, whereas LSimputing implements a non-parametric method that allows for nonlinear SNP-trait associations and predictions in case a sample of IPD is available. Using the same principles, simGWAS [84] allows simulation of whole GWAS summary data, without generating individual data as an intermediate step.
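The multivariate normal argument behind summary statistics imputation can be sketched in a few lines. If the z-scores of typed and untyped SNPs are jointly normal with correlation given by the LD matrix, the conditional expectation of the untyped z-scores is the LD between untyped and typed SNPs, multiplied by the inverse of the LD among typed SNPs, multiplied by the typed z-scores. The NumPy sketch below is a generic illustration of this idea and not the implementation of any specific tool; the function name, the argument names and the ridge value are our assumptions.

```python
import numpy as np


def impute_z(z_typed, ld_typed, ld_untyped_typed, ridge=0.1):
    """Impute z-scores of untyped SNPs from typed ones using reference-panel LD.

    z_typed:          z-scores of the typed SNPs, shape (t,)
    ld_typed:         LD (correlation) matrix among typed SNPs, shape (t, t)
    ld_untyped_typed: LD between untyped and typed SNPs, shape (u, t)
    ridge:            value added to the diagonal to stabilise the inversion
    """
    z_typed = np.asarray(z_typed, dtype=float)
    ld_typed = np.asarray(ld_typed, dtype=float)
    ld_untyped_typed = np.atleast_2d(np.asarray(ld_untyped_typed, dtype=float))

    regularized = ld_typed + ridge * np.eye(len(z_typed))
    # Row i holds the regression weights of untyped SNP i on the typed SNPs.
    weights = np.linalg.solve(regularized, ld_untyped_typed.T).T
    z_imputed = weights @ z_typed
    # Variance of each imputed statistic explained by the typed SNPs,
    # often used as an imputation quality measure.
    r2_pred = np.einsum("ij,ij->i", weights, ld_untyped_typed)
    return z_imputed, r2_pred


# Toy example: two typed SNPs and one untyped SNP (all values hypothetical).
z_hat, quality = impute_z(
    z_typed=[3.0, 1.0],
    ld_typed=[[1.0, 0.5], [0.5, 1.0]],
    ld_untyped_typed=[[0.8, 0.4]],
)
print(z_hat, quality)
```

The ridge term reflects the matrix inversion issue noted above: LD matrices estimated from a finite reference panel are often ill-conditioned, and the tools differ in how they regularise them.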
Genotype reconstruction methods take a different approach. Given the summary statistics for a SNP (either directly measured or imputed), one can reconstruct the genotype counts that produced them. This offers many advantages, since with the reconstructed genotypes the researchers can perform additional analyses using other statistical methods suitable for grouped data and test different hypotheses [85]. For instance, one can calculate grouped Polygenic Risk Scores (PRS) [85], perform logistic regression for grouped data [85, 86], perform multivariate meta-analysis [87], or implement robust tests for association that are expected to work better when the underlying model of inheritance deviates from the additive one which is usually assumed [88, 89]. The details and the success of the reconstruction depend heavily on the available summary statistics. As one can easily understand, p-values and z-scores alone cannot be used, and one must rely on available effect sizes such as the odds ratio (OR). When the OR, the standard error and the sample size are given, methods are available in epidemiology that allow the reconstruction of the allelic 2×2 table [90]. If z-scores, confidence intervals or p-values are available, one can use them to obtain the standard error. React [85] uses an equivalent method relying on solving a system of nonlinear equations. If the allele frequency in one group (usually the controls) is also known, the allelic counts may easily be obtained with a simple calculation. In all cases the accuracy of the reconstruction may depend on the precision of the available summary statistics. After the allelic 2×2 table is reconstructed, it is straightforward to obtain the genotype counts, assuming HWE (which, as one might expect, adds another source of potential bias).
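The simplest case described above, in which the odds ratio, the allele frequency in controls and the group sizes are known, can be sketched as follows. The code also shows how a standard error on the log scale can be recovered from a confidence interval, and how genotype counts follow from an allele frequency under HWE. This is a minimal illustration under those assumptions, not the algorithm of React or of any other cited tool; all function names and example values are ours.

```python
import math
from statistics import NormalDist


def se_from_ci(lower, upper, level=0.95):
    """Standard error of log(OR) recovered from a confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def allelic_table(odds_ratio, control_freq, n_cases, n_controls):
    """Reconstruct the allelic 2x2 table (expected counts, not integers).

    Returns (a, b, c, d): risk and other alleles in cases, then in controls.
    """
    alleles_cases = 2 * n_cases
    alleles_controls = 2 * n_controls
    c = control_freq * alleles_controls
    d = alleles_controls - c
    # The odds of carrying the risk allele in cases equal OR times the odds in controls.
    case_odds = odds_ratio * control_freq / (1.0 - control_freq)
    case_freq = case_odds / (1.0 + case_odds)
    a = case_freq * alleles_cases
    b = alleles_cases - a
    return a, b, c, d


def hwe_genotype_counts(freq, n):
    """Expected genotype counts (risk/risk, risk/other, other/other) under HWE."""
    return n * freq**2, 2 * n * freq * (1 - freq), n * (1 - freq) ** 2


# Hypothetical SNP: OR = 1.5, risk allele frequency 0.2 in 1000 controls, 1000 cases.
a, b, c, d = allelic_table(1.5, 0.2, 1000, 1000)
print(a, b, c, d)
print(hwe_genotype_counts(a / (a + b), 1000))
```

The reconstructed counts are generally non-integer; rounding them, as well as rounding in the reported odds ratio or frequency, is one source of the loss of precision mentioned above.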
MetaSubtract [91] is a tool that analytically recreates the results of the validation cohort from meta-analysis summary statistics, allowing researchers to compute meta-analysis summary statistics that are independent of the validation cohort, without requiring access to the IPD. Spkmt [92] works in a similar fashion but for family data; it can be used to derive the summary statistics of one parent from the data of the offspring and the other parent. Finally, we need to mention two tools that work in somewhat different modes. OATH [93] is used to reproduce reported results from a GWAS and recover underreported results from alternative models with a different combination of nuisance parameters, whereas LMOR [94] transforms the genetic effects estimated under the Linear Mixed Model to Odds Ratios, relying only on summary statistics.
Analysis of a single trait
In this section we are going to present the various types of methods and tools dedicated to the analysis of a single trait. These include tools for meta-analysis, tools for the estimation of heritability, tools for implementing gene-based tests, gene set methods and fine mapping methods.
Meta-analysis
One of the most obvious uses of GWAS summary data is to combine them and perform a meta-analysis. Meta-analysis is the statistical procedure used to combine evidence from multiple studies in order to increase statistical power, and it is a methodology that has been widely used in medical research for decades [95]. A meta-analysis can be performed with various methods [16] using IPD or summary data; the former offers many advantages, but the latter is far easier to perform given the various restrictions imposed on sharing GWAS IPD and the logistical difficulties of such a project [17]. Moreover, given the large samples usually encountered in GWAS, it has been shown, both theoretically and empirically, that meta-analysis using summary statistics has the same efficiency as the joint analysis of IPD [96]. A compromise between these two extremes arises when a research group has access to individual-level genotype data of a limited sample size and wants to integrate these with existing summary data available in the databases. Such methods have been in use in epidemiology for years [97] and several tools have been developed especially for handling GWAS data, for instance IGESS [98], metaGIM [99] and LEP [100]. PolyGIM [101] can be applied with or without IPD and uses polytomous logistic regression to investigate disease subtype heterogeneity in situations when only summary data are available.
Regarding summary-data meta-analysis of GWAS, the most commonly used methods include standard methods, such as combining p-values, z-statistics or effect sizes like the Odds Ratio (for binary traits) or mean differences (for continuous traits) using fixed or random effects models [16, 102]. These statistical methods are straightforward to implement, and are available in general purpose statistical packages such as Stata and R. However, there are several specialized tools that facilitate the process and provide integration with useful bioinformatics or visualization functions. Such widely used tools include METAL [103], GWAMA [104] and PLINK [105]. Other tools are oriented to more specialized cases, offering advanced options. For instance, YAMAS performs meta-analysis including missing SNPs identified with LD without performing imputation [106] and rareMETALS [107] uses a partial correlation based score to perform meta-analysis in the presence of large amounts of missing values. There is also a class of tools which focus on the replication of GWAS and the combined analysis of data from primary and replication studies. Such tools include rfdr [108] and Jlfdr [109], which control for the False Discovery Rate (FDR), Rrate [110], which determines the sample size of the replication study and checks the consistency between the primary and the replication study, and MAJAR [111], which jointly tests prognostic and predictive effects in meta-analysis without the need for an independent cohort. metaGAP [112] is an online tool for calculating the statistical power of a meta-analysis of GWAS (Fig. 6). METACARPA works with overlapping or related samples, even when details of the overlap or relatedness are unknown [113], MAGENTA [114] performs meta-analysis with gene set enrichment analysis (GSEA), whereas GWASmeta [115] and MetABF [116] work in a Bayesian framework, calculating the Approximate Bayes Factor (ABF).
Other tools offer more advanced options such as meta-analysis with multiple traits (see also “multiple traits”), like nGWAMA [117], metaCCA [118], CPASSOC [119], metaUSAT [120] and CPBayes [114] (and its extension GCPBayes [121]), and others are designed for meta-analysis under different genetic models, like GWAR [89], which uses robust methods (like MIN2 or MAX) in order to handle the uncertainty in the underlying genetic model, or like the simulation tool [122], which implements an alternative strategy for the additive genetic model, simulating data for the individual studies. Finally, we need to mention sPLINK [123], which performs privacy-aware GWAS on distributed datasets, and XPEB [124], which is an empirical Bayes approach designed to improve the power of GWAS in minority populations by exploiting information from GWASs performed in populations of different origin.
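The standard approach of combining effect sizes under fixed or random effects models, mentioned at the beginning of this section, can be illustrated for a single SNP with inverse-variance weighting. The sketch below uses the DerSimonian-Laird estimator for the between-study variance in the random effects case; this particular estimator is our choice for illustration, since the text does not single one out, and the function name and example values are hypothetical.

```python
import math
from statistics import NormalDist


def inverse_variance_meta(effects, std_errs, random_effects=False):
    """Inverse-variance weighted meta-analysis of one SNP across studies.

    effects:  per-study betas or log odds ratios
    std_errs: their standard errors
    With random_effects=True the weights include a DerSimonian-Laird
    estimate of the between-study variance tau^2.
    """
    weights = [1.0 / se**2 for se in std_errs]
    pooled = sum(w * b for w, b in zip(weights, effects)) / sum(weights)
    # Cochran's Q statistic for heterogeneity (computed with fixed-effect weights).
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, effects))
    k = len(effects)
    tau2 = 0.0
    if random_effects and k > 1:
        c = sum(weights) - sum(w**2 for w in weights) / sum(weights)
        tau2 = max(0.0, (q - (k - 1)) / c)
        weights = [1.0 / (se**2 + tau2) for se in std_errs]
        pooled = sum(w * b for w, b in zip(weights, effects)) / sum(weights)
    se_pooled = math.sqrt(1.0 / sum(weights))
    z = pooled / se_pooled
    p = 2.0 * NormalDist().cdf(-abs(z))
    return {"beta": pooled, "se": se_pooled, "z": z, "p": p, "Q": q, "tau2": tau2}


# Three hypothetical studies reporting log odds ratios for the same SNP.
print(inverse_variance_meta([0.10, 0.15, 0.08], [0.04, 0.06, 0.05]))
print(inverse_variance_meta([0.10, 0.15, 0.08], [0.04, 0.06, 0.05], random_effects=True))
```

Specialized tools apply the same calculation to every SNP and add the steps that matter in practice, such as allele alignment across studies and genomic control.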
Inferring heritability
Heritability is generally defined as the fraction of phenotypic variation explained by genetic variation. Heritability is a dimensionless parameter of the population, and it was introduced by Sewall Wright and Ronald Fisher in the previous century. Traditionally, heritability is estimated using family-based designs such as twin studies. However, there are controversies regarding the various methodologies for estimation and the interpretation of the results [125]. Despite this, heritability is an important aspect of research in modern genetics and in the prediction of disease risk from genomic data [126]. Technological advancements have facilitated the development of methods that use large samples of unrelated, or related, individuals. Thus, family-based designs using genomic data (trio-genome-wide complex trait analysis, and so on) have emerged. Such methods are discussed and compared in [127]. Of course, heritability can also be estimated from the results obtained in a traditional GWAS using unrelated individuals. The gap between these estimates and those obtained from classical heritability estimation methods has been termed the "missing heritability problem" and it is an important open question in current research [128]. Recent reviews of the methods that use GWAS data are given in [18, 19], focusing on their modeling assumptions, their similarities, and their applicability.
One of the first and simplest methods, which calculates heritability from the allele frequency, the odds ratio and the prevalence of the disease, was implemented in the SumVg package [129]. This method, however, utilizes only the significant SNPs. The same authors later extended the method to allow calculation using the z-statistics from the whole GWAS sample [130]. A disadvantage of this method is that LD is not accounted for, and highly correlated SNPs need to be filtered manually. AVENGEME [131] is a tool that treats causal effect sizes as fixed effects and models the genotypes as random correlated variables. HESS [132], which was presented later, builds upon the same ideas; its estimator can be viewed as a weighted sum of the squares of the projections of effect sizes onto the eigenvectors of the LD matrix at the particular locus, with weights inversely proportional to the corresponding eigenvalues. LD Score Regression (LDSC) has been frequently applied to summary statistics from GWAS and one of its functionalities is to estimate the SNP heritability of a trait [133]. LDER [134] extends LDSC by making full use of the information in the LD matrix, providing more accurate estimates, whereas s-LDSC [135] is an extension suitable for partitioning heritability. SumHer [136] was presented later and offers the same functionalities, with the main difference being that it allows for different so-called “heritability models”. According to these, a SNP with high MAF is expected to contribute more to the total heritability than one with low MAF, whereas a SNP in a region of low LD is expected to contribute more than one in a region of high LD. In contrast, LDSC estimates are obtained by assuming that all SNPs contribute equally. HEELS [137] is a new tool that uses REML to produce accurate and precise local heritability estimates, and RSS, a multiple regression-based fine-mapping tool (see “Fine-mapping”), can also calculate SNP heritability from the regression model.
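The regression idea behind LDSC can be illustrated with a small simulation. Under the commonly stated LD score regression model, the expected chi-square statistic of SNP j is approximately N·h²·ℓ_j/M + 1 (plus a confounding term absorbed by the intercept), so regressing chi-square statistics on LD scores gives a slope of N·h²/M. The sketch below is a minimal illustration under stated assumptions: the LD scores and summary statistics are simulated, the function name `ldsc_h2` is hypothetical, and plain least squares is used instead of the weighting and block-jackknife standard errors of the actual LDSC software.

```python
import numpy as np

def ldsc_h2(chi2, ld_scores, n_samples, n_snps):
    """Unweighted LD score regression: chi2_j ~ intercept + slope * l_j.

    Returns the SNP-heritability estimate (slope * M / N) and the intercept.
    """
    X = np.column_stack([np.ones_like(ld_scores), ld_scores])
    (intercept, slope), *_ = np.linalg.lstsq(X, chi2, rcond=None)
    return slope * n_snps / n_samples, intercept

# Simulated inputs (all values are illustrative assumptions).
rng = np.random.default_rng(0)
M, N, true_h2 = 50_000, 100_000, 0.4
ld = rng.gamma(shape=2.0, scale=50.0, size=M)   # hypothetical LD scores

# Under the model, z_j has variance 1 + N * h2 * l_j / M.
z = rng.normal(0.0, np.sqrt(1.0 + N * true_h2 * ld / M))
chi2 = z ** 2

h2_hat, intercept_hat = ldsc_h2(chi2, ld, N, M)
print(f"estimated h2 = {h2_hat:.3f}, intercept = {intercept_hat:.2f}")
```

An intercept clearly above 1 in real data is usually read as a sign of confounding rather than polygenicity, which is why the intercept is estimated rather than fixed in this sketch.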
VarExp [138] and GxESum [139] are methods for estimating the phenotypic variance explained by genome-wide gene-environment (GxE) interactions. There are also tools like GWIZ [63] and SummaryAUC [140] that calculate the Receiver Operating Characteristic (ROC) curve and the associated Area Under the Curve (AUC). GWIZ generates ROC curves and the AUC using simulations and then estimates heritability using the square of Somers’ rank correlation D. SummaryAUC, on the other hand, approximates the AUC of a PRS and its variance. HAMSTA [141] is a tool that, among other things, estimates the heritability explained by local ancestry using data from admixture mapping studies. Estimating the effect-size distribution is a related and important task. GENESIS [142] uses LD and a likelihood-based approach to estimate effect-size distributions. It also allows predictions regarding the yield of future GWAS with larger sample sizes. GWEHS [143] calculates the distribution of effect sizes of SNPs, as well as their contribution to trait heritability. Furthermore, it predicts the change in the effect size as well as in the heritability when new variants are identified. FMR [144] is a method-of-moments approach for calculating the effect-size distribution and GWAS-Causal-Effects-Model [145] is a random effects model for estimating the causal variants and their effect-size distribution. Finally, there are tools that implicate gene expression in heritability analysis: MESC [146], which estimates the proportion of heritability mediated by gene expression levels using linkage disequilibrium (LD) scores and eQTL, and GCSC [147], which uses results from a TWAS (see “TWAS and Colocalization”) in the so-called gene co-regulation score regression to identify gene sets enriched for disease heritability.
Gene-based tests
Historically, association tests have been oriented towards single variants, and this was the case for both traditional association studies and GWAS. However, this approach has some limitations that were noted early on, and a call for a shift towards gene-based tests was made [148]. Gene-based tests aggregate individual variant associations within a gene, providing a more comprehensive assessment of the gene's overall contribution to a trait or disease. This approach helps prioritize genes with multiple associated variants, enhancing the biological relevance of findings, and it has proven to be particularly useful in the case of low-frequency variants [148]. There are plenty of different methods for combining the association statistics or p-values within a gene, ranging from the simple Fisher’s method or the minimum p-value approach to more advanced methods like the Burden Test (BT) [149] or quadratic tests like SKAT [150], with variations in power [151]. Nevertheless, there is a consensus regarding the importance of incorporating LD information of the nearby variants into the methods in order to control the type I error rate at the desired level [20].
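The two simplest combination rules named above can be written in a few lines. The sketch below is illustrative only: the function names are hypothetical, and both rules assume independent p-values, which does not hold for SNPs in LD. This is exactly why the tools discussed in this section adjust for LD.

```python
import numpy as np
from scipy import stats

def fisher_combined(pvals):
    """Fisher's method: -2 * sum(log p) ~ chi-square with 2k df under independence."""
    p = np.asarray(pvals, dtype=float)
    stat = -2.0 * np.log(p).sum()
    return float(stats.chi2.sf(stat, df=2 * p.size))

def min_p_sidak(pvals):
    """Minimum p-value with a Sidak correction for k independent tests."""
    p = np.asarray(pvals, dtype=float)
    return float(1.0 - (1.0 - p.min()) ** p.size)

# Hypothetical SNP-level p-values for one gene.
snp_pvals = [0.04, 0.20, 0.01, 0.50]
print(fisher_combined(snp_pvals), min_p_sidak(snp_pvals))
```

Fisher's method gains power when many variants carry weak signal, whereas the minimum p-value rule favors a single strong variant.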
VEGAS, GATES, fastBAT and GCTA are among the oldest tools available for summary data, and they remain efficient and widely used. SKAT (Sequence Kernel Association Test) is a well-known regression method for testing association between variants and traits while adjusting for covariates. As a score-based variance-component test, it calculates p-values analytically by fitting the null model containing only the covariates [150]. The original SKAT method uses only IPD, but later implementations like metaSKAT or SKAT-O have been extended to handle summary data. GCTA and VEGAS also use the multivariate normal framework, adjusting the estimates for LD using a reference panel [152, 153]. Of note, GCTA also offers methods for conditional analysis (see “Fine-mapping”), and the same holds for KGG [154], whereas the new version of VEGAS allows for mixed-ethnicity populations. GATES [155], on the other hand, uses an extended Simes procedure that integrates functional information and association evidence to combine p-values, whereas fastBAT [156] offers fast analytical p-value computations. The gene analysis in MAGMA (Multi-marker Analysis of GenoMic Annotation) is based on a multiple linear principal components regression model to account for LD and uses an F-test to compute the overall gene p-value [157]. Its extension, nMAGMA, extends the lists of genes that can be annotated by integrating local signals, long-range regulation signals, and tissue-specific gene networks. It also provides tissue-specific risk signals, which are useful for understanding disorders with multi-tissue origins [158]. H-MAGMA [159] and eMAGMA [160] are two other extensions. The former integrates 3D chromatin configuration, whereas the latter leverages significant tissue-specific cis-eQTL information to assign SNPs to putative genes.
EPIC [161] and GAMBIT [162] also utilize functional data for gene-based analysis, the former using cell-type-specific gene expression data obtained from single-cell RNA sequencing and the latter using coding and tissue-specific regulatory annotations. Such methods share several features with TWAS methods (see respective section). AgglomerativLD [163] also captures LD between SNPs of nearby genes, which induces correlation of the gene-based test statistics. DOT [164] is one of the few methods that apply a decorrelation-based approach before combining SNP-level statistics or p-values. Tools like GPA [165], oTFisher [166], TS [167] and aSPU [168] implement some type of so-called adaptive test (AT), that is, they account for possibly varying association patterns across SNPs, whereas some modern tools like MKATR [169], COMBAT [170], MCA [171], OWC [172], FST [173], ACAT [174], HYST [175], GBJ [176] and sumFREGAT [177] perform the analysis with multiple statistical methods and combine the results. Notably, tools like aSPU [168], snpGeneSets [178], Pascal/PascalX [179, 180], MAGMA, chromMAGMA [181] and FUMA [182] also offer the option of performing gene-set analysis after the gene-based analysis (see next section), whereas HSVS-M [168, 183] tests the association of a gene with multiple correlated traits.
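To make the role of LD concrete, the following sketch computes a gene-level p-value for the sum of squared SNP z-scores. If the z-scores are multivariate normal with the LD matrix as correlation, the null distribution of this sum is a weighted sum of independent one-degree-of-freedom chi-squares, with the LD eigenvalues as weights. The function name, the Monte Carlo approach and the toy inputs are assumptions made for illustration; this is a generic quadratic test in the spirit of the LD-aware methods above, not the implementation of any specific tool.

```python
import numpy as np

def gene_sum_chi2_pvalue(z, ld, n_sim=100_000, seed=0):
    """Monte Carlo p-value for T = sum(z_j^2) when z ~ MVN(0, LD) under the null.

    T is distributed as sum_i lambda_i * chi2_1, where lambda_i are LD eigenvalues.
    """
    z = np.asarray(z, dtype=float)
    ld = np.asarray(ld, dtype=float)
    observed = float(z @ z)
    # Clip tiny negative eigenvalues caused by numerical error.
    evals = np.clip(np.linalg.eigvalsh(ld), 0.0, None)
    rng = np.random.default_rng(seed)
    null = rng.chisquare(1, size=(n_sim, evals.size)) @ evals
    return (1 + np.sum(null >= observed)) / (n_sim + 1)

# Three SNPs with identical z-scores: independent versus in perfect LD.
z = np.array([2.0, 2.0, 2.0])
p_indep = gene_sum_chi2_pvalue(z, np.eye(3))
p_ld = gene_sum_chi2_pvalue(z, np.ones((3, 3)))
print(p_indep, p_ld)
```

In the perfect-LD case the three SNPs carry a single piece of evidence, so the gene-level p-value is considerably larger than under independence. Ignoring LD would therefore overstate significance.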
Gene Set analysis
Gene set analysis (GSA), or pathway analysis, extends the concept of gene-based methods by jointly analyzing groups of functionally related genes and identifying biological pathways enriched with trait-associated genes. By considering the collective impact of multiple genes within a pathway, researchers can obtain a clearer picture of the underlying biological mechanisms influencing the phenotype under investigation. The first applications of such methods borrowed ideas from the microarray data analysis literature, and they have since become widespread in the analysis of GWAS [184]. Any GSA method needs to address three issues: first, how to handle SNPs of the same gene; second, how to define the appropriate gene set or pathway; and finally, how to combine the effects from multiple SNPs/genes within the same set/pathway [185]. The choices made by different methods are very diverse, leading to a wide variety of approaches. For instance, some methods operate with SNP-level statistics (effect sizes, z-scores, or p-values), assigning each SNP to the closest gene (usually within a range of ±20 kb), whereas others take as input a gene-level statistic or simply a gene list obtained by a gene-based method (of course, several tools allow for both a gene-based and a GSA approach). Regarding the choice of set, there is a plethora of databases containing biological pathways (KEGG, PANTHER, etc.), or other types of gene-set representation such as protein–protein interactions (PPI), ontologies and so on [186]. Finally, regarding the statistical method used to aggregate evidence, there is also a wide range of methods that differ in how they handle the gene-set size and gene length, the LD patterns and the presence of overlapping genes within pathways, and in whether they test the so-called competitive null hypothesis or the self-contained one [14, 187]. A tutorial regarding the use of such methods is given in [21].
Among the most easily used and frequently cited are the tools that are available through a webserver. FUMA [182] and iGSE4GWAS [188] are tools specialized in GWAS that use SNP-level statistics as inputs and differ in the subsequent analyses: FUMA uses MAGMA for gene-based testing and allows for over-representation analysis (ORA) and the Kolmogorov-Smirnov test (GSEA), whereas iGSE4GWAS maps the most significant SNP to a gene and then performs an improved GSEA with label permutation to obtain accurate p-values. Tools like Enrichr [189], g:Profiler [190], DAVID [191], WebGestalt [192] and PANTHER [193] are general-purpose enrichment tools that provide functionalities for different types of omics data (Fig. 7). They accept a gene or SNP list as input and provide an Application Programming Interface (API) ensuring interoperability, whereas for the statistical analysis they all use some version of ORA and/or GSEA (WebGestalt also uses Network Topology-based Analysis). A major feature of these tools is that they incorporate a large number of biological and pathway databases, with g:Profiler and Enrichr offering the most complete collection. GSA-SNP2 is one of the first methods to be developed for GWAS and has seen several improvements regarding the calculation of the combined gene score and the execution time, being among the fastest methods [194]. aSPUpath2 [195] and GIGSEA [196] are two methods that integrate expression data (eQTL) in the pathway analysis. The former uses an adaptive test that extends the aSPU methodology based on chi-square, whereas the latter uses a regression-based approach coupled with permutations to calculate accurate p-values. In a similar fashion, deTS [197] and PGCA perform tissue-specific enrichment analysis (TSEA) for detecting tissue-specific genes and for enrichment testing of different forms of query data. Other methods use different definitions of the gene sets, in some cases utilizing additional information.
For instance, dmGWAS [198] integrates PPI networks and uses a search method to identify subnetworks. Compared with standard pathway methods, it offers users flexibility in the definition of a gene set and can utilize local PPI information. GEMB [199] defines the gene sets using gene weights from model predictions and gene ranks from GWAS, and GENOMICper [200] uses permutations of the identified SNPs by rotation with respect to the genomic locations. GWAB [201] uses network connections to reprioritize candidate genes by integrating the GWAS and network data, whereas GenToS [202] searches for trait-associated variants in existing human GWAS. We also need to mention PAPA [203], which is a flexible tool for pleiotropic pathway analysis. As we already mentioned, aSPU, snpGeneSets, PascalX/PASCAL and MAGMA/chromMAGMA are gene-based methods that also perform GSA, whereas MAGENTA is a tool that performs meta-analysis and subsequently GSA (see “meta-analysis”). Lastly, we need to mention Inferno [204] and Mergeomics [205], which are webservers offering a variety of options that extend typical GSA applications. Inferno integrates a variety of functional genomics sources to identify causal noncoding variants using COLOC, WebGestalt, LDSC and MetaXcan. Mergeomics uses summary statistics of multi-omics association studies (GWAS, EWAS, TWAS, PWAS, etc.) and performs correction for LD, GSEA, meta-analysis and identification of regulators of disease-associated pathways and networks.
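The over-representation analysis (ORA) used by several of the enrichment tools in this section reduces to a one-sided hypergeometric test on a 2×2 table of genes. The sketch below shows that calculation with SciPy. The function name and the example counts are hypothetical, and real tools additionally apply multiple-testing correction across gene sets and choose the background gene universe with care.

```python
from scipy.stats import hypergeom

def ora_pvalue(n_background, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from a background of n_background genes, n_in_set of which are in the set."""
    # sf(k) = P(X > k), so use k = n_overlap - 1 for P(X >= n_overlap).
    return float(hypergeom.sf(n_overlap - 1, n_background, n_in_set, n_selected))

# Hypothetical example: 20,000 background genes, a pathway of 150 genes,
# 300 trait-associated genes, 12 of which fall in the pathway.
p_enrich = ora_pvalue(20_000, 150, 300, 12)
print(p_enrich)
```

This corresponds to a competitive null hypothesis: genes in the set are compared with genes outside it, rather than testing whether the set contains any association at all.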
Fine-mapping
While GWAS can identify broad genomic regions associated with a trait, they do not pinpoint the exact causal variant within those regions. Fine-mapping, which works in the opposite direction to the gene-based approaches, is a process aimed at narrowing down and identifying causal variants, that is, the specific genetic variants responsible for the observed associations between genomic regions and traits of interest. The plethora of statistical methods and study designs makes it difficult to choose an optimal approach. The different approaches that have been proposed to perform fine-mapping can be divided into three broad categories: heuristic methods that select SNPs based on LD patterns, conditional or penalized regression models that perform variable selection, and Bayesian methods that calculate posterior probabilities or Bayes Factors. Theoretical and empirical evidence suggests that Bayesian methods have superior performance [22]. Several factors may influence the performance of fine-mapping approaches, including the true number of causal SNPs in a region and their effect sizes, the local LD structure, the sample size, and the SNP density [22, 206]. Functional annotations are also of great importance, leading to the so-called functionally informed fine-mapping (FIFM) methods [206]. The hypothesis of a single causal variant is also very restrictive, and several methods have been developed to allow multiple causal variants in a region as well as to incorporate additional layers of functional annotations, like eQTL [207]. Moreover, methods for fine-mapping of multiple datasets have been proposed, either exploiting different LD patterns across ethnic groups or borrowing information between different traits [207].
As we already noted, Bayesian methods seem to have superior performance [22], and thus it is no surprise that most of the currently available methods operate in a Bayesian framework, calculating Posterior Inclusion Probabilities (PIP) and/or Bayes Factors (BFs) in various settings: PAINTOR [208], DAP [209], fgwas [210], FINEMAP [211], flashfm [212], FINMOM [213], CARMA [214] and CAVIAR/CAVIARBF [215]. MsCAVIAR [216] is an extension of the latter method that leverages information from multiple studies and is useful in trans-ethnic fine-mapping. Similarly, XMAP [217] performs cross-population fine-mapping by leveraging genetic diversity and accounting for confounding bias. BEATRICE [218] is a unique method that combines a hierarchical Bayesian model with a deep learning-based inference procedure, whereas RIVIERA-beta [219] performs Bayesian fine-mapping using epigenomic reference annotations. On a different level, PolyFun/PolyLoc [220] do not perform fine-mapping per se but are used for estimating the prior causal probabilities of SNPs, which can then be used by other Bayesian fine-mapping methods. SusieR [221], BVS-PICA [222] and JAM [223] also operate in a Bayesian regression framework, performing variable selection and penalized regression. Other regression-based methods, like SOJO [224] and ANNORE [225], work in a frequentist framework and perform lasso-type shrinkage and differential shrinkage via random effects, respectively, whereas GSR utilizes a gene score regression approach [226] and RSS performs multiple regression utilizing the so-called summary statistics likelihood [227]. AHIUT [228] performs an intersection–union test based on a joint/conditional regression model with all the SNPs in a region.
Lastly, we need to mention PICS2 [229], which performs probabilistic identification of causal SNPs and is the only one of these methods that is available as a webserver, and echolocatoR [230], which requires minimal input from users and integrates a suite of fine-mapping tools to identify consensus variants, test enrichment and visualize the results.
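The simplest Bayesian fine-mapping calculation assumes exactly one causal variant in the region and equal prior probability for every SNP. Each SNP's posterior inclusion probability is then its Bayes factor divided by the sum of Bayes factors in the region. The sketch below uses Wakefield-style approximate Bayes factors computed from effect estimates and standard errors. The function names, the prior standard deviation of 0.2 and the toy inputs are assumptions for illustration; the multi-causal-variant methods listed above are considerably more involved and also use the LD matrix.

```python
import numpy as np

def single_causal_pips(beta, se, prior_sd=0.2):
    """PIPs under a single-causal-variant assumption with a uniform prior.

    Uses approximate Bayes factors (alternative vs. null) with effect prior
    N(0, prior_sd^2): BF = sqrt(V / (V + W)) * exp(z^2 / 2 * W / (V + W)).
    """
    beta = np.asarray(beta, dtype=float)
    se = np.asarray(se, dtype=float)
    V, W = se ** 2, prior_sd ** 2
    z2 = (beta / se) ** 2
    log_bf = 0.5 * np.log(V / (V + W)) + 0.5 * z2 * W / (V + W)
    log_bf -= log_bf.max()          # stabilize before exponentiating
    weights = np.exp(log_bf)
    return weights / weights.sum()

def credible_set(pips, coverage=0.95):
    """Smallest set of SNP indices whose PIPs sum to at least `coverage`."""
    order = np.argsort(pips)[::-1]
    cumulative = np.cumsum(pips[order])
    k = int(np.searchsorted(cumulative, coverage)) + 1
    return order[:k]

# Hypothetical region with three SNPs; the second has the strongest signal.
pips = single_causal_pips([0.02, 0.12, 0.03], [0.02, 0.02, 0.02])
print(pips, credible_set(pips))
```

When two SNPs are in strong LD and have similar z-scores, this calculation splits the posterior mass between them, which is the expected behavior: the data cannot distinguish them.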
Analysis of multiple traits
In this section we review methods developed for handling multiple traits. Depending on the type of data and the purpose of the analysis, the methods can be divided into pleiotropy methods, methods that calculate the genetic correlation, methods for Mendelian randomization, and transcriptome-wide association and colocalization methods.
Pleiotropy
Pleiotropy is the phenomenon in which a single variant influences several traits [231]. Methods for detecting pleiotropy are of great importance in genetic research, and several have been developed in recent years. A major goal of such methods is to increase the statistical power over single-trait methods. Imagine, for instance, a variant that produces a near-significant effect when analyzed separately for two or three traits. A method that can combine these estimates may produce significant results. Another application of a joint analysis would be to identify variants that influence both traits, or variants that influence only one of them. When all the relevant variants are considered, one can also estimate the kind of relationship between the traits (see “genetic correlation”). A review of the statistical methods to detect pleiotropy in complex traits can be found in [25]. Usually, the methods that allow for multiple-trait analysis are oriented toward quantitative traits like BMI, SBP, DBP and so on, which are traditionally measured on a single cohort, resulting in cross-trait correlation that needs to be taken into account in the analysis. However, there are also methods for performing the same analysis with summary estimates derived from different cohorts, as well as methods that allow for binary traits with the case–control design, using overlapping or non-overlapping controls.
All these methods base their inference on the assumption that the z-statistics follow a multivariate normal distribution (MVN) and perform different types of tests and/or different procedures to estimate or approximate the correlation structure. ACA [232], one of the first methods, estimates the trait covariance from a subset of the phenotypic data or from published studies, p_ACT [233] integrates the MVN using the trait correlation, PAT [234] uses a likelihood-ratio test, and PLEI [235] uses the union-intersection testing method; in addition to the likelihood ratio test, it also applies generalized estimating equations under the working independence model, and it can be applied for both marginal and conditional analysis. USAT [236] uses a score-based test, JaSPU [237] uses an adaptive test which is robust to violations of the MVN assumptions and MTAR [238] uses a Principal Components (PC)-based test. BMASS [239], on the other hand, is a Bayesian multivariate method, whereas TWT [240], MTAFS [241] and EBMMT [242], which are among the newer tools, perform a Cauchy Combined Test (CCT) to handle the correlation structure and obtain accurate p-values. SHAHER [243] uses a linear combination of traits that maximizes the proportion of its genetic variance explained by the shared variants and allows both shared and unshared variants to be effectively analyzed, and HIPO [244] performs heritability-informed power optimization for conducting multi-trait association analysis. HOPS [245] computes a horizontal pleiotropy score by removing correlations between traits caused by vertical pleiotropy and normalizing effect sizes across all traits, and PDR [246] performs a pleiotropic decomposition regression to identify shared components and their underlying genetic variants.
We also need to mention methods like MTAG [247] and PLEIO [248], which use LDSC and, apart from sample overlap, also allow data from multiple studies, something that can be considered a form of meta-analysis, as well as methods like MSKAT [249], multiSKAT [250], MGAS [251], MAIUP [252] and MTAR (multi-trait analysis of rare variants) [253], which are gene-based methods specialized for multiple traits. Finally, methods like iMAP [254] and graphGPA2 [255] use graphical models and are capable of analyzing a large number of traits.
On the other hand, there are several methods that assume independence of the studied samples. Most of them are designed for larger analyses of many traits from multiple studies, for instance PolarMorphism [256], JASS [257], gwas-pw [258], FactorGo [259], sumDAG [260], combGWAS [261] and GCPBayes_pipeline [262]. GCPBayes_pipeline uses the functionality of GCPBayes to perform cross-phenotype gene-set analysis between two traits. gwas-pw is used for the joint analysis of two GWAS in order to identify variants influencing both traits. PolarMorphism is based on a transform from Cartesian to polar coordinates and reports a per-variant degree of 'sharedness' across traits, whereas FactorGo provides a scalable variational factor analysis model that is computationally efficient for a large number of traits. JASS provides interactive exploration and visualization of the results of the comparison of many traits through a web interface (Fig. 8 A-C), sumDAG goes one step further and constructs phenotype networks by using a Gaussian linear model and a directed acyclic graph, and combGWAS identifies susceptibility variants for comorbid disorders and calculates genetic correlations. EPS [263] and GPA [264] differ in that they integrate pleiotropy and functional annotation from eQTL.
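The multivariate normal assumption shared by the single-cohort methods above leads directly to the simplest joint test. If the vector of z-statistics of one variant across K traits has null correlation matrix R, then z'R⁻¹z follows a chi-square distribution with K degrees of freedom under the null. The sketch below implements this generic omnibus test. It is not the procedure of any specific tool listed here, the function name is hypothetical, and the correlation matrix in the example is an assumed value that in practice would have to be estimated (for example from null SNPs).

```python
import numpy as np
from scipy import stats

def omnibus_multitrait_test(z, trait_corr):
    """Chi-square test of 'no association with any trait' for one variant.

    z: z-statistics of the variant for K traits.
    trait_corr: K x K correlation of the z-statistics under the null.
    """
    z = np.asarray(z, dtype=float)
    R = np.asarray(trait_corr, dtype=float)
    stat = float(z @ np.linalg.solve(R, z))
    return stat, float(stats.chi2.sf(stat, df=z.size))

# A variant that is near-significant for each of two weakly correlated traits.
z = np.array([1.8, 1.9])
R = np.array([[1.0, 0.1],
              [0.1, 1.0]])
single_trait_p = 2 * stats.norm.sf(np.abs(z))
stat, joint_p = omnibus_multitrait_test(z, R)
print(single_trait_p, joint_p)
```

In this toy example neither single-trait p-value is below 0.05, while the joint p-value is. This is the gain in power that motivates multi-trait analysis.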
Genetic correlation
Genetic correlation is related to pleiotropy and describes the relationship between two traits, that is, the extent to which the genetic variants influencing one trait overlap with the genetic variants associated with the other. It can thus quantify the overall genetic similarity and provide insights into the polygenic architecture of complex traits [23]. As we already saw, analyzing multiple traits simultaneously may increase power in the case of horizontal pleiotropy; an additional potential application is to use the estimated correlation in order to establish causality between traits in the case of vertical pleiotropy (see also next sections). Since heritability is the proportion of the phenotypic variance explained by genotypic variation, it is no surprise that the genetic correlation (or the genetic covariance) is related to the traits’ heritabilities. Thus, several of the methods for estimating heritability discussed earlier, like HESS and SumHer, can also calculate the correlation between traits. The most commonly used method for calculating genetic correlation, however, is LDSC (LD Score Regression). The method was originally developed for distinguishing polygenicity from bias by examining the relationship between test statistics and LD score, but it is also used for estimating heritability and genetic correlation [133]. LDSC is also available through the LD Hub server. PCGC-s [265] is an adaptation of stratified LDSC for case–control studies and can also estimate genetic heritability, genetic correlation, and functional enrichment. Another popular tool is GNOVA [266], which calculates annotation-stratified covariance using the method of moments and allows for sample overlap. Its extension, SUPERGNOVA [267], identifies global and local genetic correlations that could provide new insights into the shared genetic basis of many phenotypes. Local correlations, among others, can also be computed using LAVA [268].
HDL [269] is a likelihood-based method which produces more precise estimates. However, a recent comparison found that LDSC and GNOVA give estimates that are similar to each other and are more robust to LD and sample overlap than HDL. In that comparison, HDL provided biased estimates of the genetic covariance in most cases and could not distinguish genetic from non-genetic correlation. Moreover, HDL restricts users to the built-in reference panel, and it performs poorly when the number of SNPs shared between the reference panel and the GWAS is small [24]. Other tools provide somewhat different types of analyses. For instance, Popcorn [270] estimates transethnic genetic correlation, GECKO [271] estimates both genetic and environmental covariances, PhenoSD [272] uses LDSC for estimating phenotypic correlations and then performs correction for multiple testing using the spectral decomposition of matrices, whereas LPM [273] is a latent probit model, scalable to hundreds of annotations and phenotypes, that integrates functional annotations. ccGWAS [274] is a tool for comparing two different disorders with small genetic correlation, providing a case-case association test, and RHOGE [275] estimates the genetic correlation between two traits as a function of the predicted gene expression effect. LOGOdetect [276] uses scan statistics with an LD score-weighted inner product of local z-scores to identify small segments that harbor local genetic correlation between two traits. DONUTS [277] is a unique method since it operates on summary statistics from families.
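For reference, the relation between genetic covariance, heritability and genetic correlation discussed above can be written compactly, together with the cross-trait LD score regression equation as it is commonly stated. The notation is introduced here for illustration and does not appear in the text: $h_1^2$ and $h_2^2$ are the SNP heritabilities of the two traits, $\rho_g$ is their genetic covariance, $\ell_j$ is the LD score of SNP $j$, $M$ is the number of SNPs, $N_1$ and $N_2$ are the sample sizes, $N_s$ is the number of overlapping samples, and $\rho$ is the phenotypic correlation among the overlapping samples.

```latex
\begin{align}
r_g &= \frac{\rho_g}{\sqrt{h_1^2 \, h_2^2}}, \\
\mathbb{E}\left[ z_{1j} z_{2j} \right]
  &= \frac{\sqrt{N_1 N_2}\,\rho_g}{M}\,\ell_j
   + \frac{\rho\, N_s}{\sqrt{N_1 N_2}}.
\end{align}
```

In the second equation the sample-overlap term does not depend on the LD score and therefore only shifts the intercept. This is one way to see why a regression on LD scores can tolerate overlapping samples, provided the intercept is estimated rather than fixed.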
Mendelian randomization
Mendelian Randomization (MR) is a method suggested in the pre-GWAS era to investigate causal relationships between two traits, usually a phenotype and a disease [278], using genotype–trait associations to make inferences about environmentally modifiable causes of the traits. In technical terms, MR uses genetic variants as instrumental variables [279] to mimic the random assignment of exposures in a randomized controlled trial, since Mendel's laws of inheritance dictate the random assortment of alleles during gamete formation. By utilizing the natural randomization of genetic inheritance, MR aims to minimize the biases introduced by confounding factors that usually affect observational studies when investigating the association of two traits. Usually, we are interested in a disease and some other intermediate phenotype, or another disease. For instance, the MR approach may involve the relationship between hypertension and BMI, or between hypertension and diabetes. Traditionally, MR was performed with one sample (1SMR) using a single variant (usually referred to as IPD methods), and subsequently multivariate methods for MR meta-analysis were developed [280]. With the emergence of GWAS, these methods evolved into the most commonly used two-sample MR (2SMR) methods, which utilize summary data estimates from several variants regarding the genotype–phenotype and genotype–disease associations from different samples [26, 281]. To establish the connection with the previous sections, MR seeks to analyze correlated traits [282] and to provide evidence for causation, in other words to distinguish vertical from horizontal pleiotropy.
Several standard methods for MR in GWAS with summary data have been made available in recent years: the inverse-variance weighted method (IVW), the various types of median estimators (simple or weighted) and the MR-Egger regression approach. IVW gives consistent estimates only if all the genetic variants in the analysis are valid instruments. The median estimator is consistent even when up to 50% of the information comes from invalid instrumental variables, whereas MR-Egger performs equally well but provides somewhat less precise estimates [283]. These methods are readily available in standard packages like TwoSampleMR [284] and MR [285]. The functionalities of TwoSampleMR are also offered, at least partially, through the webserver of MRBASE [284], which is the only MR tool available as a webserver (see Fig. 8, D). BWMR [286] is a tool that performs MR in a Bayesian framework. Besides the important issue of weak instruments, most modern methods also aim to perform the MR analysis while accounting or correcting for horizontal pleiotropy. For instance, pIVW [287] is an extension of IVW that accounts simultaneously for weak instruments and balanced horizontal pleiotropy, and MRmix [288] uses a mixture approach allowing a fraction of the instruments to have a pleiotropic effect on the outcome. Similarly, MRcML [289], MR-LDP [290], MR-Corr2 [291] and MR-PRESSO [292] provide functionalities to account for horizontal pleiotropy, whereas IMRP [293] takes a different approach and searches iteratively for horizontally pleiotropic variants and causal effects. MR-APSS [294] differs in that it performs MR accounting for both pleiotropy and sample structure, which seems to be another important confounder (and includes population stratification, cryptic relatedness, and sample overlap); MRlap [295] considers both weak instrument bias and winner's curse, accounting for sample overlap.
MR.CUE [296] and TS_LMM [297] offer additional functionality for handling variability of the estimates. LCV [298] is a method that estimates causal associations between traits while avoiding confounding by genetic correlation, whereas OMR [299] uses information from all GWAS SNPs for causal inference and JAM-MR [300] performs variable selection and causal effect estimation in MR. CS [301], BiDirectCausal [302], MRCI [303] and LHC-MR [304] constitute another important class of methods since they can identify bidirectional causal effects. Another important extension is offered by methods like MR2 [305], MV-MR [306], MRBEE [307], MVMR-cML [308] and adOMICs [309], which extend the MR framework to the multivariate setting, allowing more than one exposure or outcome, as well as MR-BMA [310], which goes one step further by performing multivariate MR in a Bayesian framework. Finally, other methods like hJAM [311], MR.RAPS [312] and MRPEA [313] offer more advanced options. hJAM unifies the framework of MR and TWAS and can be applied to correlated instruments and multiple intermediates, MR.RAPS uses a three-sample genome-wide design with many independent genetic instruments across the genome to handle many weak genetic instruments and pleiotropy, whereas MRPEA uses a pathway association MR analysis approach with data on environmental exposures.
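The three standard summary-data estimators described above (IVW, weighted median and MR-Egger) can be sketched in plain NumPy. The code is a minimal illustration under stated assumptions and does not reproduce the interface of TwoSampleMR or any other package: the function names and the simulated data are hypothetical, instruments are assumed to be uncorrelated, first-order weights are used, and standard errors are omitted except for the fixed-effect IVW estimate.

```python
import numpy as np

def ivw(bx, by, se_y):
    """Inverse-variance weighted estimate (fixed effect) and its standard error."""
    bx, by, se_y = map(np.asarray, (bx, by, se_y))
    w = bx ** 2 / se_y ** 2
    return float(np.sum(bx * by / se_y ** 2) / np.sum(w)), float(np.sqrt(1.0 / np.sum(w)))

def weighted_median(bx, by, se_y):
    """Weighted median of the per-variant ratio estimates by / bx."""
    bx, by, se_y = map(np.asarray, (bx, by, se_y))
    ratio, w = by / bx, bx ** 2 / se_y ** 2
    order = np.argsort(ratio)
    ratio, w = ratio[order], w[order]
    # Cumulative weight at the midpoint of each variant, scaled to (0, 1).
    cum = (np.cumsum(w) - 0.5 * w) / w.sum()
    return float(np.interp(0.5, cum, ratio))

def mr_egger(bx, by, se_y):
    """Weighted regression of by on bx with an intercept; returns (intercept, slope)."""
    bx, by, se_y = map(np.asarray, (bx, by, se_y))
    sign = np.sign(bx)                 # orient variants so that bx > 0
    bx, by = bx * sign, by * sign
    X = np.column_stack([np.ones_like(bx), bx])
    XtW = X.T * (1.0 / se_y ** 2)
    intercept, slope = np.linalg.solve(XtW @ X, XtW @ by)
    return float(intercept), float(slope)

# Simulated summary statistics: causal effect 0.3, and 20% of the instruments
# with directional pleiotropy (all values are illustrative assumptions).
rng = np.random.default_rng(1)
n_snps, causal = 40, 0.3
bx = rng.uniform(0.05, 0.2, n_snps)
se_y = np.full(n_snps, 0.01)
by = causal * bx + rng.normal(0.0, se_y)
by[:8] += 0.05

print("IVW:", ivw(bx, by, se_y)[0])
print("Weighted median:", weighted_median(bx, by, se_y))
print("MR-Egger (intercept, slope):", mr_egger(bx, by, se_y))
```

An MR-Egger intercept that differs from zero is usually interpreted as evidence of directional pleiotropy, whereas the slope is the pleiotropy-adjusted causal estimate.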
Colocalization and TWAS
As we already described, the MR approach involves the combination of two types of data, a genotype–disease association and a genotype–phenotype association. If the phenotype is gene expression, that is, the result of an eQTL study, then we have two distinct but fundamentally related methods, the transcriptome-wide association study (TWAS) and the colocalization approach (Fig. 9). TWAS is based on the idea that genetic variants can influence gene expression, which subsequently can affect complex traits or diseases [27]. Thus, the approach uses information from eQTL to identify associations between predicted gene expression levels and complex traits/diseases [314]. Even though there are several different methods, the resemblance to MR is obvious. In fact, several methods use the term MR: SMR, which uses a single variant [315], GSMR, which uses multiple variants [310], and PMR [316], which can account for correlated instruments and horizontal pleiotropy and can accommodate both single traits and multiple correlated outcomes. Likewise, the authors of TScML [317], which uses two-stage constrained maximum likelihood (an extension of 2SLS), explicitly state that it can be used for both MR and TWAS analyses. FUSION and S-PrediXcan are the oldest and most widely known methods. FUSION is the current implementation of the first TWAS method [318], whereas S-PrediXcan [319] is the summary-data version of PrediXcan. Xu et al. [320] noted that PrediXcan and TWAS can be viewed as a special case of general association testing with multiple SNPs in a GLM and proposed the so-called sum of powered score (SPU) test implemented in aSPU-TWAS [320]. A subsequent evaluation has shown that the original TWAS statistic is equivalent to an LD-aware version of standard MR [321]. iFunMed [322] and sMIST [323] formulate the problem within the framework of mediation analysis, and similarly PTWAS [324] applies principles from instrumental variables analysis.
Comm-S* [325] uses a variational Bayesian EM algorithm and a likelihood ratio test to assess expression-trait association. Its extension Tiss-Comm [326] leverages the co-regulation of genetic variations across different tissues explicitly via a unified probabilistic model and also detects the tissue-specific role of candidate target genes in complex traits. Similar multi-tissue approaches are followed by fQTL [327], sCCA [328] and UTMOST [329]. Primo [330], and OPERA [331] extend further the integration by allowing different types of xQTL data (eQTL, pQTL, mQTL etc.) to allow estimation under different conditions, whereas SUMMIT [332] uses a large eQTL summary-level dataset, penalized regression and Cauchy Combination Test and HMAT [333] aggregates TWAS association tests obtained across multiple gene expression prediction models using the harmonic mean P-value combination (HMP). BGW [334] and ARCHIE [335] are two methods that utilize trans-regulated eQTLs. Other tools use combination of methods, like TIGAR [336] which combines DPR and PrediXcan, whereas others, like JEPEGMIX2‐P [337] or FOCUS [338], perform TWAS using pathway information, or use LD to perform fine-mapping over the gene–trait association signals obtained from TWAS, respectively. Even though the various methods discussed here have different modeling assumptions and many were initially developed to answer different biological questions, a recent technical review of the TWAS methods showed that all can be viewed as versions of the two-sample MR analysis [339]. Indeed, several recent tools like MRLocus [340], TWMR [341], and Mr.MtRobin [342] make explicit use of the MR methodology and jargon in order to perform a sophisticated TWAS. MRLocus performs first a colocalization step to each nearly-LD-independent eQTL, and then performs an MR analysis step across eQTLs. TWMR performs a multi-gene multi-instrument MR approach to identify genes whose expression influence the phenotype. 
Finally, Mr.MtRobin uses multi-tissue eQTL and a reverse regression random slope mixed model to infer whether a gene is associated with a complex trait. As we have already noticed, webTWAS, apart from the database, also offers a webserver for accessing S-PrediXcan, SMR and UTMOST with user supplied datasets.
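The close relationship between TWAS and two-sample MR can be illustrated with the simplest possible case, a single cis-eQTL used as the instrument for the expression of one gene. The sketch below computes the ratio (Wald) estimate of the expression-to-trait effect from eQTL and GWAS summary statistics, together with a delta-method standard error and a one-degree-of-freedom chi-square test, in the spirit of single-variant approaches. It is a sketch under stated assumptions (one independent instrument, no pleiotropy, non-overlapping samples), not the code of any cited tool; the function name and the input values are hypothetical.

```python
import math


def single_variant_ratio_test(b_zx, se_zx, b_zy, se_zy):
    """Ratio estimate of the effect of gene expression (x) on a trait (y).

    b_zx, se_zx : effect of the SNP on expression (eQTL study) and its SE
    b_zy, se_zy : effect of the same SNP on the trait (GWAS) and its SE
    Both effects are assumed to refer to the same effect allele.
    Returns (b_xy, se_xy, chi2, p_value).
    """
    if b_zx == 0:
        raise ValueError("the instrument has no effect on expression")
    b_xy = b_zy / b_zx
    # First-order delta-method variance of a ratio of independent estimates.
    var_xy = se_zy ** 2 / b_zx ** 2 + (b_zy ** 2) * se_zx ** 2 / b_zx ** 4
    chi2 = b_xy ** 2 / var_xy
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return b_xy, math.sqrt(var_xy), chi2, p_value


# Hypothetical summary statistics for one SNP-gene-trait triplet.
b_xy, se_xy, chi2, p_value = single_variant_ratio_test(
    b_zx=0.5, se_zx=0.05, b_zy=0.2, se_zy=0.05
)
```

Algebraically the statistic equals z_zx² z_zy² / (z_zx² + z_zy²), where z_zx and z_zy are the eQTL and GWAS z-scores, so it can never exceed the weaker of the two squared z-scores. This is one way to see why a strong eQTL signal is required for such an analysis to have power.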
Another method that also uses GWAS results along with eQTL data is colocalization. Colocalization approaches are used to assess whether two different traits or diseases share a common causal genetic variant or set of variants at a specific genomic locus [13]. Colocalization analysis identifies genetic variants that show significant association in both GWAS and eQTL studies. However, unlike TWAS, it does not perform gene expression prediction and gene-trait association tests, but focuses on the colocalized SNPs [28]. TWAS and colocalization are related but not identical approaches, since it has been shown that they may give different results under different conditions (for instance in the case of horizontal pleiotropy), and thus it has been suggested that they should be used in a complementary manner [28, 343]. COLOC was one of the first methods for colocalization and has seen several improvements [344, 345] (see also Fig. 9). The latest version uses SuSiE and allows evidence for association at multiple causal variants to be evaluated simultaneously, while at the same time separating the statistical support for each variant conditional on the causal signal being considered. MOLOC [346] is a multiple-trait version of COLOC, operating in a Bayesian framework that integrates GWAS summary data with multiple xQTL data to identify regulatory effects, HyPrColoc [347] is a deterministic Bayesian method that detects colocalization across large numbers of traits, and SS2 [348] operates across any number of gene-tissue pairs allowing for sample overlap. LLR [349] colocalizes genetic risk variants across multiple GWAS and phenotypes, whereas POEMcoloc [350] is an approximation to the COLOC method that can be applied when limited data are available. SparkINFERNO [351], PwCoCo [352] and ColocQuiaL [353] are pipelines offering additional functionalities, all using COLOC.
eCAVIAR [354] is another popular method that uses a probabilistic model accounting for more than one causal variant at a given locus. MSG [355] increases power using a spliced gene approach and SharePro [356] integrates LD modeling and colocalization assessment to account for multiple causal variants in colocalization analysis. PESCA [357] uses ancestry-matched estimates of LD in order to infer the proportions of population-specific and shared causal variants in two populations. These estimates are then used as priors in an empirical Bayes framework for colocalization and to test for enrichment of these causal variants in loci of interest. Lastly, we have to mention the methods that operate as webservers, offering ease of use. Sherlock [358], which is also one of the oldest methods, uses a database of eQTL associations from different tissues to identify genetic signatures that match those for specific genes. Unlike other methods, it incorporates information from both cis- and trans-eQTL SNPs. LocusFocus [359] is a web-based colocalization tool that tests colocalization using the Simple Sum method to identify relevant genes and tissues for a particular GWAS locus in the presence of high linkage disequilibrium and/or allelic heterogeneity. Regarding the analysis of eQTL data, ezQTL [360] is a webserver performing various tasks like data quality control for variants matched between different datasets, LD visualization, and colocalization analysis using eCAVIAR and HyPrColoc, whereas BAGEA [361] uses a variational Bayes framework to model cis-eQTLs using directed and undirected genomic annotations.
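The logic of Bayesian colocalization under the assumption of at most one causal variant per trait in a region can be sketched in a few lines. The example below converts per-SNP effect sizes and standard errors into approximate Bayes factors and combines them into posterior probabilities for five hypotheses: no association (H0), association with the first trait only (H1), with the second trait only (H2), with both traits through different variants (H3), and with both traits through a shared variant (H4). This is a simplified illustration and not the implementation of COLOC or of any other cited tool; the prior standard deviation of the effect sizes (0.15) and the prior probabilities p1, p2 and p12 are illustrative assumptions, and all function names and toy values are hypothetical.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_diff_exp(a, b):
    """log(exp(a) - exp(b)); returns -inf when the difference underflows."""
    if b >= a:
        return float("-inf")
    return a + math.log1p(-math.exp(b - a))


def log_abf(beta, se, prior_sd):
    """Log approximate Bayes factor for a normal prior N(0, prior_sd^2)."""
    z = beta / se
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    return 0.5 * (math.log(1.0 - r) + r * z ** 2)


def coloc_posteriors(beta1, se1, beta2, se2, prior_sd=0.15,
                     p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities [H0, H1, H2, H3, H4] for one region.

    The four input lists must refer to the same SNPs in the same order.
    """
    if len(beta1) < 2:
        raise ValueError("at least two SNPs are needed in the region")
    l1 = [log_abf(b, s, prior_sd) for b, s in zip(beta1, se1)]
    l2 = [log_abf(b, s, prior_sd) for b, s in zip(beta2, se2)]
    s1, s2 = log_sum_exp(l1), log_sum_exp(l2)
    s12 = log_sum_exp([a + b for a, b in zip(l1, l2)])  # same SNP in both
    log_h = [
        0.0,
        math.log(p1) + s1,
        math.log(p2) + s2,
        # All pairs of SNPs minus the pairs where both SNPs coincide.
        math.log(p1) + math.log(p2) + log_diff_exp(s1 + s2, s12),
        math.log(p12) + s12,
    ]
    norm = log_sum_exp(log_h)
    return [math.exp(h - norm) for h in log_h]


# Toy region with five SNPs (hypothetical values, all SEs equal to 0.05).
se = [0.05] * 5
pp_shared = coloc_posteriors([0, 0.3, 0, 0, 0], se, [0, 0.275, 0, 0, 0], se)
pp_distinct = coloc_posteriors([0, 0.3, 0, 0, 0], se, [0, 0, 0, 0.275, 0], se)
```

In the first toy call the same SNP carries the signal for both traits and nearly all posterior mass falls on H4, whereas in the second call the signals sit on different SNPs and H3 dominates. Real analyses additionally depend on LD between variants, which the single-causal-variant formulation handles only implicitly and which multi-variant methods model directly.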
Conclusions
Summary statistics offer protection of privacy over IPD, as well as significant advantages in computational cost, which does not scale with the number of individuals in the study [11]. Naturally, in the post-GWAS era it is expected that a large number of methods would be developed to perform analysis using the summary results of GWAS [11]. These methods, which integrate data from multiple sources such as LD, gene expression and biological pathways, aim to provide biological insight and improve our understanding of the functional role of identified variants [12,13,14,15]. We should emphasize that GWAS summary statistics are not mere replacements for IPD. Of course, some types of analysis, like meta-analysis, heritability analysis and fine-mapping, can be applied using either summary data or IPD. In such cases the summary data methods greatly enhance applicability and ease of use, overcoming the limitations of IPD mentioned earlier. However, methods for other types of analysis, particularly those that use multiple datasets, like TWAS, colocalization or Mendelian Randomization, were designed with summary data and the integration of data from multiple sources in mind. This is exactly the spirit of the so-called post-GWAS analysis that brought bioinformatics into a central role in genetics research [11]. Most of the “success stories” in GWAS in recent years can be attributed to the development and application of such methods in identifying new variants, in functional annotation, causal discovery or even in medical applications [2, 12, 362].
In this work we conducted, for the first time in the literature, a systematic review in order to identify software tools and databases dedicated to GWAS summary data analysis. We categorized the tools and databases by their functionality, in categories related to data, single-trait analysis, and multiple-trait analysis, along with their sub-categories, which we analyzed and reviewed. We also compared the tools and databases based on their features, limitations, and user-friendliness. Our review identified a wide range of tools, each with unique strengths and limitations. We provided descriptions of the key features of each tool and database, including their input/output formats, data types, and computational requirements. We also discussed the overall usability and applicability of each tool for different research scenarios. We identified families of related tools for performing different or complementary tasks, for instance the CAVIAR tools (CAVIAR, CAVIARBF, msCAVIAR, eCAVIAR), the EpiXcan tools (S-MultiXcan, S-PrediXcan), the LDAK programs (SumHer, GBAT), the MAGMA tools (nMAGMA, H-MAGMA, eMAGMA) and so on. We need to emphasize that in many cases a tool originally developed for IPD is later adapted to handle summary data, whereas in other cases a tool is succeeded by a newer version with added capabilities. For instance, the original PrediXcan method uses only IPD, but it is now considered deprecated. S-PrediXcan and S-MultiXcan are later versions that are designed to be used with summary data. The same holds for SKAT. The original method uses only IPD, but later implementations like metaSKAT or SKAT-O allow for summary data as well. At the same time, it is important to note that there are several tools that combine different functionalities.
For instance, there are tools that can perform meta-analysis and GSA (MAGENTA), gene-based methods that also offer functionalities for conditional analysis (GCTA), methods for analysis of multiple traits with gene-based tests (multiSKAT, MSKAT), methods that can be seen either as methods for multiple traits or as meta-analysis (PLEIO, PASCAL), and methods that perform both GSA and gene-based tests (aSPU, snpGeneSets, PascalX, PASCAL, MAGMA, FUMA). Of course, there are several single-purpose methods that use and combine different statistical tests or different methods (OWC, MCA, TWT, EBMMT, COMBAT, sumFREGAT, MKATR), and we should not forget methods like LDSC, with its variants, which was originally developed for distinguishing polygenicity from bias, but is also used for estimating heritability and genetic correlation and is integrated in many other tools and pipelines.
As we already mentioned, the tools and databases included in the study were those with a functioning URL. In many publications identified through the literature search the URL was not working. In some situations, we recovered a valid link by performing Google searches, or by identifying the authors’ websites, but in many cases this was not enough. Similarly, several tools deposited in CRAN had been removed or archived. This kind of problem has been known in the scientific community for years [363,364,365]. However, there is more to it. Even for the tools included in the review, we could not verify without proper testing that they all work seamlessly, especially the older ones [366]. Operating systems evolve, programming languages change, and with these the dependencies of each software also change. Even though best practices are available [367], it is not always realistic to expect complex software to work forever without maintenance. Even for some of the tools having valid URLs, for instance those deposited on GitHub or on personal web pages, we found statements by the authors indicating that the software is no longer maintained and that it is not easy to provide technical support. It is clear that more advanced solutions should be pursued. For instance, among the tools we identified, the majority are written in R and Python, but only a handful are available as webservers: ten of the tools for GSA, three tools for colocalization, two tools for meta-analysis, and one each for pleiotropy analysis, MR and fine-mapping. Of course, several of the secondary databases we identified also provide the functionality of performing the analysis using data provided by the user (webTWAS, TSEA-DB, PCGA), but even counting these the proportion of web-tools is rather low (< 10%).
Web servers and web services have become highly relevant to the field of bioinformatics during the last 20 years [368], so an increasing number of relevant webservers can be expected in the near future, since tools are available that facilitate the incorporation of existing applications [369,370,371,372]. On the other hand, some tools may be too computationally demanding, so other solutions must be found. Container-based applications [373, 374] such as Docker can simplify maintenance procedures and add to the reproducibility of research [375]. Community efforts such as udocker [376] may promote usability of complex software tools by non-experts in multi-user environments.
As data accumulate, it is unavoidable that analyses will move to an even larger scale. Traditionally, the large-scale analysis of many gene-disease associations is modeled by the so-called diseasome [377, 378] using graph theoretic methods [379, 380]. The gene-disease network is composed of pairwise associations obtained from public databases and is a bipartite network [379], that is, it consists of two separate sets of nodes and of interactions that connect only nodes belonging to different sets. The projection onto one or the other of the sets leads to the gene–gene or the disease-disease projected networks, which inform us about the associations between members of the same set (for instance, two diseases are connected if they share common genes, and so on). Such methods have been available for years, but they treat the associations as fixed inputs to the graph. As data accumulate and even more complex statistical methods are developed that allow cross-trait comparisons and combined analyses of multiple traits, along with the integration of different types of data such as xQTL, it is tempting to speculate that a fusion of these two traditions may come, in which the statistical formalism of the tools presented in this review will merge with the graph theoretic approaches developed in the systems biology literature. For instance, we may see network approaches leading to causal analyses (similar to MR) that consider simultaneously all the diseases and traits for which we have GWAS summary data, or similar approaches that integrate xQTL data of various types, different tissues and so on.
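The projection of a bipartite gene-disease network can be sketched as follows. The example takes a list of gene-disease associations and builds the disease-disease projection, in which two diseases are connected by an edge whose weight is the number of genes they share. It is a minimal illustration with placeholder labels rather than real associations, and the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_diseases(associations):
    """Build the disease-disease projection of a gene-disease network.

    associations : iterable of (gene, disease) pairs
    Returns a dict mapping (disease_a, disease_b), with disease_a < disease_b,
    to the number of genes associated with both diseases.
    """
    diseases_by_gene = defaultdict(set)
    for gene, disease in associations:
        diseases_by_gene[gene].add(disease)
    shared_genes = defaultdict(int)
    for diseases in diseases_by_gene.values():
        # Every pair of diseases linked to the same gene gains one shared gene.
        for pair in combinations(sorted(diseases), 2):
            shared_genes[pair] += 1
    return dict(shared_genes)


# Placeholder associations (not real data).
toy_associations = [
    ("gene_A", "disease_1"), ("gene_A", "disease_2"),
    ("gene_B", "disease_1"), ("gene_B", "disease_2"), ("gene_B", "disease_3"),
    ("gene_C", "disease_3"),
]
disease_network = project_onto_diseases(toy_associations)
```

The gene–gene projection is obtained in the same way by swapping the two elements of each pair before calling the function. In both cases every association counts equally, which is the "fixed input" limitation that a statistically informed network would aim to overcome.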
We hope that this comprehensive review will serve as a valuable resource for researchers who are interested in using GWAS summary statistics to investigate the genetic basis of complex traits and diseases, as well as for methodologists who develop and test relevant methods. We provided a detailed overview of the available tools and databases, and we hope that this work will facilitate informed tool selection and maximize the effectiveness of using GWAS summary statistics.
Availability of data and materials
The data collected in this study are available in Supplementary Material. Supplementary Table 1 contains the list with the identified tools along with the URLs, the references and the descriptions. Supplementary Table 2 contains the list with the additional datasets identified in various consortia.
References
Uffelmann E, Huang QQ, Munung NS, de Vries J, Okada Y, Martin AR, et al. Genome-wide association studies. Nature Reviews Methods Primers. 2021;1(1):59.
Abdellaoui A, Yengo L, Verweij KJH, Visscher PM. 15 years of GWAS discovery: Realizing the promise. Am J Hum Genet. 2023;110(2):179–94.
Ziegler A, Konig IR, Thompson JR. Biostatistical aspects of genome-wide association studies. Biom J. 2008;50(1):8–28.
Alsheikh AJ, Wollenhaupt S, King EA, Reeb J, Ghosh S, Stolzenburg LR, et al. The landscape of GWAS validation; systematic review identifying 309 validated non-coding variants across 130 human diseases. BMC Med Genomics. 2022;15(1):74.
Moore JH, Asselbergs FW, Williams SM. Bioinformatics challenges for genome-wide association studies. Bioinformatics. 2010;26(4):445–55.
Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008;4(8): e1000167.
Craig DW, Goor RM, Wang Z, Paschall J, Ostell J, Feolo M, et al. Assessing and managing risk when sharing aggregate genetic variant data. Nat Rev Genet. 2011;12(10):730–6.
Cai R, Hao Z, Winslett M, Xiao X, Yang Y, Zhang Z, et al. Deterministic identification of specific individuals from GWAS results. Bioinformatics. 2015;31(11):1701–7.
Thelwall M, Munafo M, Mas-Bleda A, Stuart E, Makita M, Weigert V, et al. Is useful research data usually shared? An investigation of genome-wide association study summary statistics. PLoS ONE. 2020;15(2): e0229578.
Reales G, Wallace C. Sharing GWAS summary statistics results in more citations. Commun Biol. 2023;6(1):116.
Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nat Rev Genet. 2017;18(2):117–27.
Gallagher MD, Chen-Plotkin AS. The Post-GWAS Era: From Association to Function. Am J Hum Genet. 2018;102(5):717–30.
Cano-Gamez E, Trynka G. From GWAS to Function: Using Functional Genomics to Identify the Mechanisms Underlying Complex Diseases. Front Genet. 2020;11:424.
Chimusa ER, Dalvie S, Dandara C, Wonkam A, Mazandu GK. Post genome-wide association analysis: dissecting computational pathway/network-based approaches. Brief Bioinform. 2019;20(2):690–700.
Ishigaki K. Beyond GWAS: from simple associations to functional insights. Semin Immunopathol. 2022;44(1):3–14.
Begum F, Ghosh D, Tseng GC, Feingold E. Comprehensive literature review and statistical considerations for GWAS meta-analysis. Nucleic Acids Res. 2012;40(9):3777–84.
Ioannidis JP, Rosenberg PS, Goedert JJ, O'Brien TR, International Meta-analysis of HIVHG. Commentary: meta-analysis of individual participants' data in genetic epidemiology. Am J Epidemiol. 2002;156(3):204–10.
Tang M, Wang T, Zhang X. A review of SNP heritability estimation methods. Brief Bioinform. 2022;23(3).
Zhu H, Zhou X. Statistical methods for SNP heritability estimation and partition: A review. Comput Struct Biotechnol J. 2020;18:1557–68.
Cinar O, Viechtbauer W. A Comparison of Methods for Gene-Based Testing That Account for Linkage Disequilibrium. Front Genet. 2022;13: 867724.
Mooney MA, Wilmot B. Gene set analysis: A step-by-step guide. Am J Med Genet B Neuropsychiatr Genet. 2015;168(7):517–27.
Schaid DJ, Chen W, Larson NB. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat Rev Genet. 2018;19(8):491–504.
van Rheenen W, Peyrot WJ, Schork AJ, Lee SH, Wray NR. Genetic correlations of polygenic disease traits: from theory to practice. Nat Rev Genet. 2019;20(10):567–81.
Zhang Y, Cheng Y, Jiang W, Ye Y, Lu Q, Zhao H. Comparison of methods for estimating genetic correlation between complex traits using GWAS summary statistics. Brief Bioinform. 2021;22(5).
Hackinger S, Zeggini E. Statistical methods to detect pleiotropy in human complex traits. Open Biol. 2017;7(11).
Boehm FJ, Zhou X. Statistical methods for Mendelian randomization in genome-wide association studies: A review. Comput Struct Biotechnol J. 2022;20:2338–51.
Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira AN, Knowles DA, Golan D, et al. Opportunities and challenges for transcriptome-wide association studies. Nat Genet. 2019;51(4):592–9.
Hukku A, Sampson MG, Luca F, Pique-Regi R, Wen X. Analyzing and reconciling colocalization and transcriptome-wide association studies from the perspective of inferential reproducibility. Am J Hum Genet. 2022;109(5):825–37.
MacArthur JAL, Buniello A, Harris LW, Hayhurst J, McMahon A, Sollis E, et al. Workshop proceedings: GWAS summary statistics standards and sharing. Cell Genom. 2021;1(1).
Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372: n71.
Hayhurst J, Buniello A, Harris L, Mosaku A, Chang C, Gignoux CR, et al. A community driven GWAS summary statistics standard. bioRxiv. 2023:2022.07.15.500230.
Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27(15):2156–8.
Lyon MS, Andrews SJ, Elsworth B, Gaunt TR, Hemani G, Marcora E. The variant call format provides efficient and robust storage of GWAS summary statistics. Genome Biol. 2021;22(1):32.
Elsworth B, Lyon M, Alexander T, Liu Y, Matthews P, Hallett J, et al. The MRC IEU OpenGWAS data infrastructure. bioRxiv. 2020:2020.08.10.244293.
van der Most PJ, Vaez A, Prins BP, Munoz ML, Snieder H, Alizadeh BZ, et al. QCGWAS: A flexible R package for automated quality control of genome-wide association results. Bioinformatics. 2014;30(8):1185–6.
Fuchsberger C, Taliun D, Pramstaller PP, Pattaro C. GWAtoolbox: an R package for fast quality control and handling of genome-wide association studies meta-analysis data. Bioinformatics. 2012;28(3):444–5.
Winkler TW, Day FR, Croteau-Chonka DC, Wood AR, Locke AE, Mägi R, et al. Quality control and conduct of genome-wide association meta-analyses. Nat Protoc. 2014;9(5):1192–212.
Chen GB, Lee SH, Robinson MR, Trzaskowski M, Zhu ZX, Winkler TW, et al. Across-cohort QC analyses of GWAS summary statistics from complex traits. Eur J Hum Genet. 2016;25(1):137–46.
Murphy AE, Schilder BM, Skene NG. MungeSumstats: a Bioconductor package for the standardization and quality control of many GWAS summary statistics. Bioinformatics. 2021;37(23):4593–6.
He Y, Koido M, Shimmori Y, Kamatani Y. GWASLab: a Python package for processing and visualizing GWAS summary statistics. 2023.
Matushyn M, Bose M, Mahmoud AA, Cuthbertson L, Tello C, Bircan KO, et al. SumStatsRehab: an efficient algorithm for GWAS summary statistics assessment and restoration. BMC Bioinformatics. 2022;23(1):443.
Ani A, van der Most PJ, Snieder H, Vaez A, Nolte IM. GWASinspector: comprehensive quality control of genome-wide association study results. Bioinformatics. 2021;37(1):129–30.
Awasthi S, Chen CY, Lam M, Huang H, Ripke S, Altar CA. GWAS quality score for evaluating associated regions in GWAS analyses. Bioinformatics. 2023;39(1).
Chen W, Wu Y, Zheng Z, Qi T, Visscher PM, Zhu Z, et al. Improved analyses of GWAS summary statistics by reducing data heterogeneity and errors. Nat Commun. 2021;12(1):7117.
Williams CM, Poore H, Tanksley PT, Kweon H, Courchesne-Krak NS, Londono-Correa D, et al. Guidelines for Evaluating the Comparability of Down-Sampled GWAS Summary Statistics. Behav Genet. 2023;53(5–6):404–15.
Baxevanis AD, Bateman A. The Importance of Biological Databases in Biological Discovery. Curr Protoc Bioinformatics. 2015;50:1–8.
Ison J, Rapacki K, Menager H, Kalas M, Rydza E, Chmura P, et al. Tools and data services registry: a community effort to document bioinformatics resources. Nucleic Acids Res. 2016;44(D1):D38-47.
Rigden DJ, Fernandez XM. The 27th annual Nucleic Acids Research database issue and molecular biology database collection. Nucleic Acids Res. 2020;48(D1):D1–8.
Zou D, Ma L, Yu J, Zhang Z. Biological databases for human research. Genomics Proteomics Bioinformatics. 2015;13(1):55–63.
Hassani-Pak K, Rawlings C. Knowledge Discovery in Biological Databases for Revealing Candidate Genes Linked to Complex Phenotypes. J Integr Bioinform. 2017;14(1).
Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, et al. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet. 2007;39(10):1181–6.
Buniello A, MacArthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47(D1):D1005–12.
Beck T, Rowlands T, Shorter T, Brookes AJ. GWAS Central: an expanding resource for finding and visualising genotype and phenotype data from genome-wide association studies. Nucleic Acids Res. 2023;51(D1):D986–93.
Canela-Xandri O, Rawlik K, Tenesa A. An atlas of genetic associations in UK Biobank. Nat Genet. 2018;50(11):1593–9.
McInnes G, Tanigawa Y, DeBoever C, Lavertu A, Olivieri JE, Aguirre M, et al. Global Biobank Engine: enabling genotype-phenotype browsing for biobank summary statistics. Bioinformatics. 2019;35(14):2495–7.
Consortium GT. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318–30.
Huang D, Feng X, Yang H, Wang J, Zhang W, Fan X, et al. QTLbase2: an enhanced catalog of human quantitative trait loci on extensive molecular phenotypes. Nucleic Acids Res. 2023;51(D1):D1122–8.
Dai Y, Hu R, Manuel AM, Liu A, Jia P, Zhao Z. CSEA-DB: an omnibus for human complex trait and cell type associations. Nucleic Acids Res. 2021;49(D1):D862–70.
Xue C, Jiang L, Zhou M, Long Q, Chen Y, Li X, et al. PCGA: a comprehensive web server for phenotype-cell-gene association analysis. Nucleic Acids Res. 2022;50(W1):W568–76.
Cao C, Wang J, Kwok D, Cui F, Zhang Z, Zhao D, et al. webTWAS: a resource for disease candidate susceptibility genes identified by transcriptome-wide association study. Nucleic Acids Res. 2022;50(D1):D1123–30.
Pan S, Kang H, Liu X, Li S, Yang P, Wu M, et al. COLOCdb: a comprehensive resource for multi-model colocalization of complex traits. Nucleic Acids Res. 2024;52(D1):D871–81.
Watanabe K, Stringer S, Frei O, Umicevic Mirkov M, de Leeuw C, Polderman TJC, et al. A global overview of pleiotropy and genetic architecture in complex traits. Nat Genet. 2019;51(9):1339–48.
Patron J, Serra-Cayuela A, Han B, Li C, Wishart DS. Assessing the performance of genome-wide association studies for predicting disease risk. PLoS ONE. 2019;14(12): e0220215.
Bastarache L, Denny JC, Roden DM. Phenome-Wide Association Studies. JAMA. 2022;327(1):75–6.
Verma A, Ritchie MD. Current Scope and Challenges in Phenome-Wide Association Studies. Curr Epidemiol Rep. 2017;4(4):321–9.
Wang L, Zhang X, Meng X, Koskeridis F, Georgiou A, Yu L, et al. Methodology in phenome-wide association studies: a systematic review. J Med Genet. 2021;58(11):720–8.
Kamat MA, Blackshaw JA, Young R, Surendran P, Burgess S, Danesh J, et al. PhenoScanner V2: an expanded tool for searching human genotype-phenotype associations. Bioinformatics. 2019;35(22):4851–3.
Denny JC, Bastarache L, Ritchie MD, Carroll RJ, Zink R, Mosley JD, et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat Biotechnol. 2013;31(12):1102–10.
Zheng J, Erzurumluoglu AM, Elsworth BL, Kemp JP, Howe L, Haycock PC, et al. LD Hub: a centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. Bioinformatics. 2017;33(2):272–9.
Li Y, Willer C, Sanna S, Abecasis G. Genotype imputation. Annu Rev Genomics Hum Genet. 2009;10:387–406.
Marchini J, Howie B. Genotype imputation for genome-wide association studies. Nat Rev Genet. 2010;11(7):499–511.
Naj AC. Genotype Imputation in Genome-Wide Association Studies. Curr Protoc Hum Genet. 2019;102(1): e84.
Dickhaus T, Stange J, Demirhan H. On an extended interpretation of linkage disequilibrium in genetic case-control association studies. Stat Appl Genet Mol Biol. 2015;14(5):497–505.
Kwan JS, Li MX, Deng JE, Sham PC. FAPI: Fast and accurate P-value Imputation for genome-wide association study. Eur J Hum Genet. 2016;24(5):761–6.
Pasaniuc B, Zaitlen N, Shi H, Bhatia G, Gusev A, Pickrell J, et al. Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics. 2014;30(20):2906–14.
Julienne H, Shi H, Pasaniuc B, Aschard H. RAISS: robust and accurate imputation from summary statistics. Bioinformatics. 2019;35(22):4837–9.
Lee D, Bigdeli TB, Williamson VS, Vladimirov VI, Riley BP, Fanous AH, et al. DISTMIX: direct imputation of summary statistics for unmeasured SNPs from mixed ethnicity cohorts. Bioinformatics. 2015;31(19):3099–104.
Rueger S, McDaid A, Kutalik Z. Evaluation and application of summary statistic imputation to discover new height-associated loci. PLoS Genet. 2018;14(5): e1007371.
Xu Z, Duan Q, Yan S, Chen W, Li M, Lange E, et al. DISSCO: direct imputation of summary statistics allowing covariates. Bioinformatics. 2015;31(15):2434–42.
Lee D, Bigdeli TB, Riley BP, Fanous AH, Bacanu SA. DIST: direct imputation of summary statistics for unmeasured SNPs. Bioinformatics. 2013;29(22):2925–7.
Togninalli M, Roqueiro D, Investigators CO, Borgwardt KM. Accurate and adaptive imputation of summary statistics in mixed-ethnicity cohorts. Bioinformatics. 2018;34(17):i687–96.
Park DS, Brown B, Eng C, Huntsman S, Hu D, Torgerson DG, et al. Adapt-Mix: learning local genetic correlation structure improves summary statistics-based analyses. Bioinformatics. 2015;31(12):i181–9.
Ren J, Lin Z, Pan W. Integrating GWAS summary statistics, individual-level genotypic and omic data to enhance the performance for large-scale trait imputation. Hum Mol Genet. 2023;32(17):2693–703.
Ren J, Lin Z, He R, Shen X, Pan W. Using GWAS summary data to impute traits for genotyped individuals. HGG Adv. 2023;4(3): 100197.
Yang Z, Paschou P, Drineas P. Reconstructing SNP allele and genotype frequencies from GWAS summary statistics. Sci Rep. 2022;12(1):8242.
Bagos PG, Nikolopoulos GK. A method for meta-analysis of case-control genetic association studies using logistic regression. Stat Appl Genet Mol Biol. 2007;6:Article17.
Bagos PG. A unification of multivariate methods for meta-analysis of genetic association studies. Stat Appl Genet Mol Biol. 2008;7(1):Article31.
Bagos PG. Genetic model selection in genome-wide association studies: robust methods and the use of meta-analysis. Stat Appl Genet Mol Biol. 2013;12(3):285–308.
Dimou NL, Tsirigos KD, Elofsson A, Bagos PG. GWAR: robust analysis and meta-analysis of genome-wide association studies. Bioinformatics. 2017;33(10):1521–7.
Di Pietrantonj C. Four-fold table cell frequencies imputation in meta analysis. Stat Med. 2006;25(13):2299–322.
Nolte IM. Metasubtract: an R-package to analytically produce leave-one-out meta-analysis GWAS summary statistics. Bioinformatics. 2020;36(16):4521–2.
Woolf B, Sallis HM, Munafò MR, Gill D. Deriving GWAS summary estimates for paternal smoking in UK biobank: a GWAS by subtraction. BMC Res Notes. 2023;16(1):159.
Niu YF, Ye C, He J, Han F, Guo LB, Zheng HF, et al. Reproduction and In-Depth Evaluation of Genome-Wide Association Studies and Genome-Wide Meta-analyses Using Summary Statistics. G3 (Bethesda). 2017;7(3):943–52.
Lloyd-Jones LR, Robinson MR, Yang J, Visscher PM. Transformation of Summary Statistics from Linear Mixed Model Association on All-or-None Traits to Odds Ratio. Genetics. 2018;208(4):1397–408.
Forero DA, Lopez-Leon S, González-Giraldo Y, Bagos PG. Ten simple rules for carrying out and writing meta-analyses. PLoS Comput Biol. 2019;15(5): e1006922.
Lin DY, Zeng D. Meta-analysis of genome-wide association studies: no efficiency gain in using individual participant data. Genet Epidemiol. 2010;34(1):60–6.
Riley RD, Lambert PC, Staessen JA, Wang J, Gueyffier F, Thijs L, et al. Meta-analysis of continuous outcomes combining individual patient data and aggregate data. Stat Med. 2008;27(11):1870–93.
Dai M, Ming J, Cai M, Liu J, Yang C, Wan X, et al. IGESS: a statistical approach to integrating individual-level genotype data and summary statistics in genome-wide association studies. Bioinformatics. 2017;33(18):2882–9.
Fu S, Deng L, Zhang H, Qin J, Yu K. Integrative analysis of individual-level data and high-dimensional summary statistics. Bioinformatics. 2023;39(4).
Dai M, Wan X, Peng H, Wang Y, Liu Y, Liu J, et al. Joint analysis of individual-level and summary-level GWAS data by leveraging pleiotropy. Bioinformatics. 2019;35(10):1729–36.
Fu S, Purdue MP, Zhang H, Qin J, Song L, Berndt SI, et al. Improve the model of disease subtype heterogeneity by leveraging external summary data. PLoS Comput Biol. 2023;19(7): e1011236.
Evangelou E, Ioannidis JP. Meta-analysis methods for genome-wide association studies and beyond. Nat Rev Genet. 2013;14(6):379–89.
Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010;26(17):2190–1.
Mägi R, Morris AP. GWAMA: software for genome-wide association meta-analysis. BMC Bioinformatics. 2010;11:288.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.
Meesters C, Leber M, Herold C, Angisch M, Mattheisen M, Drichel D, et al. Quick, “imputation-free” meta-analysis with proxy-SNPs. BMC Bioinformatics. 2012;13:231.
Jiang Y, Chen S, McGuire D, Chen F, Liu M, Iacono WG, et al. Proper conditional analysis in the presence of missing data: Application to large scale meta-analysis of tobacco use phenotypes. PLoS Genet. 2018;14(7): e1007452.
Jiang W, Yu W. Jointly determining significance levels of primary and replication studies by controlling the false discovery rate in two-stage genome-wide association studies. Stat Methods Med Res. 2018;27(9):2795–808.
Jiang W, Yu W. Controlling the joint local false discovery rate is more powerful than meta-analysis methods in joint analysis of summary statistics from multiple genome-wide association studies. Bioinformatics. 2017;33(4):500–7.
Jiang W, Xue JH, Yu W. What is the probability of replicating a statistically significant association in genome-wide association studies? Brief Bioinform. 2017;18(6):928–39.
Xie Y, Zhai S, Jiang W, Zhao H, Mehrotra DV, Shen J. Statistical assessment of biomarker replicability using MAJAR method. Stat Methods Med Res. 2023;32(10):1961–72.
de Vlaming R, Okbay A, Rietveld CA, Johannesson M, Magnusson PK, Uitterlinden AG, et al. Meta-GWAS Accuracy and Power (MetaGAP) Calculator Shows that Hiding Heritability Is Partially Due to Imperfect Genetic Correlations across Studies. PLoS Genet. 2017;13(1): e1006495.
Province MA, Borecki IB. A correlated meta-analysis strategy for data mining "OMIC" scans. Pac Symp Biocomput. 2013:236–46.
Segrè AV, Groop L, Mootha VK, Daly MJ, Altshuler D. Common inherited variation in mitochondrial genes is not enriched for associations with type 2 diabetes or related glycemic traits. PLoS Genet. 2010;6(8).
Sun J, Lyu R, Deng L, Li Q, Zhao Y, Zhang Y. SMetABF: A rapid algorithm for Bayesian GWAS meta-analysis with a large number of studies included. PLoS Comput Biol. 2022;18(3): e1009948.
Trochet H, Pirinen M, Band G, Jostins L, McVean G, Spencer CCA. Bayesian meta-analysis across genome-wide association studies of diverse phenotypes. Genet Epidemiol. 2019;43(5):532–47.
Baselmans BML, Jansen R, Ip HF, van Dongen J, Abdellaoui A, van de Weijer MP, et al. Multivariate genome-wide analyses of the well-being spectrum. Nat Genet. 2019;51(3):445–51.
Cichonska A, Rousu J, Marttinen P, Kangas AJ, Soininen P, Lehtimäki T, et al. metaCCA: summary statistics-based multivariate meta-analysis of genome-wide association studies using canonical correlation analysis. Bioinformatics. 2016;32(13):1981–9.
Zhu X, Feng T, Tayo BO, Liang J, Young JH, Franceschini N, et al. Meta-analysis of correlated traits via summary statistics from GWASs with an application in hypertension. Am J Hum Genet. 2015;96(1):21–36.
Ray D, Boehnke M. Methods for meta-analysis of multiple traits using GWAS summary statistics. Genet Epidemiol. 2018;42(2):134–45.
Baghfalaki T, Sugier PE, Truong T, Pettitt AN, Mengersen K, Liquet B. Bayesian meta-analysis models for cross cancer genomic investigation of pleiotropic effects using group structure. Stat Med. 2021;40(6):1498–518.
John M, Lencz T, Malhotra AK, Correll CU, Zhang JP. A simulations approach for meta-analysis of genetic association studies based on additive genetic model. Meta Gene. 2018;16:143–64.
Nasirigerdeh R, Torkzadehmahani R, Matschinske J, Frisch T, List M, Späth J, et al. sPLINK: a hybrid federated tool as a robust alternative to meta-analysis in genome-wide association studies. Genome Biol. 2022;23(1):32.
Coram MA, Candille SI, Duan Q, Chan KH, Li Y, Kooperberg C, et al. Leveraging Multi-ethnic Evidence for Mapping Complex Traits in Minority Populations: An Empirical Bayes Approach. Am J Hum Genet. 2015;96(5):740–52.
Tenesa A, Haley CS. The heritability of human disease: estimation, uses and abuses. Nat Rev Genet. 2013;14(2):139–49.
Visscher PM, Hill WG, Wray NR. Heritability in the genomics era–concepts and misconceptions. Nat Rev Genet. 2008;9(4):255–66.
Barry CS, Walker VM, Cheesman R, Davey Smith G, Morris TT, Davies NM. How to estimate heritability: a guide for genetic epidemiologists. Int J Epidemiol. 2023;52(2):624–32.
Zaitlen N, Kraft P. Heritability in the genome-wide association era. Hum Genet. 2012;131(10):1655–64.
So HC, Gui AH, Cherny SS, Sham PC. Evaluating the heritability explained by known susceptibility variants: a survey of ten complex diseases. Genet Epidemiol. 2011;35(5):310–7.
So HC, Li M, Sham PC. Uncovering the total heritability explained by all true susceptibility variants in a genome-wide association study. Genet Epidemiol. 2011;35(6):447–56.
Palla L, Dudbridge F. A Fast Method that Uses Polygenic Scores to Estimate the Variance Explained by Genome-wide Marker Panels and the Proportion of Variants Affecting a Trait. Am J Hum Genet. 2015;97(2):250–9.
Shi H, Kichaev G, Pasaniuc B. Contrasting the Genetic Architecture of 30 Complex Traits from Summary Association Data. Am J Hum Genet. 2016;99(1):139–53.
Bulik-Sullivan BK, Loh PR, Finucane HK, Ripke S, Yang J, Patterson N, et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47(3):291–5.
Song S, Jiang W, Zhang Y, Hou L, Zhao H. Leveraging LD eigenvalue regression to improve the estimation of SNP heritability and confounding inflation. Am J Hum Genet. 2022;109(5):802–11.
Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh PR, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet. 2015;47(11):1228–35.
Speed D, Balding DJ. SumHer better estimates the SNP heritability of complex traits from summary statistics. Nat Genet. 2019;51(2):277–84.
Li H, Mazumder R, Lin X. Accurate and efficient estimation of local heritability using summary statistics and the linkage disequilibrium matrix. Nat Commun. 2023;14(1):7954.
Laville V, Bentley AR, Privé F, Zhu X, Gauderman J, Winkler TW, et al. VarExp: estimating variance explained by genome-wide GxE summary statistics. Bioinformatics. 2018;34(19):3412–4.
Shin J, Lee SH. GxEsum: a novel approach to estimate the phenotypic variance explained by genome-wide GxE interaction based on GWAS summary statistics for biobank-scale data. Genome Biol. 2021;22(1):183.
Song L, Liu A, Shi J. SummaryAUC: a tool for evaluating the performance of polygenic risk prediction models in validation datasets with only summary level statistics. Bioinformatics. 2019;35(20):4038–44.
Chan TF, Rui X, Conti DV, Fornage M, Graff M, Haessler J, et al. Estimating heritability explained by local ancestry and evaluating stratification bias in admixture mapping from summary statistics. Am J Hum Genet. 2023;110(11):1853–62.
Zhang Y, Qi G, Park JH, Chatterjee N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nat Genet. 2018;50(9):1318–26.
López-Cortegano E, Caballero A. GWEHS: A Genome-Wide Effect Sizes and Heritability Screener. Genes (Basel). 2019;10(8).
O’Connor LJ. The distribution of common-variant effect sizes. Nat Genet. 2021;53(8):1243–9.
Holland D, Frei O, Desikan R, Fan CC, Shadrin AA, Smeland OB, et al. Beyond SNP heritability: Polygenicity and discoverability of phenotypes estimated with a univariate Gaussian mixture model. PLoS Genet. 2020;16(5): e1008612.
Yao DW, O’Connor LJ, Price AL, Gusev A. Quantifying genetic effects on disease mediated by assayed gene expression levels. Nat Genet. 2020;52(6):626–33.
Siewert-Rocks KM, Kim SS, Yao DW, Shi H, Price AL. Leveraging gene co-regulation to identify gene sets enriched for disease heritability. Am J Hum Genet. 2022;109(3):393–404.
Neale BM, Sham PC. The future of association studies: gene-based analysis and replication. Am J Hum Genet. 2004;75(3):353–62.
Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–21.
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.
Chapman J, Whittaker J. Analysis of multiple SNPs in a candidate gene or region. Genet Epidemiol. 2008;32(6):560–6.
Lee D, Williamson VS, Bigdeli TB, Riley BP, Fanous AH, Vladimirov VI, et al. JEPEG: a summary statistics based tool for gene-level joint testing of functional variants. Bioinformatics. 2015;31(8):1176–82.
Yang J, Ferreira T, Morris AP, Medland SE, Madden PA, Heath AC, et al. Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet. 2012;44(4):369–75, s1–3.
Li M, Jiang L, Mak TSH, Kwan JSH, Xue C, Chen P, et al. A powerful conditional gene-based association approach implicated functionally important genes for schizophrenia. Bioinformatics. 2019;35(4):628–35.
Li MX, Gui HS, Kwan JS, Sham PC. GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am J Hum Genet. 2011;88(3):283–93.
Bakshi A, Zhu Z, Vinkhuyzen AA, Hill WD, McRae AF, Visscher PM, et al. Fast set-based association analysis using summary data from GWAS identifies novel gene loci for human complex traits. Sci Rep. 2016;6:32894.
de Leeuw CA, Mooij JM, Heskes T, Posthuma D. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput Biol. 2015;11(4): e1004219.
Yang A, Chen J, Zhao XM. nMAGMA: a network-enhanced method for inferring risk genes from GWAS summary statistics and its application to schizophrenia. Brief Bioinform. 2021;22(4).
Sey NYA, Pratt BM, Won H. Annotating genetic variants to target genes using H-MAGMA. Nat Protoc. 2023;18(1):22–35.
Gerring ZF, Mina-Vargas A, Gamazon ER, Derks EM. E-MAGMA: an eQTL-informed method to identify risk genes using genome-wide association study summary statistics. Bioinformatics. 2021;37(16):2245–9.
Wang R, Lin DY, Jiang Y. EPIC: Inferring relevant cell types for complex traits by integrating genome-wide association studies and single-cell RNA sequencing. PLoS Genet. 2022;18(6): e1010251.
Quick C, Wen X, Abecasis G, Boehnke M, Kang HM. Integrating comprehensive functional annotations to boost power and accuracy in gene-based association analysis. PLoS Genet. 2020;16(12): e1009060.
Yurko R, Roeder K, Devlin B, G'Sell M. An approach to gene-based testing accounting for dependence of tests among nearby genes. Brief Bioinform. 2021;22(6).
Vsevolozhskaya OA, Shi M, Hu F, Zaykin DV. DOT: Gene-set analysis by combining decorrelated association statistics. PLoS Comput Biol. 2020;16(4): e1007819.
Zhang J, Zhao Z, Guo X, Guo B, Wu B. Powerful statistical method to detect disease-associated genes using publicly available genome-wide association studies summary data. Genet Epidemiol. 2019;43(8):941–51.
Chen X, Zhang H, Liu M, Deng HW, Wu Z. Simultaneous detection of novel genes and SNPs by adaptive p-value combination. Front Genet. 2022;13:1009428.
Zhang J, Guo X, Gonzales S, Yang J, Wang X. TS: a powerful truncated test to detect novel disease associated genes using publicly available gWAS summary data. BMC Bioinformatics. 2020;21(1):172.
Kwak IY, Pan W. Gene- and pathway-based association tests for multiple traits with GWAS summary statistics. Bioinformatics. 2017;33(1):64–71.
Guo B, Wu B. Statistical methods to detect novel genetic variants using publicly available GWAS summary data. Comput Biol Chem. 2018;74:76–9.
Wang M, Huang J, Liu Y, Ma L, Potash JB, Han S. COMBAT: A Combined Association Test for Genes Using Summary Statistics. Genetics. 2017;207(3):883–91.
Shao Z, Wang T, Qiao J, Zhang Y, Huang S, Zeng P. A comprehensive comparison of multilocus association methods with summary statistics in genome-wide association studies. BMC Bioinformatics. 2022;23(1):359.
Zhang J, Liang X, Gonzales S, Liu J, Gao XR, Wang X. A gene based combination test using GWAS summary data. BMC Bioinformatics. 2023;24(1):2.
He Z, Xu B, Lee S, Ionita-Laza I. Unified Sequence-Based Association Tests Allowing for Multiple Functional Annotations and Meta-analysis of Noncoding Variation in Metabochip Data. Am J Hum Genet. 2017;101(3):340–52.
Liu Y, Chen S, Li Z, Morrison AC, Boerwinkle E, Lin X. ACAT: A Fast and Powerful p Value Combination Method for Rare-Variant Analysis in Sequencing Studies. Am J Hum Genet. 2019;104(3):410–21.
Li MX, Kwan JS, Sham PC. HYST: a hybrid set-based test for genome-wide association studies, with application to protein-protein interaction-based association analysis. Am J Hum Genet. 2012;91(3):478–88.
Sun R, Lin X. Genetic Variant Set-Based Tests Using the Generalized Berk-Jones Statistic with Application to a Genome-Wide Association Study of Breast Cancer. J Am Stat Assoc. 2020;115(531):1079–91.
Berrandou TE, Balding D, Speed D. LDAK-GBAT: Fast and powerful gene-based association testing using summary statistics. Am J Hum Genet. 2023;110(1):23–9.
Mei H, Li L, Jiang F, Simino J, Griswold M, Mosley T, et al. snpGeneSets: An R Package for Genome-Wide Study Annotation. G3 (Bethesda). 2016;6(12):4087–95.
Krefl D, Brandulas Cammarata A, Bergmann S. PascalX: a Python library for GWAS gene and pathway enrichment tests. Bioinformatics. 2023;39(5).
Lamparter D, Marbach D, Rueedi R, Kutalik Z, Bergmann S. Fast and Rigorous Computation of Gene and Pathway Scores from SNP-Based Summary Statistics. PLoS Comput Biol. 2016;12(1): e1004714.
Nameki R, Shetty A, Dareng E, Tyrer J, Lin X, Pharoah P, et al. chromMAGMA: regulatory element-centric interrogation of risk variants. Life Sci Alliance. 2022;5(10).
Watanabe K, Taskesen E, van Bochoven A, Posthuma D. Functional mapping and annotation of genetic associations with FUMA. Nat Commun. 2017;8(1):1826.
Yang Y, Basu S, Zhang L. A Bayesian hierarchically structured prior for gene-based association testing with multiple traits in genome-wide association studies. Genet Epidemiol. 2022;46(1):63–72.
Wang K, Li M, Bucan M. Pathway-based approaches for analysis of genomewide association studies. Am J Hum Genet. 2007;81(6):1278–83.
Mooney MA, Nigg JT, McWeeney SK, Wilmot B. Functional and genomic context in pathway analysis of GWAS data. Trends Genet. 2014;30(9):390–400.
Pers TH. Gene set analysis for interpreting genetic studies. Hum Mol Genet. 2016;25(R2):R133–40.
Wang L, Jia P, Wolfinger RD, Chen X, Zhao Z. Gene set analysis of genome-wide association studies: methodological issues and perspectives. Genomics. 2011;98(1):1–8.
Zhang K, Cui S, Chang S, Zhang L, Wang J. i-GSEA4GWAS: a web server for identification of pathways/gene sets associated with traits by applying an improved gene set enrichment analysis to genome-wide association study. Nucleic Acids Res. 2010;38(Web Server issue):W90–5.
Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44(W1):W90–7.
Kolberg L, Raudvere U, Kuzmin I, Adler P, Vilo J, Peterson H. g:Profiler-interoperable web service for functional enrichment analysis and gene identifier mapping (2023 update). Nucleic Acids Res. 2023;51(W1):W207–12.
Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 2022;50(W1):W216–21.
Liao Y, Wang J, Jaehnig EJ, Shi Z, Zhang B. WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 2019;47(W1):W199–205.
Mi H, Ebert D, Muruganujan A, Mills C, Albou LP, Mushayamaha T, et al. PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Res. 2021;49(D1):D394–403.
Yoon S, Nguyen HCT, Yoo YJ, Kim J, Baik B, Kim S, et al. Efficient pathway enrichment and network analysis of GWAS summary data using GSA-SNP2. Nucleic Acids Res. 2018;46(10): e60.
Wu C, Pan W. Integrating eQTL data with GWAS summary statistics in pathway-based analysis with application to schizophrenia. Genet Epidemiol. 2018;42(3):303–16.
Zhu S, Qian T, Hoshida Y, Shen Y, Yu J, Hao K. GIGSEA: genotype imputed gene set enrichment analysis using GWAS summary level data. Bioinformatics. 2019;35(1):160–3.
Pei G, Dai Y, Zhao Z, Jia P. deTS: tissue-specific enrichment analysis to decode tissue specificity. Bioinformatics. 2019;35(19):3842–5.
Jia P, Zheng S, Long J, Zheng W, Zhao Z. dmGWAS: dense module searching for genome-wide association studies in protein-protein interaction networks. Bioinformatics. 2011;27(1):95–102.
Cochran AL, Nieser KJ, Forger DB, Zöllner S, McInnis MG. Gene-set Enrichment with Mathematical Biology (GEMB). Gigascience. 2020;9(10).
Cabrera CP, Navarro P, Huffman JE, Wright AF, Hayward C, Campbell H, et al. Uncovering networks from genome-wide association studies via circular genomic permutation. G3 (Bethesda). 2012;2(9):1067–75.
Shim JE, Bang C, Yang S, Lee T, Hwang S, Kim CY, et al. GWAB: a web server for the network-based boosting of human genome-wide association data. Nucleic Acids Res. 2017;45(W1):W154–61.
Hoppmann AS, Schlosser P, Backofen R, Lausch E, Köttgen A. GenToS: Use of Orthologous Gene Information to Prioritize Signals from Human GWAS. PLoS ONE. 2016;11(9): e0162466.
Wen Y, Wang W, Guo X, Zhang F. PAPA: a flexible tool for identifying pleiotropic pathways using genome-wide association study summaries. Bioinformatics. 2016;32(6):946–8.
Amlie-Wolf A, Tang M, Mlynarski EE, Kuksa PP, Valladares O, Katanic Z, et al. INFERNO: inferring the molecular mechanisms of noncoding genetic variants. Nucleic Acids Res. 2018;46(17):8740–53.
Ding J, Blencowe M, Nghiem T, Ha SM, Chen YW, Li G, et al. Mergeomics 2.0: a web server for multi-omics data integration to elucidate disease networks and predict therapeutics. Nucleic Acids Res. 2021;49(W1):W375–87.
Wang QS, Huang H. Methods for statistical fine-mapping and their applications to auto-immune diseases. Semin Immunopathol. 2022;44(1):101–13.
Hutchinson A, Asimit J, Wallace C. Fine-mapping genetic associations. Hum Mol Genet. 2020;29(R1):R81–8.
Kichaev G, Roytman M, Johnson R, Eskin E, Lindström S, Kraft P, et al. Improved methods for multi-trait fine mapping of pleiotropic risk loci. Bioinformatics. 2017;33(2):248–55.
Wen X, Lee Y, Luca F, Pique-Regi R. Efficient Integrative Multi-SNP Association Analysis via Deterministic Approximation of Posteriors. Am J Hum Genet. 2016;98(6):1114–29.
Pickrell JK. Joint analysis of functional genomic data and genome-wide association studies of 18 human traits. Am J Hum Genet. 2014;94(4):559–73.
Benner C, Spencer CC, Havulinna AS, Salomaa V, Ripatti S, Pirinen M. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics. 2016;32(10):1493–501.
Hernández N, Soenksen J, Newcombe P, Sandhu M, Barroso I, Wallace C, et al. The flashfm approach for fine-mapping multiple quantitative traits. Nat Commun. 2021;12(1):6147.
Karhunen V, Launonen I, Järvelin MR, Sebert S, Sillanpää MJ. Genetic fine-mapping from summary data using a nonlocal prior improves the detection of multiple causal variants. Bioinformatics. 2023;39(7).
Yang Z, Wang C, Liu L, Khan A, Lee A, Vardarajan B, et al. CARMA is a new Bayesian model for fine-mapping in genome-wide association meta-analyses. Nat Genet. 2023;55(6):1057–65.
Chen W, Larrabee BR, Ovsyannikova IG, Kennedy RB, Haralambieva IH, Poland GA, et al. Fine Mapping Causal Variants with an Approximate Bayesian Method Using Marginal Test Statistics. Genetics. 2015;200(3):719–36.
LaPierre N, Taraszka K, Huang H, He R, Hormozdiari F, Eskin E. Identifying causal variants by fine mapping across multiple studies. PLoS Genet. 2021;17(9): e1009733.
Cai M, Wang Z, Xiao J, Hu X, Chen G, Yang C. XMAP: Cross-population fine-mapping by leveraging genetic diversity and accounting for confounding bias. Nat Commun. 2023;14(1):6870.
Ghosal S, Schatz MC, Venkataraman A. BEATRICE: Bayesian Fine-mapping from Summary Data using Deep Variational Inference. bioRxiv. 2023.
Li Y, Kellis M. Joint Bayesian inference of risk variants and tissue-specific epigenomic enrichments across multiple complex human diseases. Nucleic Acids Res. 2016;44(18): e144.
Weissbrod O, Hormozdiari F, Benner C, Cui R, Ulirsch J, Gazal S, et al. Functionally informed fine-mapping and polygenic localization of complex trait heritability. Nat Genet. 2020;52(12):1355–63.
Zou Y, Carbonetto P, Wang G, Stephens M. Fine-mapping from summary data with the “Sum of Single Effects” model. PLoS Genet. 2022;18(7): e1010299.
Chen S, Nunez S, Reilly MP, Foulkes AS. Bayesian variable selection for post-analytic interrogation of susceptibility loci. Biometrics. 2017;73(2):603–14.
Newcombe PJ, Conti DV, Richardson S. JAM: A Scalable Bayesian Framework for Joint Analysis of Marginal SNP Effects. Genet Epidemiol. 2016;40(3):188–201.
Ning Z, Lee Y, Joshi PK, Wilson JF, Pawitan Y, Shen X. A Selection Operator for Summary Association Statistics Reveals Allelic Heterogeneity of Complex Traits. Am J Hum Genet. 2017;101(6):903–12.
Fisher V, Sebastiani P, Cupples LA, Liu CT. ANNORE: genetic fine-mapping with functional annotation. Hum Mol Genet. 2021;31(1):32–40.
Zhang W, Li SY, Liu T, Li Y. Partitioning gene-based variance of complex traits by gene score regression. PLoS ONE. 2020;15(8): e0237657.
Zhu X, Stephens M. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Ann Appl Stat. 2017;11(3):1561–92.
Deng Y, Pan W. Significance Testing for Allelic Heterogeneity. Genetics. 2018;210(1):25–32.
Taylor KE, Ansel KM, Marson A, Criswell LA, Farh KK. PICS2: next-generation fine mapping via probabilistic identification of causal SNPs. Bioinformatics. 2021;37(18):3004–7.
Schilder BM, Humphrey J, Raj T. echolocatoR: an automated end-to-end statistical and functional genomic fine-mapping pipeline. Bioinformatics. 2022;38(2):536–9.
Tyler AL, Crawford DC, Pendergrass SA. The detection and characterization of pleiotropy: discovery, progress, and promise. Brief Bioinform. 2016;17(1):13–22.
Wu P, Wang B, Lubitz SA, Benjamin EJ, Meigs JB, Dupuis J. Approximate conditional phenotype analysis based on genome wide association summary statistics. Sci Rep. 2021;11(1):2518.
Conneely KN, Boehnke M. So many correlated tests, so little time! Rapid adjustment of P values for multiple correlated tests. Am J Hum Genet. 2007;81(6):1158–68.
Taraszka K, Zaitlen N, Eskin E. Leveraging pleiotropy for joint analysis of genome-wide association studies with per trait interpretations. PLoS Genet. 2022;18(11): e1010447.
Deng Y, Pan W. Testing Genetic Pleiotropy with GWAS Summary Statistics for Marginal and Conditional Analyses. Genetics. 2017;207(4):1285–99.
Ray D, Pankow JS, Basu S. USAT: A Unified Score-Based Association Test for Multiple Phenotype-Genotype Analysis. Genet Epidemiol. 2016;40(1):20–34.
Sitlani CM, Baldassari AR, Highland HM, Hodonsky CJ, McKnight B, Avery CL. Comparison of adaptive multiple phenotype association tests using summary statistics in genome-wide association studies. Hum Mol Genet. 2021;30(15):1371–83.
Guo B, Wu B. Integrate multiple traits to detect novel trait-gene association using GWAS summary data with an adaptive test approach. Bioinformatics. 2019;35(13):2251–7.
Turchin MC, Stephens M. Bayesian multivariate reanalysis of large genetic studies identifies many new associations. PLoS Genet. 2019;15(10): e1008431.
Bu D, Wang X, Li Q. Summary statistics-based association test for identifying the pleiotropic effects with set of genetic variants. Bioinformatics. 2023;39(4).
Deng Q, Song C, Lin S. An adaptive and robust method for multi-trait analysis of genome-wide association studies using summary statistics. Eur J Hum Genet. 2023.
Liu W, Xu Y, Wang A, Huang T, Liu Z. The eigen higher criticism and eigen Berk-Jones tests for multiple trait association studies based on GWAS summary statistics. Genet Epidemiol. 2022;46(2):89–104.
Svishcheva GR, Tiys ES, Elgaeva EE, Feoktistova SG, Timmers P, Sharapov SZ, et al. A Novel Framework for Analysis of the Shared Genetic Background of Correlated Traits. Genes (Basel). 2022;13(10).
Qi G, Chatterjee N. Heritability informed power optimization (HIPO) leads to enhanced detection of genetic associations across multiple traits. PLoS Genet. 2018;14(10): e1007549.
Jordan DM, Verbanck M, Do R. HOPS: a quantitative score reveals pervasive horizontal pleiotropy in human genetic variation is driven by extreme polygenicity of human traits and diseases. Genome Biol. 2019;20(1):222.
Ballard JL, O’Connor LJ. Shared components of heritability across genetically correlated traits. Am J Hum Genet. 2022;109(6):989–1006.
Turley P, Walters RK, Maghzian O, Okbay A, Lee JJ, Fontana MA, et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat Genet. 2018;50(2):229–37.
Lee CH, Shi H, Pasaniuc B, Eskin E, Han B. PLEIO: a method to map and interpret pleiotropic loci with GWAS summary statistics. Am J Hum Genet. 2021;108(1):36–48.
Guo B, Wu B. Powerful and efficient SNP-set association tests across multiple phenotypes using GWAS summary data. Bioinformatics. 2019;35(8):1366–72.