Software article | Open | Open Peer Review | Published:
Synthesis-View: visualization and interpretation of SNP association results for multi-cohort, multi-phenotype data and meta-analysis
BioData Miningvolume 3, Article number: 10 (2010)
Initial genome-wide association study (GWAS) discoveries are being further explored through the use of large cohorts across multiple and diverse populations involving meta-analyses within large consortia and networks. Many of the additional studies characterize less than 100 single nucleotide polymorphisms (SNPs), often include multiple and correlated phenotypic measurements, and can include data from multiple-sites, multiple-studies, as well as multiple race/ethnicities. New approaches for visualizing resultant data are necessary in order to fully interpret results and obtain a broad view of the trends between DNA variation and phenotypes, as well as provide information on specific SNP and phenotype relationships.
The Synthesis-View software tool was designed to visually synthesize the results of the aforementioned types of studies. Presented herein are multiple examples of the ways Synthesis-View can be used to report results from association studies of DNA variation and phenotypes, including the visual integration of p-values or other metrics of significance, allele frequencies, sample sizes, effect size, and direction of effect.
To truly allow a user to visually integrate multiple pieces of information typical of a genetic association study, innovative views are needed to integrate multiple pieces of information. As a result, we have created "Synthesis-View" software for the visualization of genotype-phenotype association data in multiple cohorts. Synthesis-View is freely available for non-commercial research institutions, for full details see https://chgr.mc.vanderbilt.edu/synthesisview.
Significant GWAS findings are being further investigated for replication and characterization, both in the populations in which the initial GWAS findings were discovered (such as European-Americans) as well as in new cohorts and populations. To increase power, meta-analysis is often used to combine results from multiple research sites. Multiple independent and correlated phenotypic measurements may be included in these analyses, such as measurements of cardiovascular disease and related biomarkers (lipids, inflammation, etc). Many of these studies characterize less than 100 SNPs. Visualization of these data an integral part of interpreting as well as sharing the complex and multi-layered results of these follow-up studies. The software "Synthesis-View" has been developed to visually synthesize multiple pieces of information of interest from these studies with the flexibility to perform multiple types of data comparisons.
Synthesis-View was extended from the previous software "LD-Plus" which also uses a flexible data display format of multiple data "tracks" that can be viewed . Within Synthesis-View, through the use of stacked data-tracks, information on SNP genomic locations, presence of the SNP in a specific study or analysis, as well as related information such as genetic effect size and summary phenotype information, is plotted according to user preference. Through these data visualizations, rapid comparisons of multiple forms of information are possible, not easily achievable through reviewing results in tabular form alone.
Synthesis-View was developed in Ruby and utilizes the RMagick graphics library, and the software tool can be used within a web-interface or at the command line. Both the web interface (Figure 1) and command line version offer multiple options for output plots (Table 1). Manuals, example files for producing test plots in Synthesis-View, and source code, are all available at the Synthesis-View website https://chgr.mc.vanderbilt.edu/synthesisview/create/setparams.
One file is necessary to produce a standard Synthesis-View plot. This file contains a column for SNP identification (such as rs number), a column with the corresponding chromosome for each SNP, and a column for SNP genomic location. The rest of the optional information will result in data tracks plotted if data are present, and can include p-values, odds ratios, allele frequencies, and sample size. If additional files are supplied, additional tracks are created including phenotypic summary information for continuous phenotypes, gene summary information, and linkage disequilibrium (LD) data plotted in Haploview style  format as D' or r2.
Availability and requirements
Project name: Synthesis-View
Project home page: http://chgr.mc.vanderbilt.edu/ritchielab/synthesisview
Operating systems(s): Linux, Mac OS X, Windows
Programming language: Ruby
Other requirements: RMagick
License: GNU General Public License
Any restrictions to use by non-academics:
The use of Synthesis-View is restricted to academic and non-profit users
Results and Discussion
Example Standard Output Plot
Figure 2 shows an example of Synthesis-View visualization for analysis across populations. For Figure 2 two input files were used (containing simulated data for example purposes): 1) containing the SNP association data and 2) containing phenotypic summary information. The various "tracks" of data are described, starting from the top to bottom:
Physical genome track. Synthesis-View provides information on the relative location of the SNP on a given chromosome and how that position relates to other SNPs in the same study. Lines lead from the chromosome locations to the IDs of each SNP. If the "Additional SNP locations" option is selected, the location of the SNPs within the chromosomal region are indicated. If SNPs are close together, the location of the first SNP in that group is indicated in the plot (to prevent text overlap). When the plot is first generated, an image of the plot is shown within the web browser that includes embedded links. If the results for a SNP are selected within this image, the NCBI SNP database page for that specific SNP is opened in the default web browser of the user.
(2) SNP presence/absence track. Not all SNPs may be available for all associations across study groups or populations. Thus, this track provides information on whether a SNP was used in the test of associations through the presence/absence of a colored box corresponding to the group, study, or phenotype.
(3) Significance track. The resultant p-values are plotted as the negative log10 of the p-value. Grey vertical lines behind the points create an "abacus-view" that allows for the eye to follow from the SNP presence/absence track down through the lower data tracks. An optional red horizontal line can be applied at a significance level of choice (in Figure 2 the p-value line is set at 0.01). Figure 2 shows how the p-value for each SNP association can be compared across populations (in this case, across race/ethnicities), and triangles indicate direction of the genetic effect (bottom of the triangle is the location of the p-value). For example, if the genetic effect is measured as beta, the triangles point up if beta is positive and point down if beta is negative. The direction enables the investigator(s) to quickly determine if direction of effect is consistent across groups, studies, or phenotypes.
Effect size track. The resultant effect size values (beta values here) are plotted. This track allows you to view the similar effect sizes across race/ethnicity in Figure 2. To omit the effect size plot, omit effect size information from the input file.
Coded allele frequency track. The coded allele frequency (CAF), the allele chosen by the user to compare across groups or studies, is optionally plotted so trends and differences in the data can be observed.
Sample size track. Optional plotting of the sample size for each genotype-phenotype association for each group/study/phenotype is available so the relationships between sample size and other results of the study can be explored. To plot without sample size, omit sample size information columns from the input file. If sample size is provided only as entire group summary information, rather than for individual SNP/phenotype regressions, a separate box will appear at the bottom of the plot with this summary information graphically represented.
Phenotype summary plot. Summary information for a single phenotype across several groups is plotted if a separate file of phenotype summary information is included. This is currently a feature for quantitative traits/continuous data. Future versions will incorporate methods to characterize categorical/case-control phenotype summary information.
Other Options for Standard Output Plot
When all SNPs are available for all tests of association, a legend appears indicating which colors correspond to which group, study, or phenotype. Figure 3 shows an example of this feature, as well as showing an example of using Synthesis-View to plot the results for multiple phenotypes (from Jeff et al. 2010, in preparation). Alternate colors for the data points can be specified by adding an additional column with the header of 'Color', along with the color selected for each group, in the phenotype summary file.
Comparisons can also be made across study stages, such as Figure 4 where the results of Table three, from Willer et al. 2008  are plotted, allowing for the visual comparison between individual stages of a study and the combined meta-analysis results. Of particular interest in Figure 4 are SNPs rs9989419, rs3764261, rs1864163, all of which have strongest association with high-density lipoprotein (HDL) cholesterol among the SNPs tested. Plotting the data in this format shows the similar location of these SNPs as well as their strong association with HDL levels. Investigators examining these data could postulate that the close proximity of these SNPs and the similar genetic effect sizes made evident by Synthesis-View suggest that these SNPs are in LD with one another, prompting further investigation of the data. Also, plotting the samples size is useful in (indirectly) visualizing the power of each test of association, which is necessary to interpret non-significant findings.
In Figure 5 the same Table three data from Willer et al.  is plotted again, however, different options are used. When there is a wide range of p-values, there can be compression of the less significant p-values when plotted, visible in Figure 4 for the results for SNPs other than rs9989419, rs3764261, rs1864163. Synthesis-View allows for the choice of significance level, such that any points more significant than the cutoff value are plotted at that value, in a larger size (in this case set at 1E-33). This feature allows for closer inspection of the less significant p-values. Also used in Figure 5 is the option for "jitter", where overlapping points are plotted with horizontal distance between them.
In some settings, studies have multiple phenotypes, as well as multiple groups (such as multiple race-ethnicities). In this case, Synthesis-View will plot the results for all the phenotypes for each group on separate tracks. Figure 6 shows an example of this feature, where associations were calculated for six phenotypes and a series of SNPs (from Jeff et al. 2010 in preparation), and investigated for three race/ethnicities: Mexican Americans (MA); European Americans (EA); and African Americans (AF). Thus, in this plot, the results across multiple phenotypes are plotted with a track for each race/ethnicity allowing for multi-layered results to be viewed together.
Another feature available in Synthesis-View is the plotting of a D' or r2 plot in Haploview style format  as the bottom-most track, shown in Figure 7 (From Jeff et al. 2010 in preparation). This plot will appear when D' or r2 data are provided in a separate file.
Forest Plot/Odds Ratio options in Synthesis-View
Synthesis-View also has an option for plotting odds ratio (OR) results in forest plot format. Figure 8 shows an example from the International Multiple Sclerosis Genetics Consortium (IMSGC) , where original IMSGC Multiple Sclerosis GWAS results were investigated for replication in an independent dataset. A Stage 1 analysis was performed to examine the replication of a series of SNP/phenotype associations. In Stage 2 a smaller subset of 19 SNPs were tested for further replication. The results from Stage 1, Stage 2, and the final combined analysis for 19 SNPs were presented in Table two of the manuscript , and the results are presented here in forest plot format using Synthesis-View. The tracks are as follows:
The first track, like with the standard Synthesis-View plot, is a physical genome track, displaying the chromosome and relative location of each SNP used in the association tests.
The next track is an optional significance track, displaying the p-values. A single color represents each group. In this case, a red line has been placed at a p-value of 0.05.
The next three tracks are odds ratio/forest plot tracks. Squares represent the OR point estimate, with lines representing the upper and lower 95% confidence intervals. Here the similarity of the results between Stage 1 and Stage 2 are visible. An additional option, not shown, is available. If a result is significant (the upper or lower boundary of the confidence intervals does not cross 1.0), the square can be plotted in larger size, allowing for quick visual identification of significant results in forest plots with a large number of results.
The second to last track is the CAF track. Colors match those of the groups of the previous tracks, allowing the user to identify trends in allele frequencies between groups which can aid in interpreting replication of results. The option of horizontal separation of overlapping points was also used here as the CAF measurements were very similar between the analyses.
The last track is the sample size track. Case/control sample size can either be plotted in separate tracks, or, as shown here with closed circles indicating cases, and open circles indicating controls in the same track. The colors match those of the groups of the previous tracks. This option is also available when the CAF for cases vs. controls are provided.
The upper panel of Figure 9 shows the plot that appears within the web browser, after "generate image" is selected. This plot contains embedded links for each SNP to the NCBI SNP database. The lower panel shows the NCBI SNP page that appeared when the SNP 17419032 was selected with the mouse.
An alternative way to view OR results is in stacked tracks where the eye moves from top to bottom, in more of the Synthesis-View standard format. If the forest plot option is not chosen in Synthesis-View, the default data plot is in this format. Unlike the forest plots of Figure 8 ORs are plotted as closed circles. When OR results are significant, the OR closed circle is plotted in a larger size, rendering it easy to discriminate significant results visually.
To date, most replication, meta-analysis, and even top GWAS results, are presented in tabular form. There are additional ways to display these results. We developed and describe here a visualization tool for studies that have data for less than 100 SNPs, which is typical for a targeted genotyping study using thousands to tens of thousands of samples. We emphasize that Synthesis-View is especially effective when data are being investigated across phenotypes, studies, and multiple race/ethnicities. Through visually incorporating results, details of individual SNP-phenotype relationships as well as larger trends in the interplay between information such as SNP location, sample size, data stratification, and allele frequencies can be viewed.
Bush WS, Dudek SM, Ritchie MD: Visualizing SNP statistics in the context of linkage disequilibrium using LD-Plus. Bioinformatics. 2010, 26: 578-579. 10.1093/bioinformatics/btp678.
Barrett JC, Fry B, Maller J, Daly MJ: Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005, 21: 263-265. 10.1093/bioinformatics/bth457.
Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, Clarke R, Heath SC, Timpson NJ, Najjar SS, Stringham HM: Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet. 2008, 40: 161-169. 10.1038/ng.76.
Comprehensive follow-up of the first genome-wide association study of multiple sclerosis identifies KIF21B and TMEM39A as susceptibility loci. Hum Mol Genet. 2010, 19: 953-962. 10.1093/hmg/ddp542.
We would like to acknowledge the following individuals for their suggestions and ideas in designing Synthesis-View: Matthew Thomas Oetjens, Janina Jeff, Logan Dumitrescu, Fredrick Schumacher, Chris Haiman. This work was funded in part by the following grants: LM010040 and HG004798.
The authors declare that they have no competing interests.
SAP carried out design of Synthesis-View, SMD programmed, DCC and MR participated in design and coordination and drafting of the manuscript. All authors read and approved the final manuscript.