To show the utility of PhenoGram, and the ways that multiple options can be combined for different types of plots, we describe here several example uses of this software. For the first set of examples, we used a subset of data from the NHGRI GWAS Catalog to demonstrate some features of PhenoGram, highlighting similarities and differences between our plots and the NHGRI GWAS Catalog plots. We chose this data because it allowed us to represent multiple phenotypes across the genome and to highlight other relationships in the data, such as pleiotropy or ancestry. In addition, the GWAS Catalog data could be prepared as input to PhenoGram with a single database query and minimal data.
Here, we chose as an example a subset of NHGRI GWAS Catalog results covering a diverse range of eight phenotypes: rheumatoid arthritis, Crohn’s disease, blood pressure, Alzheimer’s disease, breast cancer, pancreatic cancer, colorectal cancer, and prostate cancer. Figure 2 shows a basic PhenoGram plot summarizing the SNP locations for GWA-significant associations with these eight phenotypes. As in the NHGRI GWAS Catalog plots, each line connects a chromosomal location to a colored circle depicting the associated phenotype. A key of phenotypes and corresponding circle colors is displayed across the bottom of the image. PhenoGram has multiple options for altering the graphical style of the colored circles. For Figure 2, the options to outline the circles (−O) and increase the phenotype font size (−F) were used.
Depending on the amount of data to be plotted, as well as the proximity of genomic regions, different spacing may be needed to plot multiple data points clearly. For example, an input file with a great number of phenotypes may produce a plot with circles that are too closely juxtaposed. Thus, PhenoGram has several options for modifying the spatial presentation of the circles or other annotations on PhenoGram plots. Figure 3 shows the results of using different PhenoGram spacing algorithms that can mitigate the issue of overlapping plotted data. The first method, standard spacing, is the default used by PhenoGram. The equal spacing method (−p equal) places the circles at equal intervals along the chromosome. The third method, proximity spacing (−p proximity), minimizes circle overlap while still attempting to place circles or other annotations near their respective chromosomal locations.
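To make the difference between equal and proximity spacing concrete, the following is a minimal Python sketch of both ideas. It is not PhenoGram's implementation: the function names, the `min_gap` parameter, and the greedy push-apart strategy are assumptions chosen for illustration only.

```python
def equal_spacing(n, chrom_length):
    """Place n annotations at equal intervals along a chromosome.

    Hypothetical helper: ignores the true loci and spreads the
    annotations evenly between 0 and chrom_length.
    """
    if n == 0:
        return []
    step = chrom_length / (n + 1)
    return [step * (i + 1) for i in range(n)]


def proximity_spacing(positions, min_gap, chrom_length):
    """Keep annotations near their loci while enforcing a minimum gap.

    Hypothetical greedy scheme: a forward pass pushes each annotation
    just far enough from its predecessor, and a backward pass pulls
    annotations back if they ran past the chromosome end. If there are
    too many annotations to fit, early ones may fall below zero.
    """
    placed = []
    for pos in sorted(positions):
        if placed and pos - placed[-1] < min_gap:
            pos = placed[-1] + min_gap
        placed.append(pos)

    limit = chrom_length
    for i in range(len(placed) - 1, -1, -1):
        if placed[i] > limit:
            placed[i] = limit
        limit = placed[i] - min_gap
    return placed
```

With these definitions, `proximity_spacing([10, 12, 50], 5, 100)` moves only the second annotation (from 12 to 15), whereas `equal_spacing(3, 100)` ignores the loci entirely and spreads the three annotations evenly.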
The colors of the plotted circles can be generated by any of five different algorithms, shown in Figure 4. For ten or fewer phenotypes, the color list method (−c list) restricts the possible colors to those that are easily differentiated. In plots with a greater number of phenotypes, the standard color generator (−c generator) creates colors with maximum separation between all possibilities. The web-safe color option (−c web) selects colors at random from the 216 web-safe colors. The least restrictive method is the random generator (−c random), which assigns colors without regard for color proximity. Finally, it is possible to provide in the input file a column that designates a group identifier for a subset of phenotypes, such that all phenotypes sharing an identifier are plotted in a gradient of one color. Figure 4 shows the grouping method (−c group) in a plot that differentiates NHGRI GWAS Catalog cancer phenotypes from non-cancer phenotypes.
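As an illustration of two of these strategies, the sketch below generates well-separated colors by spacing hues evenly around the color wheel, and draws random colors from the 216-color web-safe palette. This is an assumption about how such generators could work, not a description of PhenoGram's code; the function names and the saturation and brightness values are hypothetical.

```python
import colorsys
import random


def separated_colors(n, saturation=0.8, value=0.9):
    """Return n hex colors with evenly spaced hues (illustrative only)."""
    colors = []
    for i in range(n):
        r, g, b = colorsys.hsv_to_rgb(i / n, saturation, value)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors


def web_safe_palette():
    """Return the 216 web-safe colors (six levels per RGB channel)."""
    levels = (0x00, 0x33, 0x66, 0x99, 0xCC, 0xFF)
    return ["#{:02x}{:02x}{:02x}".format(r, g, b)
            for r in levels for g in levels for b in levels]


def web_safe_colors(n, seed=None):
    """Pick n distinct web-safe colors at random (n must be <= 216)."""
    rng = random.Random(seed)
    return rng.sample(web_safe_palette(), n)
```

Even hue spacing is only one simple way to approximate "maximum separation"; perceptually uniform color spaces would be a reasonable alternative when many phenotypes must be distinguished.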
Similar to grouping data by phenotype, it is possible to overlay a second grouping by ancestry. When ancestry information is included in the input file, the resulting plot (Figure 5) depicts each ancestry group as a unique shape while still differentiating phenotypes with a color generation method. Here, the phenotype shapes are displayed without a black outline. GWAS Catalog data was also used in this plot to show the combination of diverse phenotype colors and distinct ancestry shapes across the genome. PhenoGram currently accepts up to three different ancestry groups; any group beyond the third appears as a circle. Figure 5 displays how PhenoGram can help visualize the relationships between genome location, phenotypes, and ancestry.
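The shape assignment rule described above (distinct shapes for the first three ancestry groups, circles thereafter) can be sketched as follows. The shape names and the function name are placeholders, since the text does not state which shapes PhenoGram uses.

```python
def assign_shapes(ancestries,
                  shapes=("diamond", "square", "triangle"),
                  fallback="circle"):
    """Map each ancestry group to a plotting shape.

    Groups receive shapes in order of first appearance; once the
    placeholder shape list is exhausted, later groups get the fallback.
    """
    mapping = {}
    for group in ancestries:
        if group not in mapping:
            idx = len(mapping)
            mapping[group] = shapes[idx] if idx < len(shapes) else fallback
    return mapping
```

For example, calling `assign_shapes(["A", "B", "A", "C", "D"])` gives groups A, B, and C their own shapes and assigns the fallback circle to D.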
PhenoGram can also create plots that contain, rather than colored shapes, only colored lines that traverse the chromosomes. In this way, the software is also useful for visualizing genome-wide or single-chromosome SNP coverage from a genotyping array, as well as for showing the locations of sequenced loci or other regions of interest. Figure 6 combines the line plotting option (−C) with base-pair position information to display the genotyping coverage of the custom Immunochip genotyping array, an array focused on autoimmune and immune-system-related genetic variants [6]. Further, it is possible in the PhenoGram input file to highlight base-pair regions via integer-coded color options and to annotate positions. In Figure 6, a densely genotyped region of the array on chromosome six is annotated; this is the major histocompatibility complex (MHC) region. In line plots, it may be useful to apply the transparent (−T) or thin (−n) line options to improve visualization in densely plotted genome regions.
Copy-number variants (CNVs) are a growing area of genetic variant exploration for neurodevelopmental disorders. Recently, a comparison was made of two microarray technologies used in the detection of CNVs. Figure 7 shows the CNV region overlap between results of an Illumina microarray and a custom microarray that was targeted for genomic hotspots of deletions and duplications [7, 8]. Another example of this approach, applied to single SNPs instead of CNVs (not shown here), would be to use PhenoGram with two different colors to contrast the density and location of a series of low-frequency variants with those of a series of more common variants.
Another option with PhenoGram is to show part of a region in more detail. Depending on the amount of data to be plotted and/or the region of interest, plotting only one chromosome can be useful, and this feature was used to plot individual chromosomes for Figures 3 and 4. Although our annotation spacing algorithms attempt to optimize the presentation of various shapes such as circles representing phenotypes, it can be necessary to visually expand densely annotated chromosomal regions. Figure 8 uses the NHGRI GWAS Catalog data from the eight aforementioned phenotypes to expand on a cluster of closely positioned phenotypes.
We have added an option in PhenoGram to show the location of cytogenetic bands across the ideogram, and we show an example in Figure 9. Genes are not uniformly distributed along the length of chromosomes. Cytogenetic bands identify biologically relevant chromosomal structure, highlighting regions that are more or less likely to be gene-rich and/or genotyped. Standard band regions that can be visualized on an ideogram are documented through the UCSC browser [9]; we downloaded them from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/. For example, “G-bands” are less gene-rich than “R-bands” [10], and we identify G-bands in PhenoGram using variations of grey and represent R-bands in white on the ideogram. There are also regions of the genome containing highly condensed heterochromatin that are largely transcriptionally silent; we have marked these in dark blue. The largest regions of heterochromatin are in the long arm of the Y chromosome and close to the centromeres of chromosomes 1, 9, and 16. Smaller heterochromatin regions are found at the centromere of each chromosome and on the p-arms of chromosomes 13, 14, 15, 21, and 22. We have also marked the “stalks” in light blue; these are five regions on the acrocentric chromosomes that contain genes coding for ribosomal RNA.
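The coloring scheme described here can be sketched as a lookup from the stain codes in the UCSC `cytoBand` table (tab-separated columns: chromosome, start, end, band name, and Giemsa stain) to display colors. The sketch below is illustrative and not PhenoGram's code: the hex values are hypothetical, and the mapping of `gneg` to R-bands, `gpos*` to G-bands, and both `gvar` and `acen` to heterochromatin is our assumption about how the stain codes correspond to the categories in the text. The inline sample rows follow the table layout and serve only to exercise the parser.

```python
import csv
import io

# Assumed stain-to-color mapping; hex values are placeholders.
STAIN_COLORS = {
    "gneg": "#ffffff",     # Giemsa-negative, treated here as R-bands (white)
    "gpos25": "#d9d9d9",   # G-bands, lightest grey
    "gpos50": "#a6a6a6",
    "gpos75": "#737373",
    "gpos100": "#404040",  # G-bands, darkest grey
    "gvar": "#00008b",     # variable heterochromatin (dark blue)
    "acen": "#00008b",     # centromeric regions (dark blue)
    "stalk": "#add8e6",    # acrocentric stalks (light blue)
}


def parse_cytobands(handle):
    """Read cytoBand-style rows and attach a display color to each band."""
    bands = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end, name, stain = row[:5]
        bands.append({
            "chrom": chrom,
            "start": int(start),
            "end": int(end),
            "name": name,
            "stain": stain,
            # Unknown stain codes fall back to white.
            "color": STAIN_COLORS.get(stain, "#ffffff"),
        })
    return bands


# Sample rows in the five-column layout, for illustration only.
SAMPLE = (
    "chr1\t0\t2300000\tp36.33\tgneg\n"
    "chr1\t2300000\t5400000\tp36.32\tgpos25\n"
    "chr1\t121500000\t125000000\tp11.1\tacen\n"
)

bands = parse_cytobands(io.StringIO(SAMPLE))
```

Each parsed band then carries the start and end coordinates and the fill color needed to draw one rectangle on the ideogram.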