This article has Open Peer Review reports available.
PGxClean: a quality control GUI for the Affymetrix DMET chip and other candidate gene studies with non-biallelic alleles
© Rotroff et al.; licensee BioMed Central Ltd. 2014
Received: 8 June 2014
Accepted: 18 October 2014
Published: 6 November 2014
PGxClean is a new web application that performs quality control analyses for data produced by the Affymetrix DMET chip or other candidate gene technologies. Importantly, the software does not assume that variants are biallelic single-nucleotide polymorphisms, but can be used on the variety of variant characteristics included on the DMET chip. Once quality control analyses have been completed, the associated PGxClean-Viz web application performs principal component analyses and provides tools for characterizing and visualizing population structure.
The PGxClean web application accepts genotype data from the Affymetrix DMET chip or the PLINK PED format with genotypes annotated as (A,C,G,T or 1,2,3,4). Options for removing missing data and calculating genotype and allele frequencies are offered. Data can be subdivided by cohort characteristics, such as family ID, sex, phenotype, or case–control status. Once the data has been processed through the PGxClean web application, the output files can be entered into the PGxClean-Viz web application for performing principal component analysis to visualize population substructure.
The PGxClean software provides rapid quality-control processing, data analysis, and data visualization for the Affymetrix DMET chip or other candidate gene technologies while improving on common analysis platforms by not assuming that variants are biallelic. The web application is available at http://www.pgxclean.com.
Keywords: SNP, Bioinformatics, Data visualization, Genomics
Although current single nucleotide polymorphism (SNP) chip technologies generally produce very high quality data, quality control and data cleaning steps remain critical components of any genetic study analysis plan. Several quality control (QC) steps have become “best practices” for cleaning SNP data [1]. These steps have been integrated into commonly used software packages, such as PLINK [2]. These packages have generally been developed for genome-wide SNP chips, which are designed to genotype biallelic SNPs or copy number variants.
Although biallelic SNPs are the most common type of variant in the genome, a number of genes have multi-allelic genotypes or complex haplotype/diplotype structures that are not readily genotyped on standard genome-wide chips [3]. In the field of pharmacogenomics, this is of particular importance because many of the established associations are in genes that do not follow the typical biallelic assumptions and are not well covered on standard chips [4, 5]. In response, both Affymetrix (http://www.affymetrix.com) and Illumina (http://www.illumina.com) have developed specific genotyping arrays for pharmacogenomics genes.
The growing popularity of these chips drives the need for quality control tools that properly “clean” and process data without requiring that the genotypes be biallelic. In the current study, we introduce PGxClean, a web-based software application with a graphical user interface (GUI) that was designed to perform basic QC and to produce publication-quality figures for typical quality control procedures. The software was designed for the output format of the Affymetrix DMET Plus chip, but can also be used with the commonly used PLINK “PED” format [2], making it readily compatible with other data (both from the Illumina pharmacogenomics chip and from other platforms).
PGxClean is implemented in the freely available R-Shiny software [6, 7] and several available packages, as described in the documentation found at http://cran.us.r-project.org. The source code is available for download at http://www4.stat.ncsu.edu/~motsinger. The PGxClean website can be accessed at http://www.pgxclean.com. Example data files in both DMET and PED formats are available for download on the homepage, either as templates for formatting your own files or for experimenting with the website functionality. If a PED file is uploaded, an accompanying MAP file must also be uploaded to provide the appropriate column headers. Additional details about the MAP files are available under the ‘Documentation’ tab in the navigation menu on the website. A searchable preview of uploaded data is available by selecting the ‘Raw Upload Data Table’ tab under the navigation menu. For DMET chip data (or other genetic data on a similar scale, with about 2000 variants), the whole QC process can be completed in only a few minutes. The software is designed for candidate gene data and would not run efficiently on genome-wide scale data. In addition, the software does not test for cryptic relatedness, perform gender checks, or assess genotyping concordance with reference samples (e.g. HapMap). However, these tools could be implemented in future iterations of PGxClean.
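To make the relationship between the PED and MAP inputs concrete, the following Python sketch pairs the two files. It is an illustration only and is not taken from the PGxClean source (which is written in R). It assumes the standard PLINK layout: six leading PED columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two allele columns per variant, and a four-column MAP file whose second column is the variant ID. The function names `read_map` and `read_ped` and the toy data are hypothetical.

```python
from io import StringIO

# Standard leading columns of a PLINK PED file (assumed layout).
PED_FIXED = ["FID", "IID", "PAT", "MAT", "SEX", "PHENO"]


def read_map(handle):
    """Return the variant IDs (second column) of a PLINK MAP file, in order."""
    return [line.split()[1] for line in handle if line.strip()]


def read_ped(handle, variant_ids):
    """Parse PED rows; each genotype becomes a tuple of two allele codes."""
    samples = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        alleles = fields[6:]
        if len(alleles) != 2 * len(variant_ids):
            raise ValueError(
                f"{fields[1]}: expected {2 * len(variant_ids)} allele columns, "
                f"got {len(alleles)}"
            )
        record = dict(zip(PED_FIXED, fields[:6]))
        # The MAP file supplies the column headers for the allele pairs.
        record["genotypes"] = {
            vid: (alleles[2 * i], alleles[2 * i + 1])
            for i, vid in enumerate(variant_ids)
        }
        samples.append(record)
    return samples


# Toy input: two variants, two individuals ("0 0" is PLINK's default missing code).
map_text = "1 rs0001 0 1000\n1 rs0002 0 2000\n"
ped_text = "F1 S1 0 0 1 2 A C G G\nF2 S2 0 0 2 1 A A 0 0\n"

variant_ids = read_map(StringIO(map_text))
samples = read_ped(StringIO(ped_text), variant_ids)
print(variant_ids)
print(samples[0]["genotypes"])
```

Because each genotype is stored as a pair of allele codes rather than a count of a single reference allele, the same structure can hold variants with more than two alleles.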
Genotyping efficiency/Missing Data. The first QC step is screening both variants and individuals for high levels of missing data. By default, variants with more than 5% of genotypes missing across all individuals are removed from the dataset. Individuals are then checked for missing data, and those with more than 5% of variants missing are removed. Both thresholds can be set by the user to suit the needs of a particular study.
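The two-stage filter described above can be sketched in Python as follows. This is a minimal illustration, not the PGxClean implementation: the function name `filter_missing`, the nested-dictionary data layout, and the use of `None` for a missing genotype are all assumptions. The sketch also assumes that individual-level missingness is computed over the variants that survive the first stage, which the text does not state explicitly.

```python
def filter_missing(genotypes, max_missing=0.05):
    """Drop variants, then individuals, whose missing rate exceeds max_missing.

    genotypes: {sample_id: {variant_id: genotype or None}}, None = missing.
    Returns a new dictionary containing only the retained samples and variants.
    """
    samples = list(genotypes)
    if not samples:
        return {}
    variants = list(genotypes[samples[0]])

    # Stage 1: remove variants missing in more than max_missing of individuals.
    kept_variants = [
        v for v in variants
        if sum(genotypes[s][v] is None for s in samples) / len(samples)
        <= max_missing
    ]

    # Stage 2: remove individuals missing more than max_missing of the
    # remaining variants (assumption: evaluated after stage 1).
    kept = {}
    for s in samples:
        n_missing = sum(genotypes[s][v] is None for v in kept_variants)
        if kept_variants and n_missing / len(kept_variants) <= max_missing:
            kept[s] = {v: genotypes[s][v] for v in kept_variants}
    return kept


# Toy data with a deliberately loose 25% threshold so the effect is visible.
data = {
    "S1": {"v1": ("A", "A"), "v2": None, "v3": ("C", "T")},
    "S2": {"v1": ("A", "G"), "v2": None, "v3": ("C", "C")},
    "S3": {"v1": ("G", "G"), "v2": ("T", "T"), "v3": ("C", "T")},
    "S4": {"v1": ("A", "A"), "v2": ("T", "T"), "v3": None},
}
cleaned = filter_missing(data, max_missing=0.25)
print(sorted(cleaned))        # retained individuals
print(sorted(cleaned["S1"]))  # retained variants
```

In this toy example, v2 is missing in half of the individuals and is dropped first; S4 is then missing one of the two remaining variants and is dropped in the second stage.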
Test for Deviation from Hardy-Weinberg Proportions. Testing markers for deviation from the proportions expected under Hardy-Weinberg equilibrium [8] has become an important check of overall genotyping quality. PGxClean tests for deviations using Fisher’s Exact test (so the results are valid even in very small samples or for low allele frequencies) and reports results with or without correction for multiple comparisons. Typical implementations of tests for Hardy-Weinberg disequilibrium assume that variants are biallelic. For PGxClean, we used expanded versions of the Hardy-Weinberg equation for multiple alleles to calculate expected values. Additionally, this analysis can be performed on stratified portions of the dataset, specified in a “PED” format. For example, in a case–control study this filter should be applied to the control samples only, and in a heterogeneous sample the analysis may be run separately for different ethnic groups. This can be accomplished by selecting the group to stratify on using the drop-down menu that appears once the PED file is uploaded. Additionally, if a DMET formatted file is uploaded, the “case–control study” box can be selected (Figure 1). This will stratify the analysis based on IDs with “control” in the sample name. Additional details regarding this step are available under the “documentation” tab at http://www.pgxclean.com.
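As an illustration of the multi-allelic expansion, the Python sketch below computes expected genotype counts for a variant with any number of alleles. For allele frequencies p_i, the expected frequency of homozygote i/i is p_i^2 and of heterozygote i/j is 2·p_i·p_j. The sketch covers only the expected values; it does not implement the Fisher’s Exact test or the multiple-comparison correction that PGxClean applies. The function names and the toy genotypes are hypothetical and are not taken from the PGxClean source.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hwe_expected_counts(genotypes):
    """Expected genotype counts under Hardy-Weinberg for any number of alleles.

    genotypes: list of (allele, allele) tuples with missing calls excluded.
    Returns {(allele_a, allele_b): expected count} with allele_a <= allele_b.
    """
    n = len(genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}

    expected = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # Homozygote: p_a^2; heterozygote: 2 * p_a * p_b.
        prob = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
        expected[(a, b)] = prob * n
    return expected


def observed_counts(genotypes):
    """Observed genotype counts, ignoring allele order within a genotype."""
    return Counter(tuple(sorted(g)) for g in genotypes)


# Toy tri-allelic variant in ten individuals (alleles A, C, T).
calls = (
    [("A", "A")] * 2 + [("A", "C")] * 2 + [("C", "C")]
    + [("A", "T")] * 2 + [("C", "T")] * 2 + [("T", "T")]
)
expected = hwe_expected_counts(calls)
observed = observed_counts(calls)
for genotype in sorted(expected):
    print(genotype, observed[genotype], round(expected[genotype], 2))
```

With two alleles the same function reduces to the familiar p^2, 2pq, q^2 proportions, so biallelic SNPs need no special handling.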
After processing data using PGxClean, the user can download a zipped file containing several outputs, including the newly ‘cleaned’ data file, allele frequencies, the results of the Hardy-Weinberg equilibrium tests, and a genotype file. The genotype output file can be used in PGxClean-Viz, an extension of PGxClean that provides tools for performing principal component analysis (PCA) and other visualizations.
Principal Component Analysis (PCA)
Although this software was designed for the DMET Plus chip, it also accepts PED format files, so it should be readily usable for a wide range of genotype data, and it is unique in its ability to process non-biallelic variants.
We would like to thank Kevin Long and David Reif for testing the software.
1. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008, 9: 356-369. 10.1038/nrg2344.
2. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, De Bakker PI, Daly MJ: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
3. Hodgkinson A, Eyre-Walker A: Human triallelic sites: evidence for a new mutational mechanism?. Genetics. 2010, 184: 233-241. 10.1534/genetics.109.110510.
4. Peters EJ, McLeod HL: Ability of whole-genome SNP arrays to capture ’must have’ pharmacogenomic variants. 2008, 9 (11): 1573-1577.
5. Oetjens MT, Denny JC, Ritchie MD, Gillani NB, Richardson DM, Restrepo NA, Pulley JM, Dilks HH, Basford MA, Bowton E, Masys DR, Wilke RA, Roden DM, Crawford DC: Assessment of a pharmacogenomic marker panel in a polypharmacy population identified from electronic medical records. Pharmacogenomics. 2013, 14: 735-744. 10.2217/pgs.13.64.
6. RStudio and Inc: shiny: Web Application Framework for R. 2013.
7. R Development Core Team: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2013, ISBN 3-900051-07-0, URL http://www.R-project.org/
8. Hardy GH: Mendelian proportions in a mixed population. Science. 1908, 28: 49-50. 10.1126/science.28.706.49.
9. Pearson K: Principal components analysis. Lond Edinb Dublin Philos Mag J Sci. 1901, 6: 559-
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.