Multivariate methods and software for association mapping in dose‐response genome‐wide association studies
 Chad C Brown^{1, 3},
 Tammy M Havener^{2},
 Marisa Wong Medina^{4},
 Ronald M Krauss^{4},
 Howard L McLeod^{2} and
 Alison A Motsinger‐Reif^{1, 3}Email author
DOI: 10.1186/17560381521
© Brown et al.; licensee BioMed Central Ltd. 2012
Received: 3 April 2012
Accepted: 4 December 2012
Published: 12 December 2012
Abstract
Background
The large sample sizes, freedom of ethical restrictions and ease of repeated measurements make cytotoxicity assays of immortalized lymphoblastoid cell lines a powerful new in vitro method in pharmacogenomics research. However, previous studies may have over‐simplified the complex differences in dose‐response profiles between genotypes, resulting in a loss of power.
Methods
The current study investigates four previously studied methods, plus one new method based on a multivariate analysis of variance (MANOVA) design. A simulation study was performed using differences in cancer drug response between genotypes for biologically meaningful loci. These loci also showed significance in separate genome‐wide association studies. This manuscript builds upon a previous study, where differences in dose‐response curves between genotypes were constructed using the hill slope equation.
Conclusion
Overall, MANOVA was found to be the most powerful method for detecting real signals, and was also the most robust method for detection using alternatives generated with the previous simulation study. This method is also attractive because test statistics follow their expected distributions under the null hypothesis for both simulated and real data. The success of this method inspired the creation of the software program MAGWAS. MAGWAS is a computationally efficient, user‐friendly, open source software tool that works on most platforms and performs GWASs for individuals having multivariate responses using standard file formats.
Keywords
Pharmacogenomics Lymphoblastoid cell lines Chemosensitivity Chemotherapy Temozolomide Idarubicin MANOVA GWAS Simulation studyBackground
While genotyping technology can provide valuable individual genetic data, pharmacogenomics research has met little success in using this data to improve patient outcomes. This result is less surprising, considering that pharmacogenomics studies with human populations often map a limited a priori set of targeted genes for therapeutic agents whose heritability is unknown using clinical trial data with limited sample sizes and many potential confounders[1, 2]. Developed in response to these limitations, a new in vitro assay system utilizing dose‐response (DR) profiles of immortalized lymphoblastoid cell lines (LCLs) allows for rapid experimental testing of multiple agents using virtually unlimited sample sizes, for either linkage or association at the genome‐wide level[3]‐[13]. A more thorough discussion of the practical benefits for using LCLs in pharmacogenomics research can be found in recent review articles[1, 2].
However, obstacles for using LCLs to identify clinically relevant loci include elucidation of the translational relevance between donor and LCL, and the development of methods that are powerful for using DR data in gene mapping. For this second item, gene mapping studies (such as in genome‐wide association studies, or GWASs) attempt to find loci with genotypes having significantly different phenotypes. For cell models and other DR data, phenotypes are complex, and “different” is not well defined. Previous simulation studies show that the most powerful method depends on what these differences are[14]. For example, if differences in the DR curves between genotypes are due entirely to half inhibitory concentration (IC50), then summarizing each curve with it’s estimated IC50 and treating that as the response is the best course of action. However, if the true differences between genotypes is not IC50, then this method may not perform optimally. We would like a method that is robust to differences in the dose‐response profiles between genotypes. We would also like a method that is very powerful at detecting those kinds of differences that are likely to arise from biology. For instance, if a certain polymorphism affects the expression of an enzyme involved with inactivating a drug target, what kinds of differences in the DR profiles for LCL cytotoxicity will this polymorphism produce? Many previous studies have fit a non‐linear equation (often without testing the appropriateness of this model) to each DR curve, choosing one parameter from this fit, and perform association using this parameter estimate[3]‐[13]. As mentioned above, the “true” differences in the DR profiles between phenotypes may not be captured by this parameter.
A previous study investigated using many different univariate summaries (such as parameter estimates from a hill slope fit) as the response in simulated association testing[14]. This study investigated the power each summary measure had in association testing, using many different simulated differences in the DR profiles between genotypes. Every summary measure performed poorly under at least one alternative. However, methods that modeled all of the responses jointly performed at least adequately (and sometimes very well) under every alternative. At question, however, is which alternative(s) is(are) most representative of the differences between in the DR profiles between genotypes for meaningful loci that are likely to be encountered in practice? To this end, the current study uses actual differences in curves between genotypes from real data for effects that have strong support for representing true biology. These differences will be referred to as signals throughout. Each of these signals was created from markers that were found to be significant in genetic association studies and also had been validated from other data or from previous results. In this way, using real data to guide a simulation study, where these differences can be modeled after genuine biological signals, may give a more accurate assessment of which method performs best in practice.
Methods
Biological motivation for simulation study
The signals used for study were all from a set of GWASs of 515 LCLs exposed to either the cancer drug temozolomide or 5‐fluorouracil. A strong association (p= 3.3∗10^{−16}) was found between LCL cytotoxicity to temozolomide and locus rs531572, located within the gene coding for O^{6}‐methylguanine‐DNA methyltransferase [MGMT: ENSG00000170430]. In addition, a strong association (p= 6.8∗10^{−26}) was found between the same loci and expression levels for MGMT transcripts[15]. MGMT is known to repair DNA damaged by temozolomide, and genetic variants affecting MGMT are known to be associated with temozolomide clinical efficacy[16]. Similarly, suggestive associations (p= 5.9∗10^{−7} and) were found between LCL cytotoxicity to 5‐fluorouracil and locus rs2270311, located within the gene coding for chimerin 2 [CHN2: ENSG00000106069]. Significant differences in the expression of CHN2 have been found between between colon cancer cells having different levels of 5‐fluorouracil resistance[17].
Simulation and power comparisons of cell line methods
where Y_{ i j k } is the vector of viabilities for six concentrations for the k^{ t h } replication of the j^{ t h } individual having genotype i (the number of minor alleles). Here,$\hat{\Sigma}$ is the sample covariance of errors,${\hat{\mu}}_{i}$ is the vector of mean viabilities for genotype i, and ES is the effect size. Effect sizes ranged from 0 (corresponding to the null) to 1.0 (the observed differences between genotypes). The sample size was set to 500 and minor allele frequency (MAF) was set to 0.5.
In addition, each of these same methods were compared using a previous simulation described in[14], where differences in the DR curves between genotypes are due to differences in the distribution of hill slope parameters. Figure6 gives power curves for a representative sample of these simulations, where data were simulated under an additive genetic model, with equally spaced drug dosages and a MAF of 0.5. Panels C ‐ E represent power curves for each method where differences in curves between genotypes are due to the “Min”, “IC50” and “Slope” parameter distributions, respectively. Using the Friedman test, significant differences (p < 10^{−15} for all) in p‐values were found between methods for every positive (i.e. non‐null) effect size for both sets of simulations.
Results and discussion
The simulated data (for the current study) were generated using a multivariate normal model. In addition, MANOVA has an assumption of multivariate normality. This appears as an intentional way to provide MANOVA with an advantage, since correct modeling assumptions are guaranteed. However, our goal was actually to identify a statistical model with sufficient complexity to accurately capture most aspects of the real data. In addition, the univariate summary methods rely on ANOVA, a statistical method that is robust to the assumption of normality, reducing the inherent advantage MANOVA has over the other methods. Indeed, even though data generation was under multivariate normality, MANOVA is not assured to be the most powerful for every situation. First, if the number of responses is large, power attenuates, since the number of covariance parameters needing to be estimated grows with the square of the number of responses. Second the power of MANOVA compared to other methods seems to be reduced when the off‐diagonal covariance parameters decreases. This is shown in Figure7, where the mean vectors for each genotype are modeled after the temozolomide and MGMT signal, but residual covariances are modeled using an exponential decay or equal correlation. Here, the dominating power of MANOVA over the other methods is reduced as the correlation between residuals decreases. A final way in which MANOVA may not perform optimally, compared with other methods, is when the differences in the DR profiles between genotypes are due entirely to a simple summary measure. These ideas are illustrated in Figures8 and9, where data are generated for large (12) vector responses, correlations are low (ρ=0.25) and differences in the DR curves between genotypes can be summarized well using simple summary measures. In this case, the MANOVA method does not perform optimally, even though data are generated using a multivariate normal model.
Software implementing MANOVA in GWASs: Introducing MAGWAS
Because of the power and robustness of the MANOVA design in detecting true biological signals in LCL response, and more generally the need for analysis tools for GWASs having multivariate responses, the program MAGWAS (multivariate ANOVA genome‐wide association software) was developed[21]. MAGWAS calculates the significance of each single nucleotide polymorphisms (SNP), allows for inclusion of confounding variables and uses PLINK file formats[22, 23]. The program is free, fast to download (less than 1 MB), requires no installation, is platform independent and runs using a single line on a system’s command prompt. In addition, MAGWAS is efficient with system memory and GWASs are generally completed in much less than an hour on typical desktop systems. Indeed, analysis of the two data sets for the present study where each data set had about 500 individuals, and each individual having six responses and 33 covariates, repeated across about 2 million SNPs took less than 17 minutes apiece on a modest computing cluster (two Intel(R) Xeon(R) CPU E5450 processors). MAGWAS is written completely in Java and is registered as open source software under a GNU public license. Matrix manipulations are accomplished using the Java package JAMA[24] and distributional calculations are made using the Java package DistLib[25].
Conclusions
In conclusion, no method is uniformly superior for modeling the genetic effects for DR cell line data. However, unlike other univariate summary methods (such as estimating the IC50 for each DR curve and treating this estimate as the response), MANOVA makes no assumption about the nature of the DR curve or about what the differences in the DR curves between genotypes are expected to be. Unsurprisingly, using simulated data from a previous study, MANOVA was the only method that had good power under all alternatives considered. In addition, MANOVA was most powerful in detecting differences between genotypes for simulations based on signals obtained from real data. A final advantage of using MANOVA is that test statistics asymptotically follow a known distribution under the null, allowing for fast and accurate analysis of each loci in a GWAS, so long as sample sizes are not small. Quantile‐quantile (QQ) plots of the null distributions of test statistics from this study, as well as the null for the previous simulation, demonstrate that the type I error rate is under control in all conditions, except when data are generated using a gamma distribution for the error structure. Figure10 shows four representative examples of these QQ plots. Under the assumption that most loci have no true association in genome‐wide data, a similar result was found for QQ plots from a separate set of GWASs (results not shown) involving LCL response to 29 different anticancer drugs.
On the other hand, MANOVA is not guaranteed to always be optimal, especially if the number of responses for each individual is large, the correlations between responses are small, and/or the true differences in DR curves between genotypes are known to be due to entirely to single univariate summary measure. Also, the interpretation for MANOVA is somewhat more complicated, as traditional measures of effect size, such as model r‐squared (R2) and parameter estimates are not immediately clear. For example, R2 may vary substantially between responses at each concentration. Similarly, parameter estimates for the effect of SNP are unlikely to be the same, and may even have opposite directions, at different concentrations. Overall, we feel the benefits outweigh the handicaps and that MANOVA is a great method for gene mapping, or other association tests utilizing DR data.
The current study explored two basic classes of linear models as methods for testing differences between genotypes for DR data. These include using ANOVA to model the differences in DR summary measures, and using MANOVA to model the differences between response vectors jointly. However, several other methods could potentially be applied for modeling DR data. One of these includes adapting methods from longitudinal and repeated measures data analysis[26]. If the basic form of the covariance structure between residuals were known, these methods could potentially reduce the number of parameters being estimated, and may be ideal, especially in situations where the number of responses (concentrations) for each individual is rather large. Another set of methods employ semi and non‐parametric designs for mapping quantitative trait loci[27]‐[29]. Semi (non)‐parametric designs are attractive because they use minimal (no) modeling assumptions, therefore guarding against incorrect model specification. Future studies would benefit from comparing these designs to MANOVA in DR data from LCL cytotoxicities.
Abbreviations
 DR:

dose response
 LCL:

lymphoblastoid cell line
 GWAS:

genome‐wide association study
 QQ:

quantile‐quantile
 ANOVA:

analysis of variance
 MANOVA:

multivariate ANOVA
Declarations
Acknowledgements
This work was supported by NIH NCI R01CA161608 and T32GM081057 from the National Institute of General Medical Sciences and the National Institute of Health. Kevin Long, Director, PharmacoInformatics, UNC IPIT, helped with testing MAGWAS. Some parts of this manuscript were taken from a recent technical report[21].
Authors’ Affiliations
References
 Wheeler H, Dolan M: Lymphoblastoid cell lines in pharmacogenomic discovery and clinical translation. Pharmacogenomics. 2012, 13: 5570. 10.2217/pgs.11.121.PubMed CentralView ArticlePubMed
 Welsh M, Mangravite L, Medina M, Tantisira K, Zhang W, Huang R, McLeod H, Dolan M: Pharmacogenomic discovery using cell‐based models. Pharmacological rev. 2009, 61 (4): 413429. 10.1124/pr.109.001461.View Article
 Watters JW, Kraja A, Meucci MA, Province MA, McLeod HL: Genome‐wide discovery of loci influencing chemotherapy cytotoxicity. Proc Natl Acad Sci USA. 2004, 101: 1180911814. 10.1073/pnas.0404580101.PubMed CentralView ArticlePubMed
 Huang R, Duan S, Bleibel W, Kistner E, Zhang W, Clark T, Chen T, Schweitzer A, Blume J, Cox N: A genome‐wide approach to identify genetic variants that contribute to etoposide‐induced cytotoxicity. Proc National Acad Sci. 2007, 104 (23): 975810.1073/pnas.0703736104.View Article
 Dolan M, Newbold K, Nagasubramanian R, Wu X, Ratain M, Cook E, Badner J: Heritability and linkage analysis of sensitivity to cisplatin‐induced cytotoxicity. Cancer res. 2004, 64 (12): 435310.1158/00085472.CAN040340.View ArticlePubMed
 Bleibel W, Duan S, Huang R, Kistner E, Shukla S, Wu X, Badner J, Dolan M: Identification of genomic regions contributing to etoposide‐induced cytotoxicity. Human genet. 2009, 125 (2): 173180. 10.1007/s0043900806074.View Article
 Watson VG, Motsinger‐Reif A, Hardison NE, Peters EJ, Havener TM, Everitt L, Auman JT, Comins DL, McLeod HL: Identification and Replication of Loci Involved in Camptothecin‐Induced Cytotoxicity Using CEPH Pedigrees. PLoS ONE. 2011, 6: e1756110.1371/journal.pone.0017561.PubMed CentralView ArticlePubMed
 Peters EJ, Kraja AT, Lin SJ, Yen‐Revollo JL, Marsh S, Province MA, McLeod HL: Association of thymidylate synthase variants with 5‐fluorouracil cytotoxicity. Pharmacogenet Genomics. 2009, 19: 399401. 10.1097/FPC.0b013e328329fdec.View ArticlePubMed
 Peters E, Motsinger‐Reif A, Havener T, Everitt L, Hardison N, Watson V, Wagner M, Richards K, Province M, McLeod H: Pharmacogenomic characterization of US FDA‐approved cytotoxic drugs. Pharmacogenomics. 2011, 12 (10): 14071415. 10.2217/pgs.11.92.PubMed CentralView ArticlePubMed
 Watson V, Hardison N, Harris T, Motsinger‐Reif A, McLeod H: Genomic Profiling in CEPH cell lines distinguishes between the Camptothecins and Indenoisoquinolines. Mol Cancer Ther. 2011, 10 (10): 18391845. 10.1158/15357163.MCT100872.PubMed CentralView ArticlePubMed
 Duan S, Huang R, Zhang W, Mi S, Bleibel W, Kistner E, Cox N, Dolan M: Expression and alternative splicing of folate pathway genes in HapMap lymphoblastoid cell lines. Pharmacogenomics. 2009, 10 (4): 549563. 10.2217/pgs.09.8.PubMed CentralView ArticlePubMed
 Gamazon E, Duan S, Zhang W, Huang R, Kistner E, Dolan M, Cox N: PACdb: a database for cell‐based pharmacogenomics. Pharmacogenetics and genomics. 2010, 20 (4): 269PubMed CentralPubMed
 Stark AL, Zhang W, Mi S, Duan S, O’Donnell PH, Huang RS, Dolan ME: Heritable and non‐genetic factors as variables of pharmacologic phenotypes in lymphoblastoid cell lines. Pharmacogenomics J. 2010, 10: 505512. 10.1038/tpj.2010.3.PubMed CentralView ArticlePubMed
 Brown C, Havener T, Everitt L, McLeod H, Motsinger‐Reif A: A comparison of association methods for cytotoxicity mapping in pharmacogenomics. Frontiers in Stat Genet And Methodology. 2012, 2: 114.
 Brown CC, Havener TM, Medina MW, Mangravite L, Kraus RM, McLeod H, Motsinger‐Reif A: A genome‐wide association analysis of temozolomide response using lymphoblastoid cell lines reveals a clinically relevant association with MGMT. Pharmacogenetics and Genomics. 2012, 22 (11): 796802. 10.1097/FPC.0b013e3283589c50.PubMed CentralView ArticlePubMed
 Hegi M, Liu L, Herman J, Stupp R, Wick W, Weller M, Mehta M, Gilbert M: Correlation of O6‐methylguanine methyltransferase (MGMT) promoter methylation with clinical outcomes in glioblastoma and clinical strategies to modulate MGMT activity. J Clinical Oncol. 2008, 26 (25): 41894199. 10.1200/JCO.2007.11.5964.View Article
 Schmidt W, Kalipciyan M, Dornstauder E, Rizovski B, Steger G, Sedivy R, Mueller M, Mader R: Dissecting progressive stages of 5‐fluorouracil resistance in vitro using RNA expression profiling. Int j cancer. 2004, 112 (2): 200212. 10.1002/ijc.20401.View ArticlePubMed
 Jarek S: mvnormtest: Normality test for multivariate variables. 2009, [R package version 0.1‐7]
 Doornik J, Hansen H: An Omnibus Test for Univariate and Multivariate Normality*. Oxford Bulletin of Economics and Stat. 2008, 70: 927939.View Article
 Timm N: Applied multivariate analysis. 2002, 175 Fifth Avenue, New York, NY 10010, USA: Springer Verlag New York, Inc.
 Brown C, Motsinger‐Reif A: Software for genome‐wide association studies having multivariate responses: Introducing MAGWAS. NC State Department Stat Tech R. 2012, 19.
 Purcell S, Neale B, Todd‐Brown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, De Bakker P, Daly M: PLINK: a tool set for whole‐genome association and population‐based linkage analyses. The Am J Human Genet. 2007, 81 (3): 559575. 10.1086/519795.View Article
 Purcell S: PLINK v1.07. 2009, [http://pngu.mgh.harvard.edu/purcell/plink/]
 Hicklin J, Moler C, Webb P, Boisvert RF, Miller B, Pozo R, Remington K: JAMA v1.0.2. 2011, [http://math.nist.gov/javanumerics/jama/]
 Steinmetz PN, Warnes G, Warnes J: DistLib v0.9.1. 2007, [http://statdistlib.sourceforge.net/]
 Molenberghs G, Verbeke G: Models for discrete longitudinal data. 2006, 233 Spring Street, New York, NY 10013, USA: Springer Science+Business Media, Inc.
 Wu S, Fu G, Chen Y, Wang Z, Wu R: Genetic mapping of complex traits by minimizing integrated square errors. BMC genet. 2012, 13: 20PubMed CentralView ArticlePubMed
 Wu S, Yang J, Wu R: Semiparametric functional mapping of quantitative trait loci governing long‐term HIV dynamics. Bioinformatics. 2007, 23 (13): i569—i576View ArticlePubMed
 Yang J, Wu R, Casella G: Nonparametric functional mapping of quantitative trait loci. Biometrics. 2008, 65: 3039.View ArticlePubMed
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.