 Methodology
 Open Access
 Open Peer Review
Transitiontransversion encoding and genetic relationship metric in ReliefF feature selection improves pathway enrichment in GWAS
 M. Arabnejad^{1},
 B. A. Dawkins^{2},
 W. S. Bush^{3},
 B. C. White^{1},
 A. R. Harkness^{4} and
 B. A. McKinney^{1, 2}Email authorView ORCID ID profile
 Received: 8 July 2018
 Accepted: 22 October 2018
 Published: 3 November 2018
Abstract
Background
ReliefF is a nearestneighbor based feature selection algorithm that efficiently detects variants that are important due to statistical interactions or epistasis. For categorical predictors, like genotypes, the standard metric used in ReliefF has been a simple (binary) mismatch difference. In this study, we develop new metrics of varying complexity that incorporate allele sharing, adjustment for allele frequency heterogeneity via the genetic relationship matrix (GRM), and physicochemical differences of variants via a new transition/transversion encoding.
Methods
We introduce a new twodimensional transition/transversion genotype encoding for ReliefF, and we implement three ReliefF attribute metrics: 1.) genotype mismatch (GM), which is the ReliefF standard, 2.) allele mismatch (AM), which accounts for heterozygous differences and has not been used previously in ReliefF, and 3.) the new transition/transversion metric. We incorporate these attribute metrics into the ReliefF nearest neighbor calculation with a Manhattan metric, and we introduce GRM as a new ReliefF nearestneighbor metric to adjust for allele frequency heterogeneity.
Results
We apply ReliefF with each metric to a GWAS of major depressive disorder and compare the detection of genes in pathways implicated in depression, including Axon Guidance, Neuronal System, and G ProteinCoupled Receptor Signaling. We also compare with detection by Random Forest and Lasso as well as random/null selection to assess pathway size bias.
Conclusions
Our results suggest that using more genetically motivated encodings, such as transition/transversion, and metrics that adjust for allele frequency heterogeneity, such as GRM, lead to ReliefF attribute scores with improved pathway enrichment.
Keywords
 Machine learning
 Feature selection
 Genomewide association study (GWAS)
 Genetic relationship matrix (GRM)
 Transition and transversion
Background
ReliefF is a nearestneighbors feature selection algorithm that is known for its ability to identify statistical interactions in high dimensional data [1, 2]. Specifically, it has been shown to identify genegene interaction effects in simulated and real genomewide association studies (GWAS) [2]. ReliefF uses what is called a “diff” function to determine nearest neighbors in the space of single nucleotide polymorphisms (SNPs) and to compute the importance of a SNP based on its ability to separate cases and controls in the SNP space.
While ReliefF analysis of GWAS data depends critically on its ability to measure the degree of dissimilarity between genotype states, the diff function used up to this point has been extremely simple. For example, the standard ReliefF genotype diff between two subjects is binary: the diff is 0 when the genotypes of the two subjects at a SNP are identical and 1 if their genotypes are not identical. The distance between a pair of subjects is obtained by summing the diff values in a cityblock (Manhattan) metric across all SNPs. The binary nature of this standard diff is likely an oversimplification that misses information because there are degrees of difference between genotypes. In addition to a metric based on allelesharing differences, we develop a transition/transversion (Ti/Tv) metric that accounts for physicochemical differences of nucleotides and a Genetic Relationship Matrix (GRM) [3] metric that accounts for allele frequency heterogeneity.
The main goal of the current study is to develop and compare combinations of metrics between SNPs and subjects in ReliefF feature selection. We also compare with statistical learning feature selection methods Random Forest [4] and Lasso (least absolute shrinkage and selection operator) [5]. Lasso has been used in GWAS [6, 7] but is parametric and generally uses a strong independence assumption among features. Random Forest has also been applied to GWAS and has fewer constraints than regression, which is an advantage when a multivariate genetic architecture may be involved in disease susceptibility [8–10]. We previously used Random Forest and penalized logistic regression as methods for comparing epistasis detection in simulated data [11]. When genes interact and have no marginal effect, we found that Random Forest has limited power to detect genegene interactions in high dimensional data, which was confirmed in Ref. [12]. The diminished Random Forest importance scores for interacting variants is attributable to the independence assumption in the tree nodesplitting criterion.
In the current study, we use enrichment of pathways related to major depressive disorder (MDD) to compare feature selection methods. Early GWAS studies of MDD had limited success at finding significant variants due to the contribution of many loci with small effect sizes as well as the heterogeneous characteristics of MDD and the complex interaction between genetic variation and environmental factors [13]. In recent studies, many small main effect loci have been identified through the accumulation of extremely large samples [14, 15]. Identifying broad pools of regulating, modulating or interacting SNPs that confer risk for a target disorder is an important goal. For example, bipolar disorder (BD) may occur in a family in which there is a primary susceptibility gene, but the majority of BD may involve the interactions of multiple genes [16]. Detecting these interaction effects with ReliefF may be improved by tailoring metrics to GWAS data.
The current study is organized as follows. We describe the overall strategy, briefly review the relevant components of ReliefF, describe a new allele sharing ReliefF diff, develop a new 2d transition/transversion genotype encoding and accompanying new ReliefF diff. We also introduce the first use of GRM to compute nearest hits and misses in ReliefF. We apply ReliefF using combinations of diffs and metrics with other statistical learning methods to 463 cases of MDD and 459 controls, and we test the top selected SNPs and their corresponding genes for overlap with biological pathways related to mental disorders. Our results suggest that using more genetically motivated metrics (allele sharing and Ti/Tv) and metrics that adjust for allelic heterogeneity (standardization by allele frequency in GRM) lead to ReliefF scores that improve enrichment of biologically relevant pathways.
Methods
Relief feature selection algorithm
are the mean diffs with respect to SNP g_{ν} of all subjects R_{i} (i=1…m) from their knearestneighbor misses [M_{j}(R_{i}) in Eq. (2)] and hits [H_{j}(R_{i}) in Eq. (3)]. The k nearest misses for a subject R_{i}, are the k subjects nearest to R_{i} but in a different phenotype class than R_{i}. Similarly, the set of hits of R_{i} is the set of k subjects that are nearest to R_{i} while being in the same phenotype class as subject R_{i}. An importance weight of SNP g_{ν} (W_{R}[g_{ν}]) is higher if the average of the miss diffs for the instances is greater than the average hit diffs. Thus, a SNP with a greater positive value of W_{R}(i.e,. \( {\overline{M}}_{g_{\nu }}>{\overline{H}}_{g_{\nu }} \)) is a better predictor of the phenotype because the genotypes of the SNP tend to separate instances in different classes more than instances in the same class. The diff function computes the amount that two genotypes are different for SNP g_{ν} between two subjects R_{i} and R_{j}. In the next subsection, we discuss in detail the new and old diff functions that will be compared.
New ReliefF diffs and metrics
We introduce three diff functions for measuring the genetic dissimilarity between pairs of individuals at a single locus. The first diff is the standard used in ReliefF for categorical variables, which we refer to as genotype mismatch (GM). The second metric accounts for allele sharing, which we refer to as allele mismatch (AM). The third diff further accounts for mutation type through transition/transversion differences (Ti/Tv). These first three diffs will be used to compute attribute importance and to compute cityblock (Manhattan) distances between subjects. We will discuss these nearestneighbor metrics and the genetic relationship matrix (GRM) in a later subsection.
Genotype mismatch
where genotype(g_{ν}, R_{1}) is the genotype for individual R_{1} for SNP g_{ν}. In other words, two individuals have zero diff if they have identical genotypes and they have unit diff if they have different genotypes.
Allele mismatch
where g_{iν} is the number of copies of the reference allele for the ν^{th} SNP of the i^{th} individual. In other words, the value of g_{1ν} is the number of minor alleles in the genotype: 0, 1 or 2. Then the return value of diff_{AM} (g_{ν}, R_{1}, R_{2}) is 0, 0.5 or 1 when the two individual have 2, 1 or 0 alleles in common, respectively.
Transition/transversion 2d encoding and associated diff
For transition mutations, we want our encoding to contract the distance (relative to transversion encoding) between two homozygous states and between a homozygous state and the heterozygous state (blue arrows, Fig. 4). Again we let the heterozygous state be \( \mid Aa\Big\rangle {}_{\mathrm{Ti}}=\left(\frac{1}{\sqrt{2}},\frac{1}{\sqrt{2}}\right) \), centered on the diagonal (θ = π/4) like transversions, but instead of being orthogonal, the homozygous states are contracted toward the diagonal with an angle of separation θ = π/6.
Thus, the Ti/Tv diff for transversion SNPs is equivalent to AM because a heterozygous state is half the distance between either homozygous state.
By design, this encoding causes the Ti homozygous diffs (3/4 difference) to be smaller than diffs between Tv homozygous states (1 difference) because transition mutations stay in the same biochemical family (purine to purine or pyrimidine to pyrimidine). Similarly, the encoding causes diffs between heterozygous and homozygous Ti genotypes (1/4 difference) to be smaller than the corresponding Tv diffs (1/2 difference).
Nearestneighbor distances based on Manhattan metric and the genetic relationship matrix (GRM)
However, one may also replace diff=diff_{AM} (Eq. 5) or diff=diff_{TiTv} (Eq. 8) in Eq. (9). Regardless of the diff, when an attribute’s importance score is calculated, it uses the k nearest neighbors as determined in the space of all other attributes, which allows ReliefF to identify important attributes that may involve complex higherorder interaction architecture.
and g_{iν} is the number of copies of the reference allele for the ν^{th} SNP of the i^{th} individual and p_{ν} is the frequency of the reference allele [3]. Each genotype in the summand is standardized by p_{ν} to account for the differences in allele frequency between SNPs. In addition to comparing diff functions (Eqs. 4, 5, 8) in the attribute importance score, we also compare Manhattan (Eq. 9) and GRM (Eq. 10) methods for computing the pairwise distances to find nearest neighbors.
Non Reliefbased comparison methods: Random Forest, lasso, and random gene
Random Forests (RF) is a widely used machine learning classifier and feature selector that grows an ensemble of classification trees in bagged samples with random attribute selection [21]. To measure the importance of a feature after training, the values of that feature are permuted among the training data and the outofbag error is again computed on this perturbed data set. The importance score for the feature is computed by averaging the difference in outofbag error before and after the permutation over all trees and the score is normalized by the standard deviation of these differences [4, 22–24]. We used the “Ranger” implementation of Random Forest, which is included in our inbix software to compute classification and variable importance.
We used the PLINK software to perform Lasso (lasso). We adjusted for the first five principal components, and the PCs were not subject to model selection. The top Lasso variants were chosen by top regression coefficients. We used h^{2} = 0.5 as the estimate of the heritability to calibrate the regression and we used λ_{min} = 0.001 for the L1 penalty parameter. Finally, we generated lists of random genes as a null comparison list that shows the effect of pathway size on enrichment. A random set of 500 genes is randomly sampled 100 times and the average overlap of the 500 genes is computed for each pathway. Estimating the expected number of overlapping genes by chance helps to calibrate the overlaps of each pathway for the nonrandom feature selection methods.
ReliefF software implementation

./inbix bfile <bedfile> relieff pheno <phenofile> snpmetricnn <nnmetric> snpmetricdiff <diffmetric> out <resultsfile>
where <diffmetric> can take on values gm, am, or ti/tv and <nnmetric> can also take on values gm, am, or titv with manhattan or Euclidean for combining the individual SNP diffs. The user may alternatively select grm in <nnmetric>, which uses GRM as the nearestneighbor metric and does not use the diffs. Any combination of <diffmetric> or <nnmetric> is allowed. For the MDD GWAS with 922 individuals and 281,648 SNPs (after filtering), the GRM metric takes approximately 8 hours and the other metrics take approximately 12 hours of CPU time (additional details in the Additional file 1).
We use the constantk ReliefF algorithm in inbix with the diffs and metrics described above. With inbix it is possible to optimize the number of neighbors for each attribute [18]. However, for this study, we use the constant value, k = ⌊m/6⌋, for ReliefF nearest neighbors, where m is the number of samples. The value k = ⌊m/6⌋ is the inbix default and was chosen based on Ref. [28] where it was shown to approximate the adaptive radius Relief method, MultiSURF [29], which balances power to detect epistatic effects and main effects.
GWAS data, filtering and mapping
In this study we compare ReliefF metrics with each other and with other analysis methods based on enrichment of selected features in functional pathways for MDD. We used a recent GWAS of MDD [30] that includes 922 European individuals that were recruited through a survey of 1259 individuals who filled out forms and telephone interviewed for DSMIV covering depressive, bipolar, psychotic, alcohol, substance and anxiety disorders as well as family history of mood disorders. After exclusions, extracted DNA was genotyped with the Illumina Omni1Quad microarray for 463 cases of MDD and 459 controls. For all ReliefF analyses, we used constant k = ⌊m/6⌋ = 138 nearest neighbors.
Filtering and mapping of genes and pathways
Additional details of data processing and analysis, including command line scripts, is provided in the Additional file 1. In the initial steps of analysis, dimensionality reduction is performed on SNPs by linkage disequilibrium (LD) with threshold 0.5 and minor allele frequency threshold 0.01. The goal of this filtering is to remove highly redundant and very low signal data as well as obtain a manageable number of variants for machine learning analysis (we obtain 281,648 variants).
A list of ranked SNPs for each method is obtained from the algorithms to be compared (Fig. 1 method overview). We use Ensembl IDs to map the top SNPs to genes [31]. Given the rsnumber of a SNP, the algorithm finds the location of the variant relative to genes, and the SNP is mapped to the gene symbol of the closest gene [32]. In the supplement, we include a link to our webservice that accepts a list of SNP rsnumbers and returns mapped genes to a table, and we include the R code for the mapping. Despite LD pruning, many top SNPs will map to the same gene; thus, we begin the mapping with more than 500 top SNPs so that we end up with 500 top unique genes.
We then use Molecular Signatures Database (Msigdb) [33] to identify the number of genes in our top 500 gene selection lists that overlap with target pathways. The overlaps are based on HUGO gene symbols. Our goal is not to compute the statistical significance of overlap for discovery, but to compare the number of genes found in known pathways relative to other gene ranking methods. Thus, the size of the gene background is not a concern. We also use random gene selection, mentioned above, to show the expected amount of overlap with a pathway by chance.
Results
We compare combinations of ReliefF attribute diff functions for use in the average miss and hit calculations (Eqs. 2 and 3) and metrics for computing nearest neighbors in the full space of SNPs. The diff functions are GM, AM, and Ti/Tv (Eqs. 4, 5, and 8) and nearest neighbor metrics are the Manhattan metric (Eq. 9) with the GM, AM and Ti/Tv diffs and the GRM metric (Eq. 10). We compare the three combinations of <diffmetric> and two possible <nnmetric> for a total of 3x2=6 combinations (Figs. 5, 6 and 7). We combine the three diffs with the same diff in a Manhattan metric to form the first three “<nndiff><attributediff>” combinations: GMGM, AMAM, TiTvTiTv. We also combine the three diffs with GRM nearestneighbor metric to form the other three “<GRMnn><attributediff>” combinations: GRMGM, GRMAM, GRMTiTv.
We first note that for random gene selection (“Random Genes” on rightmost side of Figs. 5, 6 and 7), the pathway overlap is correlated with the size of each pathway. As expected, choosing random genes will result in a certain amount of overlap with a pathway by chance in proportion to the size of the pathway. All methods perform better than chance (“Random Genes”) for detecting all pathways. For the Axon Guidance pathway, TiTvTiTv (Eq. 8 for the Manhattan metric and attribute diff) detected the most genes (light blue, Fig. 7). For Signaling by GPCR pathway, GRMTiTv (GRM nearest neighbor metric and TiTv attribute diff) detected the most genes (green, Fig. 7). For Neuronal System, GRMGM performed best (dark blue, Fig. 5).
Increasing in complexity of the diff (GM, AM, TiTv in Figs. 5, 6 and 7, respectively), shows pathway enrichment increasing when the attribute diff is used in the Manhattan metric (GMGM < AMAM < TiTvTiTv). This suggests a benefit to including transition/transversion information in the attribute diff calculation for attribute importance. When the diffs are combined with GRM, the GPCR pathway enrichment increases significantly. The GRM metric adjusts for heterogeneity of allele frequencies, and detecting genes that contain SNPs with such heterogeneity likely benefits from GRM.
Discussion and conclusions
Machine learning feature selection is a powerful tool for discovery in large data like GWAS with complex population and interaction structure. ReliefF is particularly efficient and powerful at selecting genes that are enriched for genegene interactions. However, in past studies, the diff function used to compute nearest neighbors and for updating attribute importance has treated information about SNPs simplistically. Thus, we introduced more complex and genetically relevant mathematical functions for computing diffs, including the development of a new 2d transition/transversion genotype encoding and associated diff. To find nearest neighbors, ReliefF implementations typically use a Manhattan metric of the attribute diffs. Euclidean can also be used to combine diffs, but past results have indicated little difference with Manhattan. A Mahalanobis distance function has also been developed, which allows for nonspherical neighborhoods [35]. In addition to Manhattan with the new diff functions, we used the genetic relationship matrix (GRM) to compute nearestneighbor distances, which has not been used previously in ReliefF.
When testing for pairwise epistatic effects in a linear model, one may decompose the epistatic effects into additive and dominant encodings [36]. ReliefF has less flexibility to mix encodings than a pairwiseSNP linear model; however, ReliefF ranks SNPs within the context of all other SNPs in the dataset, which may include pairwise and higherorder interactions. In our approach, we are able to mix encodings in a given ReliefF analysis by using different diff functions for attribute importance (Eqs. 1–3) and for finding nearest neighbors (Eqs. 9 or 10). The GM diff is based on a dominant singlelocus encoding and the AM diff is based on an additive encoding. The Ti/Tv diff is based on a new 2d Ti/Tv encoding where a genotype is mapped onto a unit sphere and contracts transition genotypes closer together than corresponding transversion genotypes (Fig. 4).
For each method, the top 500 SNPs were mapped to genes and overlap with the relevant biological pathways for major depressive disorder (MDD) was calculated. Our results provide evidence that using either AM (Eq. 5) or Ti/Tv (Eq. 8) diffs in the attribute importance score calculation (Eqs. 1–3) has an advantage over the simple GM diff (Eq. 4). The detection of genes in certain pathways also can be improved by combining the attribute diffs with the GRM metric (Eq. 10) for computing nearest neighbors.
The GRM method for finding nearest neighbors has the useful property of adjusting for alleleic heterogeneity. Using GRM to compute nearest neighbors results in the best enrichment for two of the three pathways: GPCR Signaling with GRMTiTv (green, Fig. 7, GRM nearest neighbor metric and TiTv attribute diff) and Neuronal System with GRMGM (dark blue, Fig. 5). The 2d TiTv encoding is the same as the AM diff for transversion SNPs but results in genotype differences that are contracted closer together for transition SNPs. Using TiTv results in the best enrichment for the third pathway: Axon Guidance with TiTvTiTv (light blue, Fig. 7, TiTv used for the nearest neighbor metric diff and attribute diff).
We focused on the analysis of real data as a proof of principle. Additional insight may be obtained by using a simulation strategy that incorporates transition and transversion differences that affect the phenotype. The challenge is to make such a simulation biologically realistic and not artificially biased toward one method. We used ReliefF’s simple nearest neighborfinding method with k = ⌊m/6⌋ because it balances the ability to find main effect and interaction effects [28]. However, there are other Reliefbased methods that may be used to optimize genetic findings. For example, one may use an adaptive attributespecific number of neighbors to improve power to detect main effects and interaction effects [18]. One may also increase power through adaptive radius versions of Relief, like SURF [37], MultiSURF [29], and STIR [28], or through backwards elimination versions of Relief, like iterative ReliefF.
The ReliefF methods generally perform better than the nonReliefF methods (Random Forest and Lasso). However, we note that no single analysis method can extract all information from a wholegenome association study (i.e., no free lunch) and each method finds unique gene signatures that can contribute to the overall picture of the pathway or phenotype. Thus, combining metrics and feature selection methods may be a good strategy for maximizing the detection of relevant genes. There are opportunities to improve these methods by incorporating additional information about the biological properties of the data, similar to phylogenetic substitution matrices or using coding/noncoding information. One may also assess the false discovery rate of different metrics using the STIR approach for determining the statistical significance of Relief scores [28].
Declarations
Acknowledgements
Not applicable.
Funding
This work has been supported in part by The William K. Warren Foundation and the National Institute of General Medical Sciences Center Grants P20GM121312 and P20GM103456. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Availability of data and materials
ReliefF and other methods used in this study are implemented in our command line bioinformatics tool called inbix (http://insilico.utulsa.edu/index.php/inbix/), which is written in C++ and publicly available from GitHub (https://github.com/insilico/inbix).
Author’s contributions
MA, BW, WB, AH, BD, and BM contributed to the design of the study. MA and BM wrote the manuscript. MA and BW performed analyses. All authors revised the manuscript critically for important intellectual content and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interest.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 RobnikŠikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach Learn. 2003;53:23–69.View ArticleGoogle Scholar
 Urbanowicz RJ, Meeker M, La Cava W, Olson RS, Moore JH. ReliefBased Feature Selection: Introduction and Review. J Biomed Inform. 2018;85:189203. https://doi.org/10.1016/j.jbi.2018.07.014.View ArticleGoogle Scholar
 Yang J, Lee SH, Goddard ME, Visscher PM. GCTA: a tool for genomewide complex trait analysis. Am J Hum Genet. 2011;88:76–82.View ArticleGoogle Scholar
 Breiman L. Random forests. Mach Learn. 2001:5–32.Google Scholar
 Tibshirani R (2011) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Statistical Methodol 73:273–282.Google Scholar
 Jiang Y, He Y, Zhang H. Variable selection with prior information for generalized linear models via the prior LASSO method. J Am Stat Assoc. 2016;111:355–76.View ArticleGoogle Scholar
 Wang H, Aragam B, Xing EP (2017) Variable selection in heterogeneous datasets: a truncatedrank sparse linear mixed model with applications to genomewide association studies. In: 2017 IEEE Int. Conf Bioinforma Biomed. IEEE, pp 431–438.Google Scholar
 Brieuc MSO, Ono K, Drinan DP, Naish KA. Integration of Random Forest with populationbased outlier analyses provides insight on the genomic basis and evolution of run timing in Chinook salmon ( Oncorhynchus tshawytscha ). Mol Ecol. 2015;24:2729–46.View ArticleGoogle Scholar
 Stephan J, Stegle O, Beyer A. A random forest approach to capture genetic effects in the presence of population structure. Nat Commun. 2015;6:7432.View ArticleGoogle Scholar
 Li B, Zhang N, Wang YG, George AW, Reverter A, Li Y. Genomic prediction of breeding values using a subset of SNPs identified by three machine learning methods. Front Genet. 2018. https://doi.org/10.3389/fgene.2018.00237.
 McKinney BA, Crowe JE, Guo J, Tian D. Capturing the spectrum of interaction effects in genetic association studies by simulated evaporative cooling network analysis. PLoS Genet. 2009;5:e1000432.View ArticleGoogle Scholar
 Winham SJ, Colby CL, Freimuth RR, Wang X, de Andrade M, Huebner M, Biernacka JM. SNP interaction detection with random forests in highdimensional genetic data. BMC Bioinformatics. 2012;13:164.View ArticleGoogle Scholar
 Levinson DF, Mostafavi S, Milaneschi Y, Rivera M, Ripke S, Wray NR, Sullivan PF. Genetic studies of major depressive disorder: why are there no genomewide association study findings and what can we do about it? Biol Psychiatry. 2014;76:510–2.View ArticleGoogle Scholar
 Hyde CL, Nagle MW, Tian C, Chen X, Paciga SA, Wendland JR, Tung JY, Hinds DA, Perlis RH, Winslow AR. Identification of 15 genetic loci associated with risk of major depression in individuals of European descent. Nat Genet. 2016;48:1031–6.View ArticleGoogle Scholar
 Wray NR, Ripke S, Mattheisen M, et al. Genomewide association analyses identify 44 risk variants and refine the genetic architecture of major depression. Nat Genet. 2018;50:668–81.View ArticleGoogle Scholar
 Craddock N, Jones I. Genetics of bipolar disorder genetics of. bipolar disorder. 1999:585–94.View ArticleGoogle Scholar
 Kira K, L a R. A practical approach to feature selection. Proc Ninth Int Work Mach Learn. 1992:249–56.Google Scholar
 McKinney BA, White BC, Grill DE, Li PW, Kennedy RB, Poland GA, Oberg AL. ReliefSeq: a genewise adaptiveK nearestneighbor feature selection tool for finding genegene interactions and Main effects in mRNASeq gene expression data. PLoS One. 2013;8:e81527.View ArticleGoogle Scholar
 Collins DW, Jukes TH. Rates of transition and Transversion in coding sequences since the humanrodent divergence. Genomics. 1994;20:386–96.View ArticleGoogle Scholar
 Vrieze SI, McGue M, Miller MB, Hicks BM, Iacono WG. Three mutually informative ways to understand the genetic relationships among behavioral disinhibition, alcohol use, drug use, nicotine use/dependence, and their cooccurrence: twin biometry, GCTA, and genomewide scoring. Behav Genet. 2013;43:97–107.View ArticleGoogle Scholar
 Chen CCM, Schwender H, Keith J, Nunkesser R, Mengersen K, Macrossan P. Methods for identifying SNP interactions: a review on variations of logic regression, random Forest and Bayesian logistic regression. IEEE/ACM Trans Comput Biol Bioinforma. 2011;8:1580–91.View ArticleGoogle Scholar
 Qi Y. Random Forest for bioinformatics. Ensemble Mach Learn. 2012:307–23.Google Scholar
 Reif DM, Motsinger AA, McKinney BA, Crowe JE, Moore JH (2006) Feature selection using a random forests classifier for the integrated analysis of multiple data types. In: 2006 IEEE Symp. Comput. Intell. Bioinforma. Comput. Biol. IEEE, pp 1–8.Google Scholar
 Meng YA, Yu Y, Cupples LA, Farrer LA, Lunetta KL. Performance of random forest when SNPs are in linkage disequilibrium. BMC Bioinformatics. 2009;10(78).View ArticleGoogle Scholar
 Lareau CA, White BC, Oberg AL, Kennedy RB, Poland GA, McKinney BA. An interaction quantitative trait loci tool implicates epistatic functional variants in an apoptosis pathway in smallpox vaccine eQTL data. Genes Immun. 2016;17:244–50.View ArticleGoogle Scholar
 Davis NA, Lareau CA, White BC, Pandey A, Wiley G, Montgomery CG, Gaffney PM, McKinney BA. Encore: genetic association interaction network centrality pipeline and application to SLE exome data. Genet Epidemiol. 2013;37:614–21.View ArticleGoogle Scholar
 Purcell S, Neale B, ToddBrown K, et al. PLINK: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet. 2007;81:559–75.View ArticleGoogle Scholar
 Le TT, Urbanowicz RJ, Moore JH, McKinney BA. Statistical inference Relief (STIR) feature selection. Bioinformatics 2018 Sep. 2018:18. https://doi.org/10.1093/bioinformatics/bty788.
 GranizoMackenzie D, Moore JH. Multiple threshold spatially uniform reliefF for the genetic analysis of complex human diseases. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics) 7833 LNCS. 2013:1–10.Google Scholar
 Mostafavi S, Battle A, Zhu X, et al. Type I interferon signaling genes in recurrent major depression: increased expression detected by wholeblood RNA sequencing. Mol Psychiatry. 2014;19:1267–74.View ArticleGoogle Scholar
 Zerbino DR, Achuthan P, Akanni W, et al. Ensembl 2018. Nucleic Acids Res. 2018;46:D754–61.View ArticleGoogle Scholar
 McLaren W, Pritchard B, Rios D, Chen Y, Flicek P, Cunningham F. Deriving the consequences of genomic variants with the Ensembl API and SNP effect predictor. Bioinformatics. 2010;26:2069–70.View ArticleGoogle Scholar
 Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL (2005) Gene set enrichment analysis : A knowledgebased approach for interpreting genomewide.Google Scholar
 Tomita H, Ziegler ME, Kim HB, et al. G proteinlinked signaling pathways in bipolar and major depressive disorders. Front Genet. 2013;4:1–12.View ArticleGoogle Scholar
 Chang CC. Generalized iterative RELIEF for supervised distance metric learning. Pattern Recogn. 2010;43(8):2971–81.View ArticleGoogle Scholar
 Fish AE, Capra JA, Bush WS. Are interactions between cisregulatory variants evidence for biological epistasis or statistical artifacts? Am J Hum Genet. 2016;99:817–30.View ArticleGoogle Scholar
 Greene CS, Penrod NM, Kiralis J, Moore JH. Spatially uniform ReliefF (SURF) for computationallyefficient filtering of genegene interactions. BioData Min. 2009;2:1–9.View ArticleGoogle Scholar