Improving the interpretability of random forest models of genetic association in the 1 presence of non-additive interactions

6 Background: Non-additive interactions among genes are frequently associated with a 7 number of phenotypes, including known complex diseases such as Alzheimer‘s, diabetes, and 8 cardiovascular disease. Detecting interactions requires careful selection of analytical 9 methods, and some machine learning algorithms are unable or underpowered to detect or 10 model feature interactions that exhibit non-additivity. The Random Forest method is often 11 employed in these efforts due to its ability to detect and model non-additive interactions. In 12 addition, Random Forest has the built-in ability to estimate feature importance scores, a 13 characteristic that allows the model to be interpreted with the order and effect size of the 14 feature association with the outcome. This characteristic is very important for 15 epidemiological and clinical studies where results of predictive modeling could be used to 16 define the future direction of the research efforts. An alternative way to interpret the model is 17 with a permutation feature importance metric which employs a permutation approach to 18 calculate a feature contribution coefficient in units of the decrease in the model‘s 19 performance. Currently, it is unclear which Random Forest feature importance metric 20 provides a superior estimation of the true informative contribution of features in genetic 21 association analysis. 22

methods, and some machine learning algorithms are unable or underpowered to detect or 10 model feature interactions that exhibit non-additivity. The Random Forest method is often 11 employed in these efforts due to its ability to detect and model non-additive interactions. In 12 addition, Random Forest has the built-in ability to estimate feature importance scores, a 13 characteristic that allows the model to be interpreted with the order and effect size of the 14 feature association with the outcome. This characteristic is very important for 15 epidemiological and clinical studies where results of predictive modeling could be used to 16 define the future direction of the research efforts. An alternative way to interpret the model is 17 with a permutation feature importance metric which employs a permutation approach to 18 calculate a feature contribution coefficient in units of the decrease in the model's 19 performance. Currently, it is unclear which Random Forest feature importance metric 20 provides a superior estimation of the true informative contribution of features in genetic 21 association analysis. 22 7 phenotype, while non-interacting features contribution effect is orders of magnitude smaller and 1 insufficient to be detected accurately by both importance metrics (Fig.4A, B). 2

Evaluation of the feature importances in real-world datasets with epistatic interactions. 3
We used HIBACHI simulation framework to obtain datasets with strong epistatic interaction 4 between the features -genetic variants and the phenotype. We were able to demonstrate that 5 regardless of the dataset's parameters, the best method to determine features' informative 6 contribution to the phenotype is PFI. To validate our findings for real-world data we analyzed 7 two datasets with complex disease phenotype which previously have been identified to have 8 epistatic interactions among SNPs. The subset of seven SNPs was selected from the Alzheimer's 9 disease data and the subset of six SNPs was selected from the Glaucoma dataset as described 10 above (see Methods 2.4). A ViSEN analysis and plot for the Alzheimer's dataset (Fig. 5A) 11 revealed one SNP with large independent effect affiliated with ApoE gene (rs429358, MI 0.08) -12 a known risk factor for Alzheimer disease, and three strong two-way gene-gene interactions 13 (rs4955208 -rs7782571, IG 0.05; rs12785149 -rs12209418, IG 0.05; rs2414325 -rs1931073, IG 14 0.05). ViSEN analysis of the Glaucoma dataset revealed two strong gene-gene interactions 15 (rs7738052 -rs1489169, IG 0.015, and rs10915315 -or rs1266924, IG 0.015). We built a 16 predictive model for each dataset with the RF classifier: the model was tuned with grid search 17 and had classification balanced accuracy 62.6% for the Alzheimer dataset, and 60 % for the 18 Glaucoma dataset. We verified the significance of the cross-validated balanced accuracy scores 19 of the optimized classifiers by doing a permutation test [33] and confirmed that both models 20 have reliable classification performance ( Figure 6). We further calculated PFI and BIC estimates 21 for the optimized RF models. For Alzheimer dataset (Fig. 7A) the most important feature, 22 according to both PFI and BIC metrics, is the SNP with the largest independent effect -23 8 rs429358. This SNP has been discovered by ViSEN analysis and expected to have strong effect 1 on the phenotype. However, the consecutive feature importance order diverged between two 2 metrics. According to PFI rank, SNPs located at second and third position belong to the same 3 interaction and valued with the highest IG out of all calculated pairwise interactions for this 4 dataset (Fig 5A). At the same time, SNPs located at the second and third position by BIC ranking 5 didn't create a strong pairwise interactions with each other. We observed a similar discrepancy 6 between the metric rank evaluations for the Glaucoma datasets ( Fig 7B): while SNP rs2157719 7 has the largest main effect among all SNPs and was assigned to be the top feature by both 8 metrics, the following order of SNPs was outlined conversely by different metrics. SNP 9 rs10915315 has the largest number of detected pairwise interaction among all SNPs and is 10 ranked second by PFI, while only fourth by BIC; SNP rs1489169 has three pairwise interactions 11 detected and is ranked third by PFI and second by BIC. Interestingly, both metrics identified the 12 less important and non-important SNPs in the same order, suggesting that the issue with the 13 discrepancy between the metrics evaluations should be address for the top features only. 14

Discussion 15
Multiple research studies suggested that epistatic interactions are widespread by nature and 16 that many genes work in an interactive manner [5]. Epistatic interactions are expected to be 17 found in a variety of pathophysiological processes with one of them most likely to be 18 Alzheimer's disease [8]. The most powerful predictor of Alzheimer's disease at this time is 19 ApoE E4 gene variation: one or two copies of ApoE is associated with an increased risk of 20 disease onset [34]. However, some carriers of ApoE E4 variation haven't developed an 21 Alzheimer's disease so it is very likely that other genetic factors are involved in disease's 22 pathophysiology. ViSEN entropy-based analysis revealed several strong pairwise genetic 23 9 interactions, along with the known largest independent signal from the ApoE variant (rs429358) 1 (Fig. 5A). Furthermore, ViSEN method allocated epistatic interactions within the Glaucoma 2 disease dataset: several strong pairwise interactions in addition to the independent main effect 3 contribution from the SNP affiliated with retinal ganglion cells pathology (rs2157719) have been 4 confirmed (Fig. 5B). Evaluation of an informative contribution of a single genetic factor 5 involved in the two-or three-way interactions towards the phenotypic outcome is a challenging 6 analytical task and it requires a non-linear solution which RF classifier is notorious for. 7 Therefore, we aimed to identify whether RF classifier is capable to detect genetic signatures that 8 were previously confirmed by ViSEN analysis and whether different RF's feature importance 9 metrics will be able to agree on the rank. 10 Two feature importance metrics were considered, PFI and BIC, and each was compared after 11 RF analysis of data derived from genome-wide association studies of Glaucoma and 12 Alzheimer's. The resulting feature ranking confirms the lack of consensus between the studied 13 metrics (Fig 6). Indeed, while BIC and PFI identified the SNPs with the largest independent 14 main effects (rs429358 and rs2157719) as a top feature in both real-world datasets, genes 15 involved in the epistatic interactions have been assigned different importance ranks by different 16 metrics. More specifically, in the Alzheimer data, the second and the third most important 17 feature predicted by PFI belong to the same interacting pair (rs4955208 and rs7782571), while 18 the same rank positions predicted by BIC were occupied by SNPs that do not belong to the same 19 genetic interaction (Fig 5A, Fig 7A). A similar discrepancy has been observed in the Glaucoma 20 dataset: PFI and BIC indicated different interacting pairs of SNPs as the second and third most 21 important feature (Fig. 7A), however SNP predicted to be the second most important by PFI is a 22 10 part of four additional SNP-SNP interactions, while SNP predicted second by BIC is a part of 1 only three interacting SNP pairs (Fig. 5B). 2 Such uncertainty has been associated with RF predictions in the past and we attempted to 3 reveal the true interpretation with the computational experiments driven by HIBACHI 4 simulations. The HIBACHI framework has the ability to consider any desirable biological 5 concept in the form of mathematical expressions that define the genotype-phenotype relationship 6 and evolve models that can be used to simulate data consistent with that relationship. We set up a 7 simulation goal to maximize two-or three-way interactions among features and compared RF's 8 feature importance metrics with the sensitivity analysis results of the simulated data that 9 provided us with the ground truth information about the feature ranks ( Figure 2). In all 10 HIBACHI experimental setups, which included such factors as the proportion of cases and 11 controls, sample size and interaction complexity, PFI metrics produced the most precise feature 12 ranking (Table 1, Fig. 3). Although BIC metric misplaced feature ranks for the majority of the 13 replicates, it correctly identified features that belong to the interactive pair or trio by putting them 14 as a top two or three features correspondingly. Therefore, we can suggest that BIC metric can 15 still be used with some caution, however, when there is a need for an absolute precision, PFI 16 estimation method should be used. 17 PFI has some limitations we didn't cover in our experimentsit is particularly sensitive to 18 highly correlated features. In case such features are present in the dataset, PFI requires a special 19 preprocessing directed onto the removal of such features. An example of this could be a 20 hierarchical clustering based on the Spearman's rank correlation coefficients with following 21 cluster-based filtering. In future studies, a more advanced permutation feature importance 22 techniques may be considered which can include pairwise permutation importance, joint 23 importance by maximal subtrees, and joint variable importance as well as corrected Gini 1 importance score [35]. In addition, a more complex epistatic schemes needs to be explored (e.g. 2 different combinations of marginal and interaction effects as in [36], multiple strong interactions 3 and/or combination of two-and three-way interactions). Within our simulation framework we 4 mostly observed one or two strong interactions and this might be an overly simplistic given the 5 complexity of common diseases such as Alzheimer's and glaucoma. 6

Conclusion 7
In this study, we performed a comparative analysis of feature importance metrics with the aim 8 to improve Random Forest's interpretability in datasets with complex interactions. By analyzing 9 both real and simulated data, we established that the permutation feature importance metric 10 provides more precise feature importance rank estimation in the presence of non-additive 11 interactions. 12

Methods 13
Random forest and its properties 14 RF is a popular ML algorithm because it often demonstrates good performance while 15 remaining relatively easy to optimize and interpret. RF algorithms belong to a Bagging 16 (Bootstrap Aggregation) type of ensemble ML methods where a group of weak learners in a 17 form of decision trees (DT) classify the outcome using majority vote. Decision trees are sensitive 18 to the data they are trained and often suffer from high variance problem especially when the 19 depth of the tree is high. To address that RF algorithm trains each tree on a subset of samples 20 drawn from the complete dataset with the bootstrap procedure. An additional source of 21 randomness within the RF algorithm is introduced during the construction of a tree when 22 selecting a split node from the random sample of features (in place of the greedy search through 23 12 all feature set like in DT). These two sources of randomness aim to decorrelate weak learners 1 and correspondingly decrease the variance of an estimator by combining diverse trees prediction 2 via majority vote. 3 During the construction of a DT, the decrease in the error function can be calculated for a 4 feature at each split node. In a classification task, this is often done via estimating the Gini score 5 or the entropy score. The function decreases can be averaged across all trees and returned as 6 feature importances score (the greater the decrease the higher feature importance). Feature 7 importance scores are often conveniently implemented as a RF function which makes this 8 method more interpretable. 9 Permutation feature importances 10 Permutation feature importance metrics were first introduced by Breiman in his Random 11

Forest manuscript [37] and further extended by Altmann [38] to correct for the bias of the RF's 12
Gini importance and entropy criterion for feature selection. We utilize a custom implementation 13 of PFI which could be applied to any machine learning classification and regression algorithms 14 ( Figure 1). Here, PFI metric is calculated with following steps: 1) the dataset is shuffled and split 15 into the training and testing datasets 2) the model is fitted on the training dataset and the 16 balanced accuracy is estimated on the testing dataset, 3) feature 1 out of N is permuted for the 17 testing dataset 4) the balanced accuracy is estimated on the permuted testing dataset 5) the 18 relative decrease of the permuted and non-permuted balanced accuracies is calculated and stored 19 as relative decrease in accuracy, 4) step 2 and 3 are repeated for the remaining N-1 features, 5) 20 steps 1 through 4 are replicated for N-1 times with the new seeds for shuffle and split procedure, 21 6) mean of relative decrease in accuracy per feature is calculated across the splits and is used as 22 features' PFI value. Retrieved PFI values were normalized to sum to 1. At the beginning of the GP optimization process, a population of individuals with random 22 mathematical expression trees is initialized and is further subjected to mutational and 23 14 recombinational processes. This process serves as a source of variation for the expression trees 1 and have a pre-defined rate. After that, a user-defined fitness function is calculated and the best-2 fitted individuals are selected for the next round. At the last optimization round, an overall best-3 fitted individual is used as an output dataset. 4 For the aims established within this study, we wanted to generate datasets with non-additive 5 epistatic interactions that would involve two and three genetic variants and, therefore, we have 6 defined a fitness function that maximizes two-way and three-way information gain (IG) term. 7 Entropy-based IG approach to detect epistatic interactions has been introduced by Moore et al. 8 [41] and extended by Fan et al. [42]. Two-way IG defined as: 9 Where I is mutual information that describes the dependency between variables X, Y and Z. 11 It measures the reduction of uncertainty of one variable (Z) given the knowledge of others (X 12 and Y) It is expressed with entropy terms: 13 Where entropy H defined with the probability mass function: 15 Definitions of three-way IG can be found in Hu et al., 2013 [43]. We used the following version 16 of this term: 17 to encourage multiple combinations of genetic interactions. In addition to the variability in the 2 interaction complexity, the following factors have been considered in the HIBACHI 3 experimental schemes: percent of cases (25% and 50%) to address an imbalanced dataset 4 structure, and sample size (1000 and 10000). Each experimental setup was reproduced 100 times 5 using random seed generator and the whole population of replicates was considered in the 6 consecutive analysis. All simulated data used here is available upon request. 7

Sensitivity analysis 8
We implemented a HIBACHI-based sensitivity analysis to determine the true feature 9 importances ranks and the effect sizes. For this the permutation-based framework was 10 implemented with the following steps ( Figure 2): 1) split HIBACHI-generated dataset into the 11 outcome vector and the feature set 2) permute feature vector 1 out of N and re-evaluate the 12 outcome by applying the HIBACHI-generated expression tree 3) calculate the dissimilarity of the 13 outcome as of mismatch between the perturbed and unperturbed feature set 4) repeat the estimate 14 2) for the remaining feature and normalize the counts by the total sum; 5) replicate steps 2-4 100 15 times and calculate the average of the replicates per feature as a final true feature importance 16 score. Sensitivity scores were further normalized to sum to 1. 17 Real world data analysis. 18 To examine the convergence of the RF's feature importance metrics we used two real-world 19 datasets with evidence for non-additive interactions. The first includes preselected SNPs from a 20 genome-wide association study of Alzheimer's Disease while the second includes preselected 21 SNPs from a genome-wide association study of Primary Open Angle Glaucoma (POAG). The 22 Alzheimer's dataset came from the Alzheimer's Disease Neuroimaging Initiative during which 23 16 functional MRI was taken every six to twelve months for patients with three health conditions 1 (neuro-typical (identified here as controls), and mild cognitive impairment and Alzheimer's 2 disease (identified here as cases)). A computational evolution system [44] identified a model of 3 seven SNPs with evidence of non-additive interactions and a classification accuracy of 0.738. 4 Among these SNPs are the SNP with the large main effect that is located in the APOE gene 5 (rs429358)a known risk factor for Alzheimer disease, four SNPs located within genes with 6 known functionality/disease state (rs1931073an intergenic region near the PPAP2B gene that 7 is participating in cell-cell interactions, rs7782571near the ISPD gene that is associated with 8 the Walker-Walburg syndrome, rs4955208in the OSBPL10 genes which are expressed into 9 intracellular lipid receptor, rs12209418in the PKIB gene that codes a protein kinase inhibitor) 10 and two remaining SNPs (rs2414325, rs12785149) are located within genes with unknown 11 functionality. 12 The glaucoma dataset came from the Glaucoma Gene Environment Initiative study and 13 contained with POAG individuals identified as cases and healthy individuals as controls. This 14 dataset has been previously analyzed by Moore at al. [45] with the EMERGENT algorithm that 15 resulted in the identification of a model of six SNP's with evidence of non-additive interactions 16 and a classification accuracy of 0.615. Two of these SNPs (rs2157719, and rs1266924) are 17 located within the genes that were previously associated with glaucoma disease, two SNPs 18 (rs10915315, and rs1489169) are located within the genes that are associated with glaucoma-19 non-related diseases and relevant pathology, and two more SNPs (rs936498, and rs7738052) are 20 located within the genes that were not previously associated with any disease, but have a known 21 functionality that is relevant to visual cortex and retina development. 22 We used the visualization of the statistical interaction network (ViSEN) method [46] to 1 analyze and visualize SNP main effects, and two-way and three-way gene-gene interactions 2 among SNPs for real-world datasets. The ViSEN method calculates the mutual information (MI) 3 between individual SNP (genotype) and the phenotype, the pairwise interaction between every 4 pair of SNPs and the phenotype and the three-way interaction between every combination of 5 three SNPs and the phenotype via the IG term. A positive IG indicates synergistic (i.e. non-6 additive) effects of SNPs on the phenotype. The IG metrics used in the ViSEN method were 7 designed to detect pure epistatic interactions and excluded all lower-order effects (by subtracting 8 all main effects and pairwise synergies in cases of the three-way term). 9

Consent for publication 19
Not applicable 20 Availability of data materials 21