While our example application is quite small in comparison to genome-wide association studies, it exemplifies some important points regarding BN structural learning. First, while the claim that a BN can represent causal relations is tempting, strictly enforcing a causal interpretation by disallowing nonsensical causal directions can limit the identification of important associations. For example, while the influence of gender and smoking on bladder cancer was clearly identified by our causal model (Figure 2), the association of cancer with the XRCC3_241 SNP was missed. This may not be surprising, given that cancer is effectively the independent variable in the process of selecting subjects for a case–control study. In the non-causal model (Figure 3), the XRCC3_241 SNP is identified as a child of CANCER (as is SMOKER), while GENDER continues to be a parent of both CANCER and SMOKER. This is consistent with the role of GENDER as a consideration in the process of selecting controls.

Interestingly, in the non-causal model, the XRCC3_04 SNP is revealed as being associated with CANCER by virtue of its membership in the Markov blanket. The V-structure that is formed at the XRCC3_241 node is a distinctive feature of BN modeling; such a structure implies that the two parents are marginally (i.e., unconditionally) independent, but become dependent when conditioned on the value of the child. A chi-squared test applied to the data confirms the fact that XRCC3_04 and CANCER are marginally independent (χ^{2} = 0.85, p = 0.36, n = 1113), but become significantly positively associated (χ^{2} = 7.75, p = 0.005, n = 558) conditional on XRCC3_241 being wildtype. A one-at-a-time search strategy (e.g. Andrew 2009) will typically miss such an association. This is an example of Simpson’s paradox, and appropriately capturing such situations in a BN allows for the accurate representation of complex relations.

In most real-world datasets, much information is lost when only complete observations are considered in statistical analysis. The EM algorithm provides a practical means for estimating model parameters without disregarding observations with missing values. In our example, this greatly increased our sample size and allowed for the discovery of toenail arsenic levels as a significant predictor of bladder cancer. TOENAIL_AS was not determined to be a parent of CANCER, but rather was found to be a parent of XRCC3_241, together with CANCER and XRCC3_04. This indicates the presence of gene-environment interactions, as reflected in the pattern of odds ratios calculated from this structure.

By focusing our attention on the role of TOENAIL_AS and its potential membership in the Markov blanket of CANCER, we were able to perform a comprehensive comparison of candidate structures after implementing the EM algorithm on each. Normally, one would be interested in comparing too many different structures to apply EM and scoring algorithms to them all. In such cases, the SEM algorithm or one of its variations would be necessary. Structural learning in the presence of missing data continues to be an active area of research.

We employed a score-based method when learning our non-causal structure and when considering TOENAIL_AS as a member of CANCER’s Markov blanket. While constraint-based methods can be more efficient, they can also be sensitive to failures in hypothesis tests. Further, if one has already given up causal claims, then a high-scoring structure is more important than a network that accurately represents causal direction. Of course, the choice of scoring metric is influential in determining which network is “best.” Because we view the process of BN structural learning as primarily an exploratory phase of data analysis that should be followed by rigorous replication studies [62], we do not see a need to rely on scoring criteria, such as the BIC, that penalize harshly for the number of edges.

Our example contained all discrete or easily discretized variables. While we believe this is generally representative of problems concerning the genetic dissection of disease, it may be that results are sensitive to the particular discretization of continuous variables. For example, based on the analysis of Andrew et al. [59], we divided toenail arsenic levels into “low” and “high” based on the 90th percentile of the observed values. There may be reasons why other thresholds would be more predictive or biologically relevant. When there is no clear basis for choosing the thresholds, other investigators have used the quartile boundaries [19]. Continuous variables could also be kept as such, in which case the conditional probability distributions in eq. (1) are represented by conditional density functions [35]. This does not necessarily present a problem for learning BN parameters for a given structure, as statistical methods – including the handling of missing data – are readily available. However, structural learning becomes significantly more difficult when variables are continuous, as the number and type of possible dependence relations and interactions becomes infinite. However, under some assumptions, such as linear relations and conditionally normal distributions, effective algorithms have been worked out [63]. Using kernel density estimators to model the conditional distribution, Hofmann and Tresp [64] were able to eliminate the reliance on normality. Further research is being conducted on this practically relevant topic [38].

We constrained ourselves here to discussing networks containing only variables for which there are some recorded data. Of course, in most cases, not all relevant aspects of a problem have been observed. Such *hidden variables* can present a problem for network structural learning, as omission of nodes effectively amounts to marginalization of the underlying joint distribution, potentially leading to complex dependencies among the remaining, observed variables. For example, in the gene-environment-disease context, data may be available on several environmental biomarkers and health outcomes, as well as a number of predisposing genetic or sociocultural factors. The relation between these is presumably mediated by the level of exposure to an environmental stressor, rendering many of the effects and predisposing factors conditionally independent. However, if the actual exposure level is not being measured, then all the observed variables will appear to be related to each other, likely in complex ways. By explicitly including a node representing a hidden variable in a network – even if there are no recorded data on that variable – the learned models are likely to be simpler and less prone to overfitting [65]. Fortunately, the structural EM algorithm can handle hidden variables analogously to how it handles missing values of otherwise observed quantities [65]. However, how to choose the number and placement of hidden variables in a BN remains an active area of research.

Despite being applied to Bayesian networks, the process of structural learning as we have described it this far has not been truly *Bayesian* in spirit. That is, we have not incorporated *prior* knowledge regarding possible structures or attempted to calculate *posterior* probabilities of model structures or features (e.g. the presence of a particular edge or the value of a particular parameter). As mentioned in Section Methodological approaches, the BIC score represents an approximation of the full marginal likelihood, which is comparable to the Bayesian posterior with a vague prior. The K2 score also represents a special case of the Bayesian posterior resulting from a uniform Dirichlet prior. A further generalization, still assuming a Dirichlet distribution but with prior knowledge able to be incorporated as an *equivalent sample size* was developed by Heckerman et al. [66] and is referred to as the Bayesian Dirichlet equivalent (BDe) score. Friedman [65] describes a version of the structural EM algorithm that directly addresses Bayesian model selection.

Even when the model selection process is Bayesian, in that it incorporates prior knowledge, typically only the single structure with the greatest posterior probability is maintained for further prediction and inference. However, when the amount of data (i.e. number of observations) is small relative to the size of the network (i.e. number of nodes or edges), there are likely to be many structures that conform with the data nearly equally well [67]. In such a situation – which is the norm when working with gene-environment-disease data with many SNPs relative to the sample size – the choice of a single “best” structure is largely arbitrary. Another data manifestation from the same population could have led to a very different final model. Zhang [68] has developed a fully Bayesian graphical method for large-scale association mapping, called BEAM3, which yields posterior probabilities of association. Yet, in most situations there are exponentially many structures that provide a “reasonable” representation of the observed data, making comprehensive investigation or communication of all “good” structures (i.e. those with a non-negligible posterior probability) impossible. For this reason, Friedman and Koller [67] propose a method for computing the Bayesian posterior of a particular structural feature, defined as the total posterior probability of all structures that contain the feature. Such features might include the presence of a particular edge between two nodes, the choice of a node’s parents, or the Markov blanket of a node. In some cases, robust assessment of such features may be more relevant to biological discovery than full articulation of (potentially fragile) network structures.