This article has Open Peer Review reports available.
Discovering feature relevancy and dependency by kernelguided probabilistic modelbuilding evolution
 Nestor Rodriguez^{1} and
 Sergio Rojas–Galeano^{1}Email author
Received: 5 August 2016
Accepted: 14 February 2017
Published: 15 March 2017
Abstract
Background
Discovering relevant features (biomarkers) that discriminate etiologies of a disease is useful to provide biomedical researchers with candidate targets for further laboratory experimentation while saving costs; dependencies among biomarkers may suggest additional valuable information, for example, to characterize complex epistatic relationships from genetic data. The use of classifiers to guide the search for biomarkers (the so–called wrapper approach) has been widely studied. However, simultaneously searching for relevancy and dependencies among markers is a less explored ground.
Results
We propose a new wrapper method that builds upon the discrimination power of a weighted kernel classifier to guide the search for a probabilistic model of simultaneous marginal and interacting effects. The feasibility of the method was evaluated in three empirical studies. The first one assessed its ability to discover complex epistatic effects on a large–scale testbed of generated human genetic problems; the method succeeded in 4 out of 5 of these problems while providing more accurate and expressive results than a baseline technique that also considers dependencies. The second study evaluated the performance of the method in benchmark classification tasks; in average the prediction accuracy was comparable to two other baseline techniques whilst finding smaller subsets of relevant features. The last study was aimed at discovering relevancy/dependency in a hepatitis dataset; in this regard, evidence recently reported in medical literature corroborated our findings. As a byproduct, the method was implemented and made freely available as a toolbox of software components deployed within an existing visual data–mining workbench.
Conclusions
The mining advantages exhibited by the method come at the expense of a higher computational complexity, posing interesting algorithmic challenges regarding its applicability to large–scale datasets. Extending the probabilistic assumptions of the method to continuous distributions and higher–degree interactions is also appealing. As a final remark, we advocate broadening the use of visual graphical software tools as they enable biodata researchers to focus on experiment design, visualisation and data analysis rather than on refining their scripting programming skills.
Keywords
Background
Relevancy estimation techniques are aimed at finding subsets of markers in datasets with a high number of dimensions, where noisy, redundant or irrelevant variables abound. The selected features may become targets of more detailed studies requiring expensive experimentation or human expertise, thus saving costs and time not spent on the discarded variables. This problem of selecting the relevant variables can be regarded as a search procedure over the space of all possible combinations of variable subsets, therefore, an NPHard problem [1]; similarly, finding the underlying structure of a graph representing dependencies between those variables is also combinatorial [2]. Thus the need of using approximating, iterative methods is an alternative to find suitable solutions.
Research in techniques for discovering relevant variables is a very active field in the data mining community (see e.g. [2–7]), and has attracted much attention in the last two decades [8]. The filter approach ranks the variables according to a linear criterion such as their correlation to the prediction target; they are computational simple, but usually fails to capture non–linear patterns of discrimination.
In contrast, the wrapper approach performs the search guided by the classification accuracy of a discriminant rule that evaluates the suitability of a subset of features; although computationally more demanding, this setting is able to find smaller and more discriminative sets of relevant features, specially when non–linear concepts are hidden. In this respect several approaches have been proposed previously using different metaheuristics, probabilistic assumptions and discrimination techniques. One particular flavor uses probabilistic model–building genetic algorithms combined with well–known classifiers (a review of applications of this approach in the bioinformatics domain can be found in [9]). This kind of algorithms simultaneously estimate the parameters of a probabilistic relevance model of the variables and the structure of a graph representing relationships among them. Recent studies using a Bayes network as such model have reported promising results in discovering interactions in genetic data [2, 6, 10, 11].
In a similar vein, here we describe a novel method that models relevancy and dependency by coupling a weighted kernel machine for pattern classification [12] into a probabilistic–based genetic algorithm [13] for dependency estimation. Previous studies considered combining classical and probabilistic genetic algorithms with weighted kernel classifiers for relevancy–only discovery [14, 15]; our contribution in this paper is to extend those approaches to take advantage of the discrimination power of a weighted kernel classifier to guide the search for a probabilistic model that simultaneously estimates marginal and interacting effects among the features in a discrimination problem.
Method
Previous work
Overview of probabilisticbased genetic algorithms
This kind of genetic algorithms are stochastic search techniques that evolve a probability distribution model from a pool of solution candidates, rather than evolving the pool itself. The distribution is adjusted iteratively with the most promising (suboptimal) solutions until convergence. Hence, they are also known as Estimation of Distribution Algorithms (EDA, for short). The generic estimation procedure is shown in Algorithm 1. Step (1) initialises the model parameters θ. Step (2) is the loop that updates the parameters θ until convergence. Step (3) samples a pool \(\mathcal {S}\) of n candidates from the model. Step (4) ranks the pool according to a cost function f(·) and chooses the topranked into \(\mathcal {B}\). Step (5) reestimates the parameters θ from this subset of promising solutions.
The actual realisation of each step in the generic template of Algorithm 1 defines different types of EDA s: discrete or continuous parameters; binomial, multinomial or Gaussian distributions; univariate, bivariate or multivariate dependencies, see among others: [14, 16–23]. Our approach to the problem of interest is based on the following assumptions: discrete parameters, Gaussian distributions, bivariate dependencies.
 1.
Build a disconnected graph \({\mathcal G}({\mathbf {V}}, {\mathbf {E}}_{t})\), with V the set of problem variables, and E _{ t } the set of variable interactions determined by a bivariate Pearson χ ^{2} dependency test criterion: \({\mathbf {E}}_{t} = \{(i, j) \in {\mathbf {V}} \times {\mathbf {V}} : i \neq j \land \chi _{ij}^{2} \geq 3.84\}\). Here the statistic χ ^{2} is computed from the current candidate pool \(\mathcal {B}\) at iteration t.
 2.
Compute MSF(E _{ t }) representing variable dependencies. Build the set of root nodes R _{ MSF } by choosing at random one node of every component in MSF(E _{ t }).
 3.Estimate parameters {ρ _{ i }:i∈R _{ MSF }} and {ρ _{ ij }:(i,j)∈E _{ MSF }} using frequentist updates (see Eq. (2)), again over the current candidate pool \(\mathcal {B}\) at iteration t (N.B. Here, [c]=1 if the argument c is true or 0 otherwise).$$\begin{array}{*{20}l} \rho_{i}^{a} = \sum_{k = 1}^{n} \left[\mathcal{B}_{ki} = a\right],\quad \rho_{ij}^{ab} =\sum_{k = 1}^{n} \left[\mathcal{B}_{ki} = a \land \mathcal{B}_{kj} = b\right], \end{array} $$(2)
Overview of kernel machines for pattern classification
As it was mentioned earlier, we use a classifier that guides the search for more suitable subsets of features (Step 4 of Algorithm 1). From the many pattern classification techniques, kernel machines [12] have shown outstanding performance on diverse problem domains; thus we chose them as base classifiers for our method.
Kernel machines classify patterns using a linear combination of nonlinear mappings, known as kernel functions, evaluated on the current input instance i over the observed instances in the past j<i, using the rule \(\hat {y}_{i} = \mathsf {sign}\left (\sum _{j=1}^{i1} \alpha _{j} \kappa ({\mathbf x}_{j},{\mathbf x}_{i})\right)\), where \(\hat {y}_{i}\) is the class prediction and the coefficients {α _{ j }} are learnt with standard linear discriminant algorithms such as the Perceptron or the SVM [12].
Modified weighted versions of these kernels incorporate scale factors w={w _{1},…,w _{ ℓ }:w _{ i }∈ [ 0,1]} for each of the ℓ dimensions (i.e. variables) in order to modulate their contribution to the total computation [14, 26, 27]. The weighted RBF and weighted polynomial kernels are then defined as \(\kappa _{\sigma }({\mathbf x},{\mathbf z};{\mathbf w}) = \mathsf {exp}\left (\sigma \sum _{i=1}^{\ell } w_{i}(x_{i}z_{i})^{2} \right)\), and \(\kappa _{d}({\mathbf x},{\mathbf z};{\mathbf w}) = \left (\sum _{i=1}^{\ell } w_{i}(x_{i} \cdot z_{i}) \right)^{d}\), respectively. Here we remark that as w _{ i } is closer to 1, its associated variable becomes more relevant since it contributes a larger magnitude to the final value of the kernel computation. It is in this sense that we interpret the vector w as representing the relevancy distribution of the variables for the purposes of classification. Accordingly, classification performance will guide the EDA to estimate these relevancy factors.
where \(\widetilde {{\mathbf w}} = \{\sqrt {w_{1}},\ldots,\sqrt {w_{\ell }}\}\) and ⊗ denotes the component–wise product. The case of the weighted polynomial kernel is analogous. This observation was originally pointed out in [14] and more recently in [27].
Related methods for relevancy and dependency estimation
Some previous studies have considered a multiobjective approach for simultaneous optimisation of accuracy and relevance distribution. Two representative techniques utilise EDA s to estimate the parameters of a probability model from which the relevance factors are sampled. One of such approaches, the EBNA algorithm [28]) uses a multivariate probabilistic model that incorporates second–order dependencies between the variables. The search of relevancy and dependencies is guided by the discriminatory power of a NaiveBayes classifier. The authors report promising results in finding suitable variable subsets with good generalisation performance, although the benefit of obtaining insights about relationships between variables is traded–off with an overhead in computational complexity.
On the other hand, wKiera is a wrapper approach for feature relevance estimation that combines EDA s and kernel machines [14]. The estimation is carried out using an array of scale factors coupled to a weighted kernel machine whose classification accuracy guides the search for the relevance distribution using an UMDA algorithm [13]. The authors reported encouraging results compared to filter methods in discovering relevant variables on a number of different classification tasks, including problems with linear and nonlinear hidden concepts in very–high dimensional spaces. The algorithm however, does not retrieve additional information about the interactions between the relevant variables, because it assumes they are conditionally independent. In this respect, wKiera differs from the method we propose in this paper which, despite combining also a kernel machine with an EDA, estimates relevance based on a probabilistic model of bivariate interactions, obtaining a network of dependencies that may provide additional insights regarding the combined effects of related features, as we shall explain in the next section.
Lastly, another wellknown approach to treat dependencies is ReliefF [29]. This technique has been used to estimate feature quality in prediction and regression tasks. It can be applied as a previous step (filter) to feature subset selection. In contrast to other filter techniques assuming conditional independence of the features (i.e. correlation coefficient, information gain or Gini index), ReliefF detects local context interactions between variables and use that information during estimation of their relevancy. In this way, it is able to analyse combined effects due to dependencies among relevant features.
The scores computed by ReliefF are positive for relevant features and negative for irrelevant ones. Although it does not provide explicit information about the dependencies, this technique has proven fast and effective for relevance estimation on problems with strong feature interactions, where other filters become myopic and fail to find them [30].
Proposed algorithm

When building the correlation graph, the Mutual Information (MI) criterion [31] was additionally considered to estimate dependencies between arbitrary pairs of variables, that is, to the extent to which they share information. A third Combined Mutual Information and pvalue (SIM) criterion was also considered; the latter mixes both statistical and information–theory dependence [32]. Consequently, the rule to compute the edges on the dependency network was modified to that in Eq. (3):$$ {\mathbf{E}}_{t} = \left\{(i, j) \in {\mathbf{V}} \times {\mathbf{V}} : i \neq j \land \mathsf{any\_of} \left(\left\{\chi^{2}_{ij} \geq 3.84, \mathsf{MI}_{ij} > 0, \mathsf{SIM}_{ij} > 0\right\}\right)\right\} $$(3)

When choosing the root nodes R _{ MSF } of the dependency network (forest), instead of selecting at random we introduced another information–theory criterion that selects nodes minimising the marginal entropy H(·) in each connected component V _{ k } of the network, \(\bigcup _{k} {\mathbf V}_{k} = {\mathbf V} ~\land ~ \bigcap _{k} {\mathbf V}_{k}\ = \emptyset \), as stated in Eq. (4). The marginal entropy is computed frequentistwise from the current candidate pool \(\mathcal {B}\) at iteration t. The rationale behind the introduction of this criterion is that those nodes with lowest entropy are richer in information content, and thus good candidates to become independent parents of the dependency subnetworks (in this sense this criterion was originally proposed in the MIMIC algorithm [17]).$$ {\mathbf R}_{\mathsf{MSF}} =\left\{r_{k}: r_{k} = \underset{i}{\arg\min}\ H\!\left(X_{i} \in {\mathbf V}_{k}\right)\right\} $$(4)

Finally, candidate relevance factors are sampled from the current probability model and incorporated to a population including the previous best solutions found: \(\mathcal {S} \gets \mathsf {sample}(P(X;\boldsymbol {\theta }),\frac {n}{2})\cup \mathcal {B}\). Each candidate in this population \({\mathbf w}_{k} \in \mathcal {S}\) is assessed by building a weighted SVM and obtaining its classification accuracy with a 5–fold cross–validation on the modified dataset \(\widetilde {D} \gets D \otimes {\mathbf w}_{k}\). The population is ranked by best accuracies and the top candidates are selected. These candidates are then used to re–estimate the dependency network and the relevance parameters of the probabilistic model, and the process iterates until these parameters converge. Further details and specification of Kiedra are given in the Additional file 1: Additional Methods and Tools section.
Empirical study
In this section we report the results of a number of experiments designed to validate the feasibility of the proposed method. Initially we provide details about our implementation platform. Then we describe the first experiment aimed at testing the ability of the method in discovering epistasis on generated human genetic datasets; there we used ReliefF as a baseline to compare, for it is also a method that treats feature interactions. Our empirical study continued with a second experiment designed to compare the proposed approach with other EDAbased and kernelbased methods on some benchmark classification problems. Lastly, we conducted a third experiment intended again to discover relevancy and dependencies in a medical domain, specifically on a hepatitis dataset, this time corroborating the results with recent findings in the literature of that disease.
Implementation
Experiment 1: Relevancy and dependency in genetic epistasis
Genetic epistasis refers to those complex genegene interactions that may trigger susceptibility to a common human disease. Instead of characterising a single nucleotide polymorphism (SNP) as an isolated marker of a disease, epistasis assumes a combination of markers is in fact associated to the phenotypic manifestation of the disease. Hence, epistasis in genetic datasets is an interesting target for simultaneous relevance and dependencies mining.
Description of simulated epistatic problems (see [35] for further details)
Problem  Datasets  Simulated SNPs  Noisy SNPs  Disease status  Instances 

3way  100  3  5 and 10  0 or 1  3000 (balanced) 
4way  100  4  5 and 10  0 or 1  3000 (balanced) 
4wayNL  100  4  5 and 10  0 or 1  3000 (balanced) 
5way  100  5  5 and 10  0 or 1  3000 (balanced) 
5wayNL  100  5  5 and 10  0 or 1  3000 (balanced) 
For each dataset, the experiment was conducted following the scheme shown in Fig. 4. On the one hand, once the noise was added, features in the polluted dataset were scored using ReliefF; those obtaining positive scores were labeled as relevant, otherwise as irrelevant. On the other hand, Kiedra experiments were executed as follows. Firstly the modified dataset was split in training and testing subsets (25%/75%); then an SVM was parameterised (C=100, RBF kernel with σ=10) and wrapped–up in a weighted kernel machine along with the data. The latter was then provided as the cost function to evolve a BMDA (n=20,i t e r=80, SIM criterion). In view of the stochastic nature of Kiedra, the above protocol was repeated 30 times with the resulting scale factors being collected and averaged; then a cut–off threshold of 0.7 was applied to select relevant from irrelevant features.
Let us examine first the plots in the lefthand column of the figure. In the first two problems, 3way and 4way, Kiedra was able to discover the correct number of relevant and noisy features within the whole collection of datasets. Likewise, in problems 4wayNL and 5way, Kiedra only missed a few 2 datasets in each problem. ReliefF in turn, discovered correctly up to 84, 97 and 93 datasets in problems 3way, 4way and 5way, achieving a lower rate of 61 hits for problem 4wayNL. These results hint at the ability of Kiedra to discover epistatic effects even with coexisting uninformative markers. ReliefF shows a comparable trend, except in problem 4wayNL where probably the higher–order dependencies causes some trouble so as to find the correct relevant features.
Now, let us discuss the results shown in the plots of the right–hand column of the figure. We recall that these problems were modified with twice the number of polluted features. A similar trend can be seen for the behaviour of Kiedra. The method achieved a correct hit rate of 98 out of 100 in problems 3way and 4way; this rate was down to 80 in problem 4wayNL and slightly lower in 5way. On the other hand, ReliefF was adversely affected by such level of noise, for it obtained correct hit rates of 44, 39, 3 and 12 respectively.
Finally, let us comment on the plots of the last row in the figure (problem 5wayNL), whose results differ amply from those reported previously. In this problem, ReliefF seems to perform better in comparison with Kiedra, although not a conclusive trend is evident. In the 5noisy features problem it is able to find the correct coordinate (5,5) in about 10 hits, but the most frequent result was coordinate (4,4) with 21 datasets. Likewise in the 10noisy features problem the most frequent discovered coordinate was (3,5) while the optimal (5,10) was never found. On the other hand, Kiedra clearly underperformed in this problem as their findings are highly biased to coordinates where the correct number of noisy features are identified, but in contrast few or none of the relevant are found. We recall that the originators of these epistatic datasets reckon that this is the hardest problem, as its etiology comprises interactions of the entire set of 5 SNPs, while no lowerdegree interactions were enabled, as opposed to the other problems [35]. We remark however, that the dependency graph in which Kiedra is based assumes a bi–variate probability distribution, which may explain why it fails in modeling higher–order interactions appropriately.
Experiment 2: Feature relevance discovery in benchmark classification problems
Description of benchmark classification datasets
Dataset  Variables  Classes  Instances 

Ionosphere  34  2  351 
Soybean  35  19  307 
Horse–colic  27  24  368 
Annealing  38  6  798 
Image  19  7  2310 
One experiment was conducted per each dataset as follows. The dataset was initially preprocessed as to fill–in missing values with a Naive Bayes classifier and to normalise within a [ 0,1] real interval. The processed dataset was then randomly split into training and test subsets of equal size. These subsets are the inputs for the cross–validation scheme used to estimate the accuracy of each candidate solution. For each method, 10 repetitions were executed with different random splits. Average statistics were collected using the BlackBoxTester widget.
Average number of relevant variables discovered in each dataset
Dataset  Raw  EBNA  wKiera  Kiedra 

Ionosphere  34  13.40±2.11  7.30±0.82  7.20±0.92 
Soybean  35  6.10±1.85  5.10±0.99  6.50±1.58 
Horse–colic  27  18.90±2.76  16.40±1.90  16.60±2.12 
Annealing  38  20.50±3.13  9.60±0.97  9.40±1.43 
Image  19  8.00±0.66  7.72±1.24  7.45±1.36 
Average prediction accuracy in each dataset
Dataset  EBNA  wKiera  Kiedra 

Ionosphere  92.40±2.04  98.07±1.36  98.49±0.9 
Soybean  83.93±1.58  90.31±1.94  89.89±2.31 
Horse–colic  88.64±1.70  82.31±2.78  81.42±3.81 
Annealing  94.10±3.0  76.3±3.76  76.42±4.49 
Image  88.98±0.98  89.55±1.28  90.29±1.78 
Average runtime performance (only available for wKiera and Kiedra)
Dataset  wKiera  Kiedra  

Evaluations  Time (secs.)  Evaluations  Time (secs.)  
Ionosphere  2893.8±331.59  93.11±30.21  3009.00±838.81  87.62±36.06 
Horse colic  1826.20±251.74  43.02±13.19  2258.20±507.71  75.07±25.68 
Soybeanlarge  3501.00±437.77  151.04±38.42  4747.00±1647.87  141.34±48.16 
Annealing  1059.40±159.38  87.76±15.42  1160.20±204.92  67.41±14.11 
Image  1246.60±299.86  31.14±9.15  1541.80±860.79  44.76±24.40 
Experiment 3: Relevancy and dependency estimation on a medical domain
The experiment was implemented using the same Kiedra testbed of Fig. 4; we simply changed the data source. The data was preprocessed and sampled for training and testing subsets as before. This protocol was repeated 100 times in order to prevent biased results due to randomness in the proposed method. Relevancy factors of the best solution found in each repetition were collected; then, variables discovered in more than half of the repetitions yielding accuracies greater than 80%, were selected as relevant (see the relevancy heatmap of Fig. 7(b)).
Additional variability was induced by shuffling the order of the patients in the dataset in each repetition. As a result we noticed dissimilar dependency trees were found. Thus, these trees were aggregated into a single graph, accounting the strengths of the dependencies as proportional to the number of times they showed up during the repetitions (see Fig. 7(c)). Lastly, in order to estimate the final pairwise dependencies, we computed the minimumspanningtree on this aggregated graph, using the inverse of the counts as edge costs and applying Kruskal’s algorithm [38] (see Fig. 7(d)).
Kiedra found a total of nine relevant variables: Sex (X _{2}), Malaise (X _{6}), Liver big (X _{8}), Liver firm (X _{9}), Spleen palpable (X _{10}), Spiders visible (X _{11}), Ascites occurrence (X _{12}), Albumin level (X _{17}) and Prognosis (X _{19}). Besides, according to Fig. 7(b), the method found strong evidence of relevancy in subset {X _{6},X _{19}}, followed by fair evidence of relevancy in {X _{8},X _{9},X _{10},X _{11},X _{12}}, and lastly, borderline evidence in {X _{2},X _{17}}. The first subset indicates, not surprisingly, that prognosis by histology is probably the most effective predictor of the disease, although being expensive and risky of complications [39]; similarly, malaise is seemingly correlative with the disease and it is a symptom usually reported by patients [40].
In contrast, the second subset correspond to more diseasespecific symptoms: hepatomegaly (liver oversizing and stiffness) and splenomegaly (spleen enlargement) commonly reflect severity of liver damage [39, 41], spider nevi are visible in patients with the different variants of the disease [42, 43], and ascites has been reported as being strongly associated with hepatic dysfunctions [44]. Regarding the last subset, albumin is a protein synthesised in the liver, so it is reasonable to correlate changes in its level with infection with hepatitis. The peculiar finding here is Sex (X _{2}), a nondiseasespecific variable that nonetheless, has been recently linked to treatment response and survival rates with other unexpected features such as race (female, white) [39, 44]. In addition, it is worth noting that the relevant dependencies found by our method are between this variable and the other diseasespecific X _{9},X _{10},X _{11},X _{12} predictors mentioned earlier.
We remark that these findings are corroborated by other related studies, such as [45] suggesting that variables X _{19}, X _{11}, X _{17} and X _{12} were highly indicative of the diagnosis. Likewise, [46] applied a method that discovered the subset of variables {X _{6},X _{17},X _{14},X _{19},X _{11}} as relevant, with further experimentation finding predictive value in variable X _{2}. Other studies using information theoretic, statistical and regularisation learning methods [47] as well as various machine learning and bioinspired techniques [48] also reported these variables in their relevant subsets, or report subsets with similar sizes (10–12 variables) obtaining similar prediction accuracies between 80–85% [49].
On the other hand, the following variables were characterised as not explanatory by Kiedra: Age (X _{1}), Steroids detected (X _{3}), Antivirals applied (X _{4}), Fatigue (X _{5}), Anorexia (X _{7}), Varices exposed (X _{13}), Bilirrubin level (X _{14}), Alkaline PO4 level (X _{15}), SGOT level (X _{16}) and Protein level (X _{18}). From this subset, it causes surprise bilirrubin not being discovered as indicative of the disease, as this protein is responsible of jaundice, the most common symptom related to fulminant hepatitis; we speculate that this may be due to the fact that bilirrubin levels differ depending on the type of illness and duration: acute or chronic, viral, drug–induced or autoimmune [44]. Unfortunately, in this dataset such information was not available. The other potential marker included in this subset is the alkaline phosphate level; however, some clinical studies have shown that this enzyme maintain normal levels during hepatitis infection, although it may raise in other hepatic–related injuries such as cholestasis [40].
No further evidence of other data mining studies assigning relevancy in the remainder variables was found [37, 47, 48]. Notice that consequently, we also regarded the dependencies associated with these variables as not relevant for the prediction of the disease outcome.
Conclusion
We have described a method to tackle the dual combinatorial problem of relevancydependency discovery by coupling a weighted kernel classifier to guide the evolution of a probabilistic model of marginal and interacting effects among the problem features. Empirical evidence found in two experiments, one in a genetic epistasis testbed and another in a classification benchmark, indicates comparable performance with related baseline methods while providing richer dependency and relevancy information; in a third experiment comprising a hepatitis dataset, the method findings were corroborated with those reported in recent medical literature.
The promising potential mining capabilities of the method come at the expense of higher computational complexity of the algorithmic and data structures that it involves. In view of the nowadays increasingly availability of high–throughput and stream technologies for data acquisition, natural questions emerge in regards to the applicability of the method in large–scale scenarios. In this respect, we envisage two interesting avenues for further research, the first one related to algorithmic crafting so as to speed up the computation of the kernel function, which might be a bottleneck in such big data scenarios e.g. [14, 27, 50, 51]; the second one is considering compact representations of the probability model enabling memory and time savings during updating of its parameters [23, 52–54].
On a different perspective, the current design of the method is restricted to discrete probabilistic models; therefore modeling continuous distributions with its associated computational challenges, is also of significant interest. Besides, since the probabilistic model assumes a bivariate distribution, the method is prone to miss epistasis due to higher–order interactions, as it was shown in the hardest genetic problem in the empirical study. Thus, future work would also consider addressing this limitation.
As a final word, we also advocate adopting user–friendly visual graphical data–mining tools enabling biomedical analysts to focus on their experiments rather than on improving their low–level programming skills (see [55] for deeper insights on visual programming environments for bioinformatics). Hence, an additional challenge arising is growing and refining the suite of visual software components that currently implements the method.
Declarations
Acknowledgements
Not applicable.
Funding
Not applicable.
Availability of data and materials
The epistatic humanlike gentic datasets are available at: https://github.com/greenelab/modelfreedata(last visit: December 08/2016). The benchmark classifications and hepatitis datasets are available in the UC Irvine Machine Learning Repository (available at: http://archive.ics.uci.edu/ml/, last visit: December 08/2016). The software used for experimentation in this article is publicly available as:
Project name: Goldenberry. (http://goldenberrylabs.org, last visit: December 08/2016).
Version: 2.0
Operating systems: Platform independent
Programming language: Python, Qt
Other requirements: Orange version 2.7
License: Simplified BSD License
Authors’ contributions
SRG conceived the problem and the hybrid dependency–relevancy method, designed the study, interpreted the results, and led the drafting of the manuscript. NR developed the software used to implement the method, conducted the experiments, collected the results, and helped interpreting the results. All authors read and approved the final manuscript.
Authors’ information
NR contribution was done while he was a graduate student at Universidad Distrital FJC, School of Engineering.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Davies S, Russell S. NPcompleteness of searches for smallest possible feature sets. In: AAAI Symposium on Intelligent Relevance. Palo Alto: AAAI Press: 1994. p. 37–9.Google Scholar
 Gheisari S, Meybodi MR, Dehghan M, Ebadzadeh MM. Bayesian network structure training based on a game of learning automata. Intl J Mach Learn Cybernet. 2016; Online first:1–13.Google Scholar
 Aldehim G, Wang W. Determining appropriate approaches for using data in feature selection. Intl J Mach Learn Cybernet. 2016; Online first:1–14.Google Scholar
 BolónCanedo V, SánchezMaroño N, AlonsoBetanzos A, Benítez JM, Herrera F. A review of microarray datasets and applied feature selection methods. Inf Sci. 2014; 282(0):111–35.View ArticleGoogle Scholar
 Szymczak S, Holzinger E, Dasgupta A, Malley JD, Molloy AM, Mills JL, Brody LC, Stambolian D, BaileyWilson JE. r2vim: A new variable selection method for random forests in genomewide association studies. BioData Mining. 2016; 9(1):1–15.View ArticleGoogle Scholar
 Li J, Malley JD, Andrew AS, Karagas MR, Moore JH. Detecting genegene interactions using a permutationbased random forest method. BioData Mining. 2016; 9(1):1–17.View ArticleGoogle Scholar
 Taguchi YH. Principal component analysis based unsupervised feature extraction applied to budding yeast temporally periodic gene expression. BioData Mining. 2016; 9(1):1–23.View ArticleGoogle Scholar
 Guyon I, Nikravesh M, Gunn S, Zadeh LA. Feature Extraction: Foundations and Applications. Berlin Heidelberg: Springer; 2006.View ArticleGoogle Scholar
 Armañanzas R, Inza I, Santana R, Saeys Y, Flores JL, Lozano JA, Peer Y. V. d., Blanco R, Robles V, Bielza C, Larrañaga P. A review of estimation of distribution algorithms in bioinformatics. BioData Mining. 2008; 1(1):1–12.View ArticleGoogle Scholar
 Su C, Andrew A, Karagas MR, Borsuk ME. Using bayesian networks to discover relations between genes, environment, and disease. BioData Mining. 2013; 6(1):1–21.View ArticleGoogle Scholar
 MotsingerReif AA, Deodhar S, Winham SJ, Hardison NE. Grammatical evolution decision trees for detecting genegene interactions. BioData Mining. 2010; 3(1):1–15.View ArticleGoogle Scholar
 ShaweTaylor J, Cristianini N. Kernel Methods for Pattern Analysis. New York: Cambridge University Press; 2004.View ArticleGoogle Scholar
 Larrañaga P, Lozano JA. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Boston: Kluwer Academic Publishers; 2002.View ArticleGoogle Scholar
 RojasGaleano S, Hsieh E, Agranoff D, Krishna S, FernandezReyes D. Estimation of relevant variables on highdimensional biological patterns using iterated weighted kernel functions. PLoS ONE. 2008; 3(3):1806.View ArticleGoogle Scholar
 Rojas S, FernandezReyes D. Adapting multiple kernel parameters for support vector machines using genetic algorithms. In: Proceedings of the 2005 IEEE Congress on Evolutionary Computation. Edinburgh: IEEE: 2005.Google Scholar
 Muhlenbein H, Paag G. From recombination of genes to the estimation of distributions: I. binary parameters In: Voigt HM, Ebeling W, Rechenberg I, Schwefel HP, editors. Parallel Problem Solving from Nature PPSN IV. Lecture Notes in Computer Science. Berlin Heidelberg: Springer: 1996. p. 178–87.Google Scholar
 Bonet JSD, Isbell CL, Viola P. Mimic: Finding optima by estimating probability densities. In: Advances in Neural Information Processing Systems. Cambridge: The MIT Press: 1997. p. 424.Google Scholar
 Sebag M, Ducoulombier A. Extending populationbased incremental learning to continuous search spaces. In: International Conference on Parallel Problem Solving from Nature. Springer: 1998. p. 418–427.Google Scholar
 Pelikan M, Müehlenbein H. The bivariate marginal distribution algorithm In: Roy R, Furuhashi T, Chawdhry P, editors. Advances in Soft Computing. London: Springer: 1999. p. 521–35.Google Scholar
 Larrañaga P, Etxeberria R, Lozano JA, Peña JM. Combinatorial optimization by learning and simulation of bayesian networks. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. Burlington: Morgan Kaufmann: 2000. p. 343–52.Google Scholar
 Pelikan M, Goldberg DE. Hierarchical bayesian optimization algorithm = bayesian optimization algorithm + niching + local structures. Burlington: Morgan Kaufmann: 2001. p. 525–32.Google Scholar
 Nakao M, Hiroyasu T, Miki M, Yokouchi H, Yoshimi M. Realcoded estimation of distribution algorithm by using probabilistic models with multiple learning rates. Procedia Comput Sci. 2011; 4:1244–51.View ArticleGoogle Scholar
 Neri F, Iacca G, Mininno E. Compact optimization. In: Handbook of Optimization. Berlin Heidelberg: Springer: 2013. p. 337–64.Google Scholar
 Aizerman MA, Braverman EA, Rozonoer L. Theoretical foundations of the potential function method in pattern recognition learning. Autom Remote Control. 1964; 25:821–37.Google Scholar
 Cortes C, Vapnik V. Supportvector networks. Mach Learn. 1995; 20:273–97.Google Scholar
 Chapelle O, Vapnik V, Bousquet O, Mukherjee S. Choosing multiple parameters for Support Vector Machines. Mach Learn. 2002; 46:131–59.View ArticleGoogle Scholar
 Tan M, Tsang IW, Wang L. Towards ultrahigh dimensional feature selection for big data. J Mach Learn Res. 2014; 15(1):1371–429.Google Scholar
 Inza I, Larrañaga P, Etxeberria R, Sierra B. Feature subset selection by bayesian networkbased optimization. Artif Intell. 2000; 123(1–2):157–84.View ArticleGoogle Scholar
 Kononenko I, Šimec E, RobnikŠikonja M. Overcoming the myopia of inductive learning algorithms with relieff. Applied Intell. 1997; 7(1):39–55. doi:10.1023/A:1008280620621.View ArticleGoogle Scholar
 RobnikŠikonja M, Kononenko I. Theoretical and empirical analysis of relieff and rrelieff. Mach Learn. 2003; 53(12):23–69.View ArticleGoogle Scholar
 Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of maxdependency, maxrelevance, and minredundancy. IEEE Trans Pattern Anal Mach Intell. 2005; 27:1226–38.PubMedView ArticleGoogle Scholar
 Perlich C, Rosset S. Identifying bundles of product options using mutual information clustering. In: SDM. Philadelphia: SIAM: 2007. p. 390–7.Google Scholar
 Demšar J, Curk T, Erjavec A, Hočevar T, Milutinovič M, Možina M, Polajnar M, Toplak M, Starič A, Štajdohar M, Umek L, žagar L, žbontar J, žitnik M, Zupan B. Orange: Data mining toolbox in python. J Mach Learn Res. 2013; 14:2349–353.Google Scholar
 GarzónRodriguez LP, Diosa HA, RojasGaleano S. Deconstructing GAs into Visual Software Components. In: Proceedings of the Companion Publication of the 2015 on Genetic and Evolutionary Computation Conference. GECCO Companion ’15. New York: ACM: 2015. p. 1125–32.Google Scholar
 Himmelstein DS, Greene CS, Moore JH. Evolving hard problems: Generating human genetics datasets with a complex etiology. BioData Mining. 2011; 4(1):21.PubMedPubMed CentralView ArticleGoogle Scholar
 Bache K, Lichman M. The UCI repository of Machine Learning databases. 2013.Google Scholar
 Diaconis P, Efron B. Computer intensive methods in statistics. Sci Am. 1983; 248(5):116–31.View ArticleGoogle Scholar
 Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to Algorithms, 3rd edn. Cambridge: MIT Press; 2009.Google Scholar
 Rosen HR. Chronic hepatitis C infection. N Engl J Med. 2011; 364(25):2429–38.PubMedView ArticleGoogle Scholar
 Ryder S, Beckingham I. ABC of diseases of liver, pancreas, and biliary system: acute hepatitis. BMJ Br Med J. 2001; 322(7279):151.View ArticleGoogle Scholar
 Kuroda H, Kakisaka K, Oikawa T, Onodera M, Miyamoto Y, Sawara K, Endo R, Suzuki K, Takikawa Y. Liver stiffness measured by acoustic radiation force impulse elastography reflects the severity of liver damage and prognosis in patients with acute liver failure. Hepatology Res. 2015; 45(5):571–7.View ArticleGoogle Scholar
 Younis I, Sarwar S, Butt Z, Tanveer S, Qaadir A, Jadoon NA. Clinical characteristics, predictors, and survival among patients with hepatopulmonary syndrome. Ann Hepatol. 2015; 1(14):354—360.Google Scholar
 Chakradhar G, Sudheer D, Rajeswari G, Sriram K, Siva J, Kumar L, Reddy S. Study of pentoxifylline role on prognosis in patients with acute alcoholic hepatitis. J Evol Med Dental Sci. 2015; 4(07):1098–1111.View ArticleGoogle Scholar
 Bernal W, Lee WM, Wendon J, Larsen FS, Williams R. Acute liver failure: A curable disease by 2024?J Hepatol. 2015; 62(1):112–20.View ArticleGoogle Scholar
 Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci. 2001; 16(3):199–231.View ArticleGoogle Scholar
 Tuv E, Borisov A, Runger G, Torkkola K. Feature selection with ensembles, artificial variables, and redundancy elimination. J Mach Learn Res. 2009; 10:1341–66.Google Scholar
 Tsanas A, Little M, McSharry P. A methodology for the analysis of medical data. In: Handbook of Systems and Complexity in Health. New York: Springer: 2013. p. 113–25.Google Scholar
 Subbulakshmi CV, Deepa SN. Medical dataset classification: A machine learning paradigm integrating particle swarm optimization with extreme learning machine classifier. Sci World J. 2015; 2015:1–12.View ArticleGoogle Scholar
 Yildirim P. Filter based feature selection methods for prediction of risks in hepatitis disease. Int J Mach Learn Comput. 2015; 5(4):258.View ArticleGoogle Scholar
 Herbster M. Learning additive models online with fast evaluating kernels. In: Proceedings of the 14th Annual Conference on Computational Learning Theory and and 5th European Conference on Computational Learning Theory. COLT ’01/EuroCOLT ’01. London: Springer: 2001. p. 444–60.Google Scholar
 Maji S, Berg A, Malik J. Efficient classification for additive kernel SVMs. Pattern Anal Mach Intell IEEE Trans. 2013; 35(1):66–77.View ArticleGoogle Scholar
 RojasGaleano S, Rodriguez N. A memory efficient and continuousvalued compact EDA for large scale problems. In: Proceedings of GECCO 2012. New York: ACM: 2012. p. 281–8.Google Scholar
 Iturriaga S, Nesmachnow S. Solving very large optimization problems (up to one billion variables) with a parallel evolutionary algorithm in CPU and GPU. In: Seventh International Conference on P2P, Parallel, Grid, Cloud and Internet Computing. Victoria: IEEE: 2012. p. 267–72.Google Scholar
 Dong W, Chen T, Tino P, Yao X. Scaling up estimation of distribution algorithms for continuous optimization. Evol Comput IEEE Trans. 2013; 17(6):797–822.View ArticleGoogle Scholar
 Milicchio F, Rose R, Bian J, Min J, Prosperi M. Visual programming for nextgeneration sequencing data analytics. BioData Mining. 2016; 9(1):1–17.View ArticleGoogle Scholar