Discovering feature relevancy and dependency by kernel-guided probabilistic model-building evolution

Background
Discovering relevant features (biomarkers) that discriminate etiologies of a disease is useful to provide biomedical researchers with candidate targets for further laboratory experimentation while saving costs; dependencies among biomarkers may convey additional valuable information, for example, to characterise complex epistatic relationships in genetic data. The use of classifiers to guide the search for biomarkers (the so-called wrapper approach) has been widely studied. However, simultaneously searching for relevancy and dependencies among markers is less explored ground.

Results
We propose a new wrapper method that builds upon the discrimination power of a weighted kernel classifier to guide the search for a probabilistic model of simultaneous marginal and interacting effects. The feasibility of the method was evaluated in three empirical studies. The first assessed its ability to discover complex epistatic effects on a large-scale testbed of generated human genetic problems; the method succeeded in 4 out of 5 of these problems while providing more accurate and expressive results than a baseline technique that also considers dependencies. The second study evaluated the performance of the method on benchmark classification tasks; on average, the prediction accuracy was comparable to that of two other baseline techniques whilst finding smaller subsets of relevant features. The last study aimed at discovering relevancy/dependency in a hepatitis dataset; in this regard, evidence recently reported in the medical literature corroborated our findings. As a byproduct, the method was implemented and made freely available as a toolbox of software components deployed within an existing visual data-mining workbench.

Conclusions
The mining advantages exhibited by the method come at the expense of a higher computational complexity, posing interesting algorithmic challenges regarding its applicability to large-scale datasets. Extending the probabilistic assumptions of the method to continuous distributions and higher-degree interactions is also appealing. As a final remark, we advocate broadening the use of visual graphical software tools, as they enable biodata researchers to focus on experiment design, visualisation and data analysis rather than on refining their scripting skills.

Electronic supplementary material
The online version of this article (doi:10.1186/s13040-017-0131-y) contains supplementary material, which is available to authorized users.


METHODOLOGY
Discovering feature relevancy and dependency simultaneously using probabilistic model-building evolution and kernel machines
Nestor Rodriguez and Sergio Rojas-Galeano*
*Correspondence: srojas@udistrital.edu.co. Full list of author information is available at the end of the article.

Additional Methods and Tools
Kiedra algorithm
The specification of Kiedra is shown in Algorithm 2. The algorithm adheres to the generic EDA template (Algorithm 1), with tailored sampling, selection and estimation steps so as to simultaneously discover relevancy and dependency. Its inputs are a labeled dataset and a kernel classifier instance with a parameterised kernel function (the parameter p stands for the width σ of an RBF kernel or the degree d of a polynomial kernel; these options can be tuned empirically on the input dataset before executing the algorithm). To begin with, the parameters of the probability model are initialised assuming an independent joint distribution (line 1) and the first pool of candidates is sampled from this distribution (line 2). Subsequent candidate pools are obtained by mixing samples drawn from a bivariate binomial model with the currently estimated parameters θ (according to Equation (??)) and the most promising candidates kept from previous iterations (line 4). Then, roughly speaking, the algorithm iterates between evaluating the candidates and re-estimating the probability model until convergence.
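For illustration, the generic EDA template (Algorithm 1) that Kiedra builds on can be sketched in NumPy as below. This is a minimal sketch under simplifying assumptions: it uses an independent Bernoulli model (as in UMDA) and a toy fitness function, and all names (eda_search, elite_frac, etc.) are hypothetical. Kiedra itself replaces the independent model with the bivariate model of Algorithm 2 and the toy fitness with the wrapper classification accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)

def eda_search(fitness, n_vars, pool_size=40, elite_frac=0.25, iters=50):
    """Generic EDA loop: sample candidates from an independent Bernoulli
    model, keep the most promising ones, and re-estimate the model."""
    theta = np.full(n_vars, 0.5)            # independent joint distribution (line 1)
    elite = np.empty((0, n_vars), dtype=int)
    for _ in range(iters):
        # mix fresh samples with elite candidates kept from previous iterations
        pool = (rng.random((pool_size, n_vars)) < theta).astype(int)
        pool = np.vstack([pool, elite])
        scores = np.array([fitness(c) for c in pool])
        k = max(1, int(elite_frac * pool_size))
        elite = pool[np.argsort(scores)[::-1][:k]]   # selection of best candidates
        theta = elite.mean(axis=0)                   # re-estimate marginal probabilities
        if np.all((theta < 0.05) | (theta > 0.95)):
            break                                    # model has (nearly) converged
    return theta, elite[0]

# toy fitness: reward candidates that match a known target bit pattern
target = np.array([1, 1, 1, 0, 0, 0, 0, 0])
theta, best = eda_search(lambda c: -np.abs(c - target).sum(), n_vars=8)
```

On this separable toy fitness the marginals θ quickly concentrate on the target bit pattern; the interesting behaviour of Kiedra arises when the fitness couples variables, which is what the bivariate model is meant to capture.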

Algorithm 2: Kiedra
Input: labeled dataset, algorithm A, kernel function κ_p(·, ·), pool size n
1: θ ← initialise(n)
2: B ← sample(P(X; θ), n/2)
3: repeat until θ converges
   ⋮

The inner loop in line 8 is the core element of the algorithm: it evaluates the classification performance of the candidates, guiding the search for relevant variable subsets. This loop iterates over the pool S, plugging each candidate into the original dataset as a vector of scale factors. The scaled dataset is then fed to the kernel function κ_p(·, ·) according to Equation (??), and the resulting kernel evaluations are used to train the kernel classifier A to learn the prediction rule of Equation (??). The fitness of each candidate is obtained as a 5-fold cross-validation estimate of the classification accuracy of A.
The remaining steps execute the iterative updating of the dependency network and relevancy parameters.
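The candidate-evaluation step described above can be sketched as follows. This is a simplified stand-in, not the actual implementation: the candidate bits scale the dataset columns, an RBF kernel is computed on the scaled data, and a nearest-neighbour decision in kernel space substitutes for the SVM classifier A; the 5-fold accuracy estimate follows the scheme described in the text. Function and parameter names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # kappa_sigma(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def candidate_fitness(candidate, X, y, sigma=1.0, folds=5, seed=0):
    """Scale the dataset columns by the candidate bit vector, then estimate
    classification accuracy by m-fold cross-validation of a 1-NN rule in
    kernel space (a stand-in for the SVM used by the actual method)."""
    Xs = X * candidate                        # candidate entries act as scale factors
    idx = np.random.default_rng(seed).permutation(len(y))
    accs = []
    for f in range(folds):
        test = idx[f::folds]                  # held-out fold
        train = np.setdiff1d(idx, test)
        K = rbf_kernel(Xs[test], Xs[train], sigma)
        pred = y[train][K.argmax(axis=1)]     # nearest neighbour = largest kernel value
        accs.append((pred == y[test]).mean())
    return float(np.mean(accs))

# toy data: only the first feature carries the class signal
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)
acc_relevant = candidate_fitness(np.array([1, 0, 0, 0]), X, y)
acc_all = candidate_fitness(np.ones(4), X, y)
```

On such data a candidate that keeps only the informative column should score at least as well as the full feature set, which is exactly the signal the wrapper search exploits.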

Goldenberry suite of visual components
We used the Orange and Goldenberry [1] visual programming tools for data mining [1, 2, 3]. Orange provides a canvas where visual software components known as widgets can be wired together to execute the stages of a data-mining task. Goldenberry is an add-on suite of widgets for stochastic search techniques (EDAs, genetic algorithms) and kernel classification machines (Perceptron, SVM); we extended this suite with new components to support Kiedra (e.g. BMDA and FeatureSubset). The Kiedra widgets were developed using the Python language and the NumPy library for the logical modules, and the PyQt library for the graphical user interface. Additionally, their implementation employs multi-threading to speed up the main loop of the algorithm (the fitness evaluation, which involves cross-validating the classifier obtained with the relevance factors of each candidate from the pool S, is run using multiple threads). A snapshot of the canvas and widgets used to perform the experiments can be seen in Figure 5.
The visual program that implements wKiera and Kiedra is shown in Figure 4. The key element is the WrapperCostFunction widget, which is wired to two input components: a Data widget that reads the original dataset files and shuffles them randomly to provide training and testing subsets, and a LearnerFactory widget, which instantiates a kernel classifier configured with an appropriately tuned kernel function. In our experiments we chose the refined version of the SVM widget provided by Goldenberry, since it allows the definition of customised kernel functions and the creation of multiple classifier instances that can be executed in parallel. Other parameters, such as the number of cross-validation folds and the accuracy/subset-size trade-off, are also configured in the WrapperCostFunction widget, which takes a BMDA optimiser (top of the figure) to perform the search and estimation of relevancy and dependency following Algorithm 2. Note that wKiera can be implemented almost identically, as it only requires replacing the optimiser with the UMDA widget (bottom of the figure); recall that the latter assumes an independent probability model and thus provides information only about the relevancy of the input variables. Functionality descriptions of these widgets are given in Table 1.

Table 1 List of Orange and Goldenberry widgets used in the experiments.

File
Load input data from a text file.

Preprocessing
Fill missing values using a Naive Bayes classifier.

Datasampler
Shuffle data and split into training and testing subsets.

SVM
Kernel machine whose kernel parameters are adjusted automatically for the input data by means of a grid-search.

WrapperCostFunction
Trades off the simultaneous optimisation of accuracy vs. variable subset size (a weighted average with a 90% contribution from accuracy). It also coordinates the m-fold cross-validation scheme (m = 5).

BMDA
The modified version of BMDA tailored for Kiedra (see Section ??). The pool size was set to four times the number of input variables.

UMDA
The EDA used in wKiera. The pool size was again set to four times the number of input variables.

BlackBoxTester

The component that executes the experiments and collects statistics (10 repetitions were run per experiment).
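As stated in Table 1, the WrapperCostFunction widget combines accuracy and subset size as a weighted average with a 90% contribution from accuracy. The exact form of the subset-size term is not spelled out in the text; a plausible minimal sketch, with hypothetical names and assuming the size term rewards the fraction of variables left out, is:

```python
def wrapper_cost(accuracy, n_selected, n_total, w_acc=0.9):
    """Hypothetical form of the accuracy vs. subset-size trade-off: a weighted
    average of the cross-validated accuracy (weight 0.9) and the fraction of
    input variables excluded from the candidate subset (weight 0.1)."""
    return w_acc * accuracy + (1 - w_acc) * (1 - n_selected / n_total)

# a perfect classifier using no variables scores 1.0;
# 80% accuracy with half the variables scores 0.9*0.8 + 0.1*0.5 = 0.77
best_possible = wrapper_cost(1.0, 0, 10)
half_subset = wrapper_cost(0.8, 5, 10)
```

The 90/10 weighting biases the search strongly towards accuracy while still breaking ties in favour of smaller variable subsets, which is consistent with the smaller feature subsets reported in the benchmark study.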