- Methodology
- Open access
- Published:

# Unsupervised encoding selection through ensemble pruning for biomedical classification

*BioData Mining*
**volumeÂ 16**, ArticleÂ number:Â 10 (2023)

## Abstract

### Background

Owing to the rising levels of multi-resistant pathogens, antimicrobial peptides, an alternative strategy to classic antibiotics, got more attention. A crucial part is thereby the costly identification and validation. With the ever-growing amount of annotated peptides, researchers leverage artificial intelligence to circumvent the cumbersome, wet-lab-based identification and automate the detection of promising candidates. However, the prediction of a peptideâ€™s function is not limited to antimicrobial efficiency. To date, multiple studies successfully classified additional properties, e.g., antiviral or cell-penetrating effects. In this light, ensemble classifiers are employed aiming to further improve the prediction. Although we recently presented a workflow to significantly diminish the initial encoding choice, an entire unsupervised encoding selection, considering various machine learning models, is still lacking.

### Results

We developed a workflow, automatically selecting encodings and generating classifier ensembles by employing sophisticated pruning methods. We observed that the Pareto frontier pruning is a good method to create encoding ensembles for the datasets at hand. In addition, encodings combined with the Decision Tree classifier as the base model are often superior. However, our results also demonstrate that none of the ensemble building techniques is outstanding for all datasets.

### Conclusion

The workflow conducts multiple pruning methods to evaluate ensemble classifiers composed from a wide range of peptide encodings and base models. Consequently, researchers can use the workflow for unsupervised encoding selection and ensemble creation. Ultimately, the extensible workflow can be used as a plugin for the PEPTIDE REACToR, further establishing it as a versatile tool in the domain.

## Background

Multi-resistant pathogens are a major threat for modern society [1]. In the last decades, a rising number of bacterial species developed mechanisms to elude efficiency to widely used antibiotics [1]. The importance of developing and implementing alternative strategies is further underpinned by a recent study, which detected a certain baseline resistance in European freshwater lakes [2]. The study confirmed resistance specifically against four critical drug classes in human and veterinary health in freshwater, which is typically considered as a pathogen-free environment [2]. Moreover, already concerning levels of antibiotic resistance in Indian and Chinese lakes emphasize the requirement of alternative biocides [3, 4]. One promising approach to replace or even support common antibiotics refers to the deployment of peptides with antimicrobial efficiency [5]. However, identifying and validating active peptides requires intensive, hence, costly and time-consuming wet-lab work. Thus, in the pre-artificial intelligence (AI) era, the manual classification and verification of antimicrobial peptides (AMPs) engaged researchers. Although the *in vitro* confirmation of activity is still necessary, the application of AI, i.e., in particular machine learning (ML) algorithms, simplifies the identification process drastically and pushed several AMPs to the second or third phase of clinical trials [6]. In addition, online databases provide access to thousands of annotated sequences and pave the way automatic peptide design and classification [7]. For instance, Chung et al. (2019) developed a method, which demonstrated good performance on classifying AMPs using a two-step approach, which first predicts efficiency, and afterward the precise target activity [8]. Another study employed a variational autoencoder to encode AMPs, mapped the probability of being active to a latent space, and predicted novel AMPs [9]. Fingerhut et al. (2020) introduced an algorithm to detect AMPs from genomic data [10]. For more information on computational approaches for AMP classification, we refer to the recent review of Aronica et al. (2021) [11].

However, the prediction of amino acid sequence features is not limited to AMPs. In the literature, one can find various applications, e.g., in oncology for predicting anticancer peptides [12], in pharmacology for the discovery and application of cell-penetrating peptides as transporters for molecules [13], or in immunotherapy, for classifying of pro- or antiinflammatory peptides [14, 15]. Other applications include antiviral peptides [16], or peptides with hemolytic [17] or neuro transmitting activity [18].

Unequivocally, the success of ML methods for the prediction of AMPs was enabled by the development and advances of peptide encodings. Encodings are algorithms mapping the amino acid sequences of different lengths to numerical vectors of an equal length, hence, fulfilling the requirement of many ML algorithms [19]. Moreover, peptides or proteins can be described by their primary structure, i.e., the amino acid sequence, and the aggregation in higher dimensions, denoted as the secondary or tertiary structure. Encodings derived from the primary structure are known as sequence-, and encodings describing a higher-order folding are structure-based encodings. To date, a large number of sequence- and structure-based encodings have been introduced and employed in various studies [19]. A significant amount of encodings has been recently acknowledged by another study, specifically benchmarking these by considering multiple biomedical applications [20]. It turned out that most encodings show acceptable performance, partly also beyond single biomedical domains [20]. In addition, SpĂ¤nig et al. (2021) developed a workflow, which can dramatically reduce the number of initial encodings [20]. However, encoding selection is still challenging, and user-friendly approaches are required.

Furthermore, hyperparameter optimization is additionally aggravated by the model choice. Albeit Support Vector Machines (SVM) and Random Forests (RF) are widely employed in peptide classification [11], the variety of models used in a broad range of studies is large. For instance, Khatun et al. (2020) utilized several ML algorithms, including NaĂŻve Bayes, AdaBoost, and a fusion-based ensemble for the prediction of proinflammatory peptides [21]. The fusion-based model outperformed the other ML models significantly for this task [21]. Plisson et al. (2020) employed Decision Trees (DT) and Gradient Boosting (GB), among others, to classify non-hemolytic peptides and demonstrated that the GB ensemble has superior performance [22]. In contrast, Timmons et al. (2020) used Artificial Neural Networks to characterize therapeutic peptides with hemolytic activity [23]. Singh et al. (2021) compared several base classifiers, e.g., Linear Discriminant Analysis and ensemble methods, e.g., GB and Extra Trees to detect AMPs [24]. They demonstrated that the GB performed best [24]. These studies clearly show that ensemble classifiers typically show superior performance than single classifiers, owing to the fact that they can compensate for weaknesses of single encodings and base classifiers [25].

Recently, Chen et al. (2021) introduced a comprehensive tool, which allows less programming experienced researchers to simply select encodings and base or ensemble classifiers through a graphical user interface, allowing easy access to the underlying algorithms [26]. Nevertheless, the approach assumes that the user selects proper settings for the parameterized encodings, which has been previously shown to affect the classification process significantly [20]. Moreover, the encoding selection is independent of the classifier settings, meaning that the tool can set up the classifier automatically; however, the encoding selection is not part of it. Thus, it remains a challenge to pick good encodings and classifiers for a biomedical classification task at hand. To this end, we assessed unsupervised encoding selection and the performance and diversity of multiple ensemble methods. We added different overproduce-and-select techniques for ensemble pruning, facilitating an automatic ensemble generation. In addition, we utilized Decision Trees, Logistic Regression, and NaĂŻve Bayes as base classifiers, owing to their prevalence in the field of biomedical classification [11, 19, 27].

Besides demonstrating the benefit of an unsupervised encoding selection, we also examined how the RF performs as a base and ensemble classifier. Specifically, we examined whether the RF, also an ensemble method, is performance-wise already saturated or whether a subsequent fusion can improve the final predictions. Fusion of RFs has been shown in other studies to improve overall performance, e.g., for HIV tropism predictions [28, 29]. All in all, we complement our recent large-scale study on peptide encodings [20] with an automatic encoding selection and a performance analysis of multiple base and ensemble classifiers. Ultimately, the present research bridges the gap between many peptide encodings and available machine learning models.

## Results

We developed an end-to-end workflow, which automatically generates and assesses classifier ensembles using different pruning methods and a variety of encoded datasets from multiple biomedical domains (see Table 3). Researchers can easily extend the workflow with different base and ensemble classifiers, pruning methods, and encoded datasets. The results can be reviewed using the provided visualizations, and the performance is further revised using multiple statistics. We demonstrate that the Pareto frontier pruning is a valuable technique to generate efficient classifier ensembles. However, the utilized base classifiers show comparable performance. We address the results in more detail in the following. We use the example of the avp_amppred dataset throughout the manuscript. The results for the remaining datasets can be found in the supplement. Moreover, the code is publicly available at https://github.com/spaenigs/ensemble-performance. The workflow produces interactive versions of all charts (see supplements).

### Pruning methods

All pruning methods generate ensembles, i.e., combined encodings, superior to the single classifiers (see Tables 1 and 2). In the case of the Pareto frontier (pfront) pruning, which is mainly ranked among the best pruning methods, we observe a significant (\(p\le 0.001\)) performance improvement compared to the the single classifiers. We also observed that the pfront pruning generates larger ensembles than the convex hull (chull) pruning, which can be visually verified in Fig. 4 (red line). Notably, including the Random Forest (RF) classifier (see Table 1, pfront) does not, or very slightly, affect the ensemble performance without RF (see Table 2), although the single classifier performance is improved with the RF included (see Table 1). Consequently, the RF increases the overall performance of the ensembles generated by the best encodings pruning.

The multi-verse optimization (MVO) suffers from high computational demand, i.e., a long pruning time, however, demonstrates at least comparable performance (see Tables 1, 2 and 3). The fitness of the models increases rapidly in the first generations (see Fig. 1). However, towards the last generations the curve flattens. The RF-based models show already higher fitness in the beginning, only increasing slightly in the course of the generations. The robustness of the RF classifier is also highlighed by the low standard deviation across the folds (see Fig. 1).

### Ensemble classifiers

The ensemble performance mainly depends on the pruning and the choice of the base classifiers; hence, using individual encodings. The performance between the best, random (rand), and chull pruning is insignificant, which stresses the effectiveness of the Pareto frontier (pfront) pruning (see Fig. 2). Furthermore, no significant difference can be observed for ensembles with the same base classifiers, e.g., the RF or Decision Tree (DT). Thus, the fusion method impacts the overall performance only slightly. However, various base classifiers result in significant different ensembles, i.e., employing, for instance, the RF, generates significantly different ensembles compared to the application of other base classifiers (see Fig. 2).

Moreover, we observed that the NaĂŻve Bayes (NB), the Logistic Regression (LR), and the Multi-Layer Perceptron (MLP) classifiers result in ensembles with higher variance (see Fig. 4). In contrast, the area covered by RF and DT models is more compact. Therefore, the variables, i.e., diversity and the pairwise error, are revised by a multivariate analysis of variance (MANOVA), which revealed a significant difference (\(p<0.001\)). A separate examination of the variables utilizing variance analysis (ANOVA) followed by a post-hoc analysis using Tukeyâ€™s HSD, demonstrates that all variables are significantly different (\(p<0.001\)). Finally, we conducted an ANOVA on the particular area values, which disproves the initial observation, i.e., all areas are significantly different (\(p<0.001\)). However, considering the average values for all datasets, the DT and RF are commonly ranked as the base classifiers with low variance (see Table 4).

### Single classifiers

In general, the performance of the individual classifiers (single) is lower compared to the classifier ensembles (see Fig. 3). In addition, we noticed that the RF is relatively saturated, i.e., using the RF as a single classifier and as a base model for ensembles does not have a significant effect on performance improvement. The low-performance variance is in line with the observation that weak models benefit most from ensemble learning; however, RFs are ensemble models [33, 34]. In contrast, the performance of other single classifiers revealed more distinct differences to the ensembles (see Fig. 3 and Tables 1 and 2).

### Data visualization

We leveraged two standard visualization techniques, which we adapted and extended for our particular application. First, we enhanced the kappa-error diagram [25] for the presentation of multiple folds, i.e., 100 in the current study, by aggregating the cross-validation results into a two-dimensional histogram (see Fig. 4). The color code allows the viewer to spot the peak at one glance. Hence, the tendency of ensembles to use a specific base classifier. Moreover, considering the distribution of the variables, one can make conclusions about the robustness.

Second, we extended the critical difference (CD) chart [35] with a categorical heatmap displaying the actual performance. The extension enables viewers to compare classifiers and review the individual encoding performance, i.e., Matthews correlation coefficient in the present case, at one glance. In addition, the thickness of the vertical and horizontal rules is directly related to the critical difference, i.e., the thicker the rule, the closer the classifiers to the critical difference. Thus, the rule thickness provides an additional visual channel to access the CD.

## Discussion

We developed a workflow for unsupervised encoding selection and performance assessment of multiple ensembles and base classifiers. Thus, we implemented and compared several algorithms to facilitate ensemble pruning, including convex hull, Pareto frontier pruning, and multi-verse optimization (MVO). Our results demonstrate that the crucial factors are the base classifiers and the individual encodings. The ensemble technique was not relevant, i.e., we could not observe performance variations using one of hard or soft voting or stacking. In general, applying the Random Forest (RF) or Multi-Layer Perceptron (MLP) as a base classifier yielded good performance across all datasets. The Pareto frontier pruning selected suitable encodings throughout the experiments. In addition, we observed similar performance as reported in the original studies (see Tables 1 and 3).

However, since we used one encoding per base classifier, we restricted the employed ensemble methods, i.e., majority voting, averaging, and stacking, which do not modify the base classifiers. These ensemble types are in contrast to others, e.g., boosting, where weights are adapted for misclassified training instances in base classifiers [36]. More research is necessary to investigate how performance and more sophisticated ensemble methods are associated. The employed ensemble types are also the reason for the kappa-error point cloud shape solely depending on the base classifiers. Consequently, computing the kappa-error diagram for all ensemble methods was not necessary. Our encoding/classifier approach is also contrary to other studies, e.g., [12, 14], or [16], which concatenated several encoded datasets to one final dataset (hybrid model) and applied feature selection before training. In the present study, we solely scaled the datasets to standardize the feature range, and used the encoded datasets largely unprocessed, potentially affecting the final performance. Furthermore, we neglected hyperparameter optimization of the base classifiers and focused on encoding, pruning and model selection. In this respect, more research is necessary to investigate the impact of hyperparameter optimization of the base models on the ensemble performance. However, since our tool determines important components of effective ensembles, researchers can subsequently improve the hyperparamters, if necessary.

In general, utilizing the Pareto frontier pruning generates good ensembles; however, requiring the calculation of the Cartesian product of all base classifiers; thus, encodings. Although only the (lower) triangular matrix is necessary, the computation is still CPU-intensive.

Another point of concern refers to the selection of exactly 15 classifier pairs per fold for the random and best encoding ensembles. In contrast, the size of classifier the Pareto frontier and convex hull pruning ensembles depend on the number of classifier pairs on the Pareto front and convex hull, respectively. Thus, a fair comparison is difficult. However, we repeated the experiments multiple times using a 100-fold MCCV, which rules out any selection bias. Considering the performance between individual models and ensembles, the contribution of the diversity is only slight, however, more research is required in this direction.

The results of the MVO also acknowledge the impact of diversity. One can observe that the MVO generates inferior ensembles (see Fig. 3). However, due to the limited number of generations, the MVO could be stuck in local optimum (see Fig. 1). More research is also required to examine specific advantages and disadvantages of the MVO for encoding selection [37].

In the kappa-error chart, which depicts preferable classifier pairs towards the lower-left corner, one can readily recognize the inferiority of the MVO (see Fig. 4). The classifier pairs are distributed across the kappa-error area, i.e., the MVO screens the entire solution space and adds weak classifiers to the final ensemble. Nevertheless, since we limited the maximum number of generations to 15, we cannot rule out that more generations would yield better results. Moreover, due to high resource consumption, we limited the MVO to 5 folds, which hampers the comparison.

Moreover, the Random Forest (RF) deployment as a single classifier reveals good performance, which is not surprising since it is already an ensemble algorithm. In this respect, the other base classifiers are less accurate (see Fig. 3). However, it could be demonstrated that RFs as base classifiers, i.e., using different encoded datasets per model, slightly improves the performance. This further highlights the importance of different encodings, hence the projection of different biological aspects, for the classification process.

The implemented methods demonstrate usability on a broad range of datasets from various biomedical domains. We incorporated the MVO owing to its good performance on several benchmark datasets [38]. The comprehensive Monte Carlo cross-validation copes with the variance, ultimately increasing the robustness of the results. In addition, the Pareto frontier and convex hull pruning consider simultaneously the performance and the diversity of encodings and base classifiers; hence, compensating their strength and weaknesses and revealing their potential for ensembles [39]. Our proposed extension to the critical difference chart allows the viewer at one glance to grasp significant, i.e., critical, performance differences of encodings, models, and pruning methods jointly with the actual performance.

## Conclusions

In summary, we employed two overproduce-and-select methods, namely Pareto frontier and convex hull pruning, as well as the multi-verse optimizer for exhaustively searching the encoding/base classifier space. We employed Logistic Regression, Decision Trees, NaĂŻve Bayes, Multi-Layer Perceptron, and Random Forest as base models and majority vote, averaging, and stacked generalization for the fusion. The experiments and visualizations enable the comparison of the respective components; however, further research is necessary to examine other ensemble classifiers, e.g., boosting. We considered ten datasets as case studies.

All in all, we propose an extensible workflow for automated encoding selection through diverse ensemble pruning methods. Depending on the use-case, i.e., the initial encoding space, our workflow can be used stand-alone, or as an extension for our recent work [20], ultimately easing the access for non-technical users. In case of the avp_amppred dataset, we observed good performance using the Pareto frontier pruning. Our workflow determines the best encodings, base models, as well as pruning, and fusion method for the classification task at hand. This allows researchers to built upon an effective model and focus on hyperparameter optimization afterward.

## Methods

We developed a workflow using Snakemake v6.5.1 [40], Python v3.9.1, and R v4.1.0. For the machine learning algorithms, we employed scikit-learn v0.24.2 [41]. We will use the following definitions throughout the manuscript: the original unprocessed dataset is denoted as the dataset. One dataset can be encoded in manifold ways, which we refer to as encoded datasets. Finally, encodings specify particular encoding algorithms.

The peptide datasets are taken from the PEPTIDE REACToR [20]. We also utilized this tool to reduce the initial set of encoded datasets. Ultimately, the goal of our workflow is the selection of the best encodings from the reduced set of encoded datasets. Note that best encodings refers to datasets, encoded with specific, i.e., effective, encoding algorithms. Afterward, the encoded datasets are used to train the base classifiers.

There are two approaches to harness multiple encodings in a single model, namely the fusion and the hybrid model [21]. Fusion models train one encoding per base classifier and fuse the output for the final prediction. Contrary, hybrid models use the concatenated features of multiple encodings for single model training. The concatenation approach is particularly problematic for entropy-based models such as DT or RF due to the bias in variable selection. Thus, in the present study, we implemented the fusion design, i.e., each ensemble consists of an arbitrary amount of base classifiers using one particular encoding, respectively. Finally, the employed datasets from a wide range of biomedical domains ensure broad applicability and the robustness of our results.

The workflow conducts the following steps. First, train, validation, and test indices are calculated to ensure equal samples for the cross-validation. More precisely, we computed the indices before the experiments such that all base and ensemble models use the same train, validation, and test data in each fold. Second, we standardized the encoded datasets using a min-max normalization between 0 and 1. Afterward, we trained and assessed models for all encoded datasets and ensemble types using a 100-fold Monte Carlo cross-validation. Besides the pruning algorithms, we selected the best and random encodings and compare the results to individual models. Finally, we statistically assessed and visualized the results (see Fig. 5). Significant steps are described in more detail below.

Note that we used Matthews correlation coefficient (MCC) throughout the study to handle the imbalance in the datasets [42]:

*a* is the number of true positives, *d* is the number of true negatives, *b* is the number of false negatives, and *c* is the number of false positives.

### Datasets

For a comprehensive analysis on peptide encodings, SpĂ¤nig et al. (2021) gathered and encoded a variety of datasets from multiple biomedical domains [20]. Details about the dataset preprocessing and encoding algorithms can be found in this publication [20]. We selected datasets with low to medium classification performance from this collection, i.e., a reported MCC of \(0.63\pm 0.15\) on the independent test set; additionally, covering diverse biomedical applications. Moreover, we excluded datasets for which accurate models have been published to investigate the potential effects of different classifiers and ensembles. We limited our study to ten datasets to cope with the computational complexity. The dataset size ranges from 492 to 3,228 sequences with an average of \(1,580.8\pm 812.1\) sequences. The datasets comprise 15,782 sequences with a mean length of \(21.17\pm 13.23\) amino acids. 6,404 sequences belong to the positive and 9,378 to the negative class. The average sequence length is \(22.47\pm 15.88\) and \(20.29\pm 10.97\), respectively. Duplicated sequences have been removed. Refer to Table 3Â for more details.

### Monte Carlo cross-validation

We applied a 100-fold Monte Carlo cross-validation (MCCV) [43]. The MCCV improves the generalization and diminishes the variance of the results, i.e., results are more robust, hence comparable. In addition, we ensured that the *n*-th fold is identical across all experiments leading to improved comparability across all base classifiers and ensembles. Each fold is composed of one split using 80 % of the data for training, 10 % for validation, and the remaining 10 % for testing (see Fig. 6). In contrast to *k*-fold cross-validation, MCCV follows a sampling with replacement strategy, i.e., splits can contain identical samples multiple times. However, duplicate samples do not occur in the train, validation, and the test split [43].

### Base classifiers

We used the following base classifiers for our experiments: NaĂŻve Bayes, Logistic Regression, Decision Tree, Multi-Layer Perceptron, and Random Forest. Each classifier will be briefly described hereinafter. We used the implementations provided by the scikit-learn library [41]. We utilized the default hyperparameters.

#### NaĂŻve Bayes

The NaĂŻve Bayes (NB) classifier (naively) assumes conditional independence of the feature vectors and applies the Bayes theorem for the prediction [25]. Model training is enabled via a probability density function (PDF) and the prior probability of a given class. For simplicity, we assume a Gaussian distribution of the features. Hence, we applied the Gaussian NB using

as the PDF, whereby \(\sigma\) denotes the standard deviation and \(\mu\) the mean of features *x* given a class *y* [44].

#### Logistic Regression

The binary Logistic Regression (LR) is another probability-based classifier, i.e., it derives the probability of a class *y* given a feature vector *x* [45]. The LR predicts probabilities between 0 and 1 using the logistic function denoted as

and the maximum likelihood function to estimate the coefficients \(\beta\), i.e., to train the model [45].

#### Decision Tree

The Decision Tree (DT) classifier, precisely the CART (Classification And Regression Trees) implementation, is a tree-based model, i.e., a tree structure is generated during training [46]. Each node is based on the most discriminating feature [25]. New splits are created based on the Gini impurity *i* of the remaining data:

where \(j\in \{0,1\}\) for binary classification and *P* is the probability of class *j* at a node *t* [25]. If a split is pure enough, a leaf node is added. Otherwise, intermediate nodes are created [25]. For prediction, the tree is traced until a leaf node, which states the final class.

#### Multi-Layer Perceptron

The Multi-Layer Perceptron (MLP) follows the functioning of neural networks in the human brain [25]. In particular, the MLP is an artificial neural network consisting of multiple layers of neurons, which are connected with a certain intensity, i.e., are weighted. Based on the input and the activation function of the neurons, the MLP propagate the input data forward and finally assigns the class in the last layer. To train the MLP, the data is backward propagated and the weights are updated accordingly to minimize the training error [25].

#### Random Forest

The Random Forest (RF) classifier is an ensemble learning technique, which trains multiple DTs on random samples, i.e., bagging, of the input data [47]. For the final classification, the majority vote of the trees is used [47]. Note that we use the RF as a base learner, which allows comparing the performance with DTs and the actual ensembles techniques in general (see below).

### Classifier ensembles

To combine the output of the base classifiers introduced above, we employed the following ensemble methods: majority vote (hard voting), averaging (soft voting), and stacked generalization (stacking). In the present study, each base classifier is trained on one encoded dataset, meaning if for one dataset *n* encodings are selected, the size of one ensemble is *n*. We adapted the implementations of the scikit-learn library [41], such that not only one dataset but several encoded datasets can be used for training. For instance, if one passes *n* encoded datasets, the ensemble consists of *n* base classifiers trained on one particular encoded dataset, respectively.

### Majority voting

The majority voting ensemble (hard voting) combines the output by ultimately assigning the class, which has been predicted by the majority of the single base classifiers. We employed the customized version of scikit-learnâ€™s VotingClassifier class with hard voting enabled.

### Averaging

The averaging method (soft voting) computes the means of the predicted class probabilities per base classifier. The maximum value determines the final class. We used the adjusted VotingClassifier with voting set to soft.

### Stacked generalization

The stacking approach utilizes the output of the base classifiers to train a meta-model, i.e., the predicted class probabilities of the base classifiers are used as features [48]. We adapted the StackingClassifier from the scikit-learn package and employed Logistic Regression as the meta-model.

### Ensemble pruning

Selecting the correct number of base classifiers in an ensemble is challenging. Thus, Kuncheva (2014) suggests several approaches to determine the ensemble size [25]. For instance, sequential forward selection, adding one classifier successively, in case the additional model improves the ensemble performance [25]. However, in the present case, we are dealing with potentially hundreds of encoded datasets, for which this particular technique is not practical. To this end, we used two selection methods, namely convex hull and Pareto frontier pruning, which handle the complexity better [25]. Moreover, we implemented the multi-verse optimization algorithm as an additional encoding selection technique [49]. Finally, we employed best and random encodings selection as a reference. The pruning methods are described more detailed in the following.

#### Kappa-error diagram

The kappa-error diagram, introduced by Margineantu and Dietterich (1997), is the basis for the convex hull and Pareto frontier pruning [50]. The graph represents pairs of classifiers by their average error and diversity, as shown in Fig. 4. The diversity measures the agreement of classifier outputs, i.e., the better the agreement of the classifier predictions, the less the diversity [25]. Specifically, the kappa diversity is denoted as

The \(\kappa\) statistic ranges from \(-1\) to 1, whereby \(\kappa =1\) denotes perfect agreement, \(\kappa =0\) random, and \(\kappa <0\) worse than random consensus [50]. The error is calculated using

with the subtrahend being the accuracy. However, Kuncheva (2013) pointed out that diversity concerning the average error can not be arbitrarily low [39]. In fact, desirable classifier pairs approximate the lower-left corner (see Fig. 4), i.e., approximating a theoretical boundary, which is defined in Eq. 7 [39].

Note that the classifier pairs are composed using the lower triangular matrix of the Cartesian product. Afterward, the pruning methods select a subset of pairs, also likely include duplicated base classifiers. Thus, all pruning methods ensure that the final ensemble only uses unique classifiers. Hence, base classifiers are trained on individual encoded datasets.

#### Convex hull

The kappa-error diagram depicts a set of points, i.e., pairs of base classifiers, in a two-dimensional space. The kappa diversity is the first, and the pairwise average error is the second dimension. We employed the Quickhull algorithm to calculate the convex hull [51]. Hence, the smallest convex set that contains the classifier pairs [51]. Thus, no further classifier pairs exist beyond the convex hull. We utilized the implementation of the Quickhull algorithm provided by the SciPy package in the ConvexHull module [52].

Since we are only interested in the partial convex hull, that is, pairs approaching the theoretical boundary defined in Eq. 7 and depicted in Fig. 4, we adapted the *pareto_n* algorithm from Kuncheva (2014), which returns only classifier pairs fulfilling the criteria [25].

#### Pareto frontier

The Pareto optimality describes the compromise of multiple properties towards optimizing a single objective [53]. For instance, a pair of classifiers is Pareto optimal if improving the diversity is impossible without simultaneously impairing the average pairwise error. Analog to the partial convex hull introduced earlier, Pareto optimal classifier pairs approach the theoretical boundary as stated in Eq. 7, ultimately defining the Pareto frontier. Again, we used the *pareto_n* algorithm adapted from Kuncheva (2014) to obtain all classifier pairs determining the Pareto frontier (see Fig. 4).

#### Multi-verse optimization

The multi-verse optimization (MVO) algorithm is inspired by the alternative cosmological model stating that several big bangs created multiple, parallel existing universes, which are connected by black and white holes and wormholes [38]. In terms of an optimization algorithm, black and white holes are used to explore the search space and wormholes to refine solutions [38]. Moreover, the inflation rate, i.e., the fitness, of universes is used for the emergence of new holes; thus, to cope with local minima [49]. We selected the MVO owing to its superior performance compared to other state-af-the-art optimization algorithms, for instance, Particle Swarm Optimization [49]. An outstanding property of the MVO is the efficiency of searching and optimization [49]. This property is beneficial for the selection process, since the parameterized encoding groups form a large search space [20]. For more details, refer to Mirjalili et al. (2016) and Al-Madi et al. (2019) [38, 49].

We implemented the binary MVO following [49] using Python. Each solution candidate is represented as a binary vector, where each position denotes the path to an encoded dataset, that is, the *i*-th bit set means that the *i*-th encoding is included in the final ensemble (see Fig. 4). We examined different generations, i.e., 100, 80, 50, 25, and 15. However, we observed that performance depends mainly on the initialization and count of the universes. Specifically, the performance gain from the 15th generation is minor but requires much time. Thus, we set the optimization to a maximum of 15 generations with 32 universes each. Due to its resource intensity, we executed the MVO only for the first five folds (see section Monte Carlo cross-validation).

#### Best encodings

A further pruning method uses only the best classifier pairs. In particular, based on the kappa-error diagram, the algorithm selects 15 classifier pairs with the lowest pairwise average error (see Fig. 4).

#### Random encodings

The last pruning method selects 15 random classifier pairs from the kappa-error diagram. Note that the selection is only performed one time. That is, the pairs are the same across all folds.

### Statistics

We examined the areas covered by the respective base classifiers (see Fig. 4). To this end, we calculated the area for each fold. The area is described by multiple variables, i.e., the kappa diversity and the average pairwise error. Thus, we applied the multivariate analysis of variance (MANOVA) to verify if the areas differ significantly. If this is the case, we subsequently employed an analysis of variance (ANOVA) to investigate the effect of the diversity and the average error separated. For post-hoc assessment, Tukeyâ€™s HSD has been applied. We used the tests provided by the R standard library. \(\alpha\) was set to 0.05, i.e., *p* values \(\le 0.05\) are considered as significant.

In addition, we employed the Friedman test with the Iman and Davenport correction for the statistical comparison of multiple single and ensemble classifiers [54]. In the case at least one model is significantly different, we used the Nemenyi test for post-hoc analysis [54]. Refer also to SpĂ¤nig et al. (2021) for more details [20]. The tests were provided by the scmamp R package v0.2.55 [35].

Finally, we examined if the best ensemble has a significant improvement over the best single classifier using Studentâ€™s *t*-test for repeated measures, i.e., paired samples. Again, \(\alpha\) was defined as 0.05.

### Data visualization

All plots are realized using Altair v4.1.0 [55] and described in more detail hereinafter.

#### Kappa-error diagram

The kappa-error diagram, suggested by Margineantu and Dietterich (1997) [50], shows the result of a single split in the left column and a two-dimensional histogram aggregating all folds in the right column (see Fig. 4). The rows show the base classifiers. Note that the kappa-error shape depends only on the base classifiers (see Discussion). The left column also visualizes the partial convex hull (black line) and the Pareto frontier (red line). The colors refer to the pruning method. Each dot is a classifier pair trained on two encoded datasets. Note that we display only 1000 dots per panel. Moreover, we set the bin size to 40 for the binned heatmap with darker colors depicting more values.

#### XCD chart

The extended critical difference (XCD) chart (Fig. 2) is based on the critical difference chart introduced by Calvo and SantafĂ© (2016) [35]. Classifier groups not surpassing the critical difference (CD) are connected with black lines. The line thickness depicts the actual CD, meaning groups associated with thicker lines are closer to CD. The XCD charts present two classifier groups. The x-axis includes pruning types, and the y-axis the actual ensembles and the corresponding base classifier. The main area contains a categorical heatmap showing Matthews correlation coefficient (MCC) in 0.05 steps. The darker, the higher the MCC. The MCC is the median MCC of the respective group combination and corresponds to the median from Fig. 3. Note that for the computation of the CD, we concatenated the MCCs of all cross-validation runs, e.g., 12 * 100 MCCs for pfront, and 6 * 100 MCCs for bayes_voting_soft.

## Availability of data and materials

The source code can be found at https://github.com/spaenigs/ensemble-performance. All datasets are available at https://github.com/spaenigs/peptidereactor.

## References

Holmes AH, Moore LSP, Sundsfjord A, Steinbakk M, Regmi S, Karkey A, et al. Understanding the mechanisms and drivers of antimicrobial resistance. Lancet. 2016;387(10014):176â€“87. https://doi.org/10.1016/S0140-6736(15)00473-0.

SpĂ¤nig S, Eick L, Nuy JK, Beisser D, Ip M, Heider D, et al. A multi-omics study on quantifying antimicrobial resistance in European freshwater lakes. Environ Int. 2021;157: 106821. https://doi.org/10.1016/j.envint.2021.106821.

Kakkar M, Walia K, Vong S, Chatterjee P, Sharma A. Antibiotic resistance and its containment in India. BMJ (Online). 2017;358:25â€“30. https://doi.org/10.1136/bmj.j2687.

Qu J, Huang Y, Lv X. Crisis of antimicrobial resistance in China: Now and the future. Front Microbiol. 2019;10(SEP). https://doi.org/10.3389/fmicb.2019.02240.

Lazzaro BP, Zasloff M, Rolff J. Antimicrobial peptides: Application informed by evolution. Science. 2020;368(6490). https://doi.org/10.1126/science.aau5480.

Magana M, Pushpanathan M, Santos AL, Leanse L, Fernandez M, Ioannidis A, et al. The value of antimicrobial peptides in the age of resistance. Lancet Infect Dis. 2020;20(9):e216â€“30. https://doi.org/10.1016/S1473-3099(20)30327-3.

Waghu FH, Idicula-Thomas S. Collection of antimicrobial peptides database and its derivatives: Applications and beyond. Protein Sci. 2020;29(1):36â€“42. https://doi.org/10.1002/pro.3714.

Chung CR, Kuo TR, Wu LC, Lee TY, Horng JT. Characterization and identification of antimicrobial peptides with different functional activities. Brief Bioinform. 2020;21(3):1098â€“114. https://doi.org/10.1093/bib/bbz043.

Dean SN, Walper SA. Variational autoencoder for generation of antimicrobial peptides. ACS Omega. 2020;5(33):20746â€“54. https://doi.org/10.1021/acsomega.0c00442.

Fingerhut LCHW, Miller DJ, Strugnell JM, Daly NL, Cooke IR. ampir: an R package for fast genome-wide prediction of antimicrobial peptides. Bioinformatics. 2020;36(21):5262â€“3. https://doi.org/10.1093/bioinformatics/btaa653/5873588.

Aronica PGA, Reid LM, Desai N, Li J, Fox SJ, Yadahalli S, etÂ al. Computational Methods and Tools in Antimicrobial Peptide Research. J Chem Inf Model. 2021;1â€“00175. https://doi.org/10.1021/acs.jcim.1c00175.

Manavalan B, Basith S, Shin TH, Choi S, Kim MO, Lee G. MLACP: machine-learning-based prediction of anticancer peptides. Oncotarget. 2017;8(44):77121â€“36.

Manavalan B, Subramaniyam S, Shin TH, Kim MO, Lee G. Machine-Learning-Based Prediction of Cell-Penetrating Peptides and Their Uptake Efficiency with Improved Accuracy. J Proteome Res. 2018;17(8):2715â€“26. https://doi.org/10.1021/acs.jproteome.8b00148.

Gupta S, Sharma AK, Shastri V, Madhu MK, Sharma VK. Prediction of anti-inflammatory proteins/peptides: An insilico approach. J Transl Med. 2017;15(1). https://doi.org/10.1186/s12967-016-1103-6.

Manavalan B, Shin TH, Kim MO, Lee G. PIP-EL: A New Ensemble Learning Method for Improved Proinflammatory Peptide Predictions. Front Immunol. 2018;9:1783. https://doi.org/10.3389/fimmu.2018.01783.

Meher PK, Sahu TK, Saini V, Rao AR. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chouâ€™s general PseAAC. Sci Rep. 2017;7. https://doi.org/10.1038/srep42362.

Chaudhary K, Kumar R, Singh S, Tuknait A, Gautam A, Mathur D, etÂ al. A web server and mobile app for computing hemolytic potency of peptides. Sci Rep. 2016;6. https://doi.org/10.1038/srep22843.

Agrawal P, Kumar S, Singh A, Raghava GPS, Singh IK. NeuroPIpred: a tool to predict, design and scan insect neuropeptides. Sci Rep. 2019;9(1). https://doi.org/10.1038/s41598-019-41538-x.

SpĂ¤nig S, Heider D. Encodings and models for antimicrobial peptide classification for multi-resistant pathogens. BioData Min. 2019;12(1):1â€“29. https://doi.org/10.1186/s13040-019-0196-x.

SpĂ¤nig S, Mohsen S, Hattab G, Hauschild AC, Heider D. A large-scale comparative study on peptide encodings for biomedical classification. NAR Genomics Bioinforma. 2021;3(2). https://doi.org/10.1093/nargab/lqab039.

Khatun MS, Hasan MM, Shoombuatong W, Kurata H. ProIn-Fuse: improved and robust prediction of proinflammatory peptides by fusing of multiple feature representations. J Comput-Aided Mol Des. 2020;34(12):1229â€“36. https://doi.org/10.1007/s10822-020-00343-9.

Plisson F, RamĂrez-SĂˇnchez O, MartĂnez-HernĂˇndez C. Machine learning-guided discovery and design of non-hemolytic peptides. Sci Rep. 2020;10(1). https://doi.org/10.1038/s41598-020-73644-6.

Timmons PB, Hewage CM. HAPPENN is a novel tool for hemolytic activity prediction for therapeutic peptides which employs neural networks. Sci Rep. 2020;10(1). https://doi.org/10.1038/s41598-020-67701-3.

Singh O, Hsu WL, Su ECY. Co-AMPpred for in silico-aided predictions of antimicrobial peptides by integrating composition-based features. BMC Bioinformatics. 2021;22(1):389. https://doi.org/10.1186/s12859-021-04305-2.

Kuncheva LI. Combining Pattern Classifiers. 2014. https://doi.org/10.1002/9781118914564.

Chen Z, Zhao P, Li C, Li F, Xiang D, Chen YZ, et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res. 2021. https://doi.org/10.1093/nar/gkab122.

Schwarz J, Heider D. GUESS: projecting machine learning scores to well-calibrated probability estimates for clinical decision-making. Bioinformatics. 2019;35(14):2458â€“65. https://doi.org/10.1093/bioinformatics/bty984.

Heider D, Dybowski JN, Wilms C, Hoffmann D. A simple structure-based model for the prediction of HIV-1 co-receptor tropism. BioData Min. 2014;7(14).

LĂ¶chel HF, Riemenschneider M, Frishman D, Heider D. SCOTCH: subtype A coreceptor tropism classification in HIV-1. Bioinformatics. 2018;34(15):2575â€“80.

Su X, Xu J, Yin Y, Quan X, Zhang H. Antimicrobial peptide identification using multi-scale convolutional network. BMC Bioinformatics. 2019;20(1). https://doi.org/10.1186/s12859-019-3327-y.

Usmani SS, Bhalla S, Raghava GPS. Prediction of antitubercular peptides from sequence information using ensemble classifier and hybrid features. Front Pharmacol. 2018;9(AUG):1â€“11. https://doi.org/10.3389/fphar.2018.00954.

Nagpal G, Usmani SS, Dhanda SK, Kaur H, Singh S, Sharma M, etÂ al. Computer-aided designing of immunosuppressive peptides based on IL-10 inducing potential. Sci Rep. 2017;7. https://doi.org/10.1038/srep42851.

Kuncheva LI, Jain LC. Designing Classifier Fusion Systems by Genetic Algorithms. In: IEEE Transactions on evolutionary computation, vol.Â 4. 2000.

Touw WG, Bayjanov JR, Overmars L, Backus L, Boekhorst J, Wels M, et al. Data mining in the life science swith random forest: A walk in the park or lost in the jungle? Brief Bioinform. 2013;14:315â€“26. https://doi.org/10.1093/bib/bbs034.

Calvo B, SantafĂ© G. scmamp: Statistical Comparison of Multiple Algorithms in Multiple Problems. R J. 2016;8(1):248â€“56.

Zhu J, Zou H, Rosset S, Hastie T. Multi-class AdaBoost. Stat. Interface. 2009;2:349â€“60.

Liu L, Chen X, Wong KC. Early cancer detection from genome-wide cell-free DNA fragmentation via shuffled frog leaping algorithm and support vector machine. Bioinformatics. 2021;37(19):3099â€“105. https://doi.org/10.1093/bioinformatics/btab236.

Mirjalili S, Mirjalili SM, Hatamlou A. Multi-Verse Optimizer: a nature-inspired algorithm for global optimization. Neural Comput Applic. 2016;27(2):495â€“513. https://doi.org/10.1007/s00521-015-1870-7.

Kuncheva LI. A bound on kappa-error diagrams for analysis of classifier ensembles. IEEE Trans Knowl Data Eng. 2013;25(3):494â€“501. https://doi.org/10.1109/TKDE.2011.234.

KĂ¶ster J, Rahmann S. Snakemake-a scalable bioinformatics workflow engine. Bioinformatics. 2012;28(19):2520â€“2. https://doi.org/10.1093/bioinformatics/bts480.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2015;12:2825â€“30. https://doi.org/10.1145/2786984.2786995.

Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1). https://doi.org/10.1186/s12864-019-6413-7.

Xu QS, Liang YZ. Monte Carlo cross validation. Chemometr Intell Lab Syst. 2000;56:1â€“11.

Ren J, Lee SD, Chen X, Kao B, Cheng R, Cheung D. Naive bayes classification of uncertain data. In: Proceedings - IEEE International Conference on Data Mining. ICDM; 2009. p. 944â€“949. https://doi.org/10.1109/ICDM.2009.90.

James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning. 8th ed. New York: Springer; 2017.

Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification And Regression Trees. London: Routledge; 1984. https://doi.org/10.1201/9781315139470.

Breiman L. Random Forests. Mach Learn. 2001;45:5â€“32.

Wolpert DH. Stacked Generalization. Neural Netw. 1992;5:241â€“59.

Al-Madi N, Faris H, Mirjalili S. Binary multi-verse optimization algorithm for global optimization and discrete problems. Int J Mach Learn Cybern. 2019;10(12):3445â€“65. https://doi.org/10.1007/s13042-019-00931-8.

Margineantu DD, Dietterich TG. Pruning Adaptive Boosting. In: ICML. 1997;97:211â€“8.

Barber CB, Dobkin DP, Huhdanpaa H. The Quickhull Algorithm for Convex Hull. ACM Trans Math Softw. 1996;22(4):469â€“83.

Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, etÂ al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat Methods. 2020;17:261â€“272. https://doi.org/10.1038/s41592-019-0686-2.

Messac A, Ismail-Yahaya A, Mattson CA. The normalized normal constraint method for generating the Pareto frontier. Struct Multidiscip Optim. 2003;25(2):86â€“98. https://doi.org/10.1007/s00158-002-0276-1.

Santafe G, Inza I, Lozano JA. Dealing with the evaluation of supervised classification algorithms. Artif Intell Rev. 2015;44(4):467â€“508. https://doi.org/10.1007/s10462-015-9433-y.

VanderPlas J, Granger B, Heer J, Moritz D, Wongsuphasawat K, Satyanarayan A, et al. Altair: Interactive Statistical Visualizations for Python. J Open Source Softw. 2018;3(32):1057. https://doi.org/10.21105/joss.01057.

## Funding

Open Access funding enabled and organized by Projekt DEAL. This work was financially supported by the BMWi in the project MoDiPro-ISOB (16KN0742325). This work was also supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A).

## Author information

### Authors and Affiliations

### Contributions

SS and DH developed the concept. SS designed and performed the experiments and analyzed the data. SS and DH interpreted the results. AM implemented the MVO algorithm. SS wrote the manuscript. DH supervised the study and revised the manuscript. All authors read and approved the final manuscript.

### Corresponding author

## Ethics declarations

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

## Additional information

### Publisherâ€™s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Supplementary Information

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

## About this article

### Cite this article

SpĂ¤nig, S., Michel, A. & Heider, D. Unsupervised encoding selection through ensemble pruning for biomedical classification.
*BioData Mining* **16**, 10 (2023). https://doi.org/10.1186/s13040-022-00317-7

Received:

Accepted:

Published:

DOI: https://doi.org/10.1186/s13040-022-00317-7