The term benchmarking is used in machine learning (ML) to refer to the evaluation and comparison of ML methods regarding their ability to learn patterns in ‘benchmark’ datasets that have been adopted as ‘standards’. Benchmarking could be thought of simply as a sanity check to confirm that a new method successfully runs as expected and can reliably find simple patterns that existing methods are known to identify [1]. A more rigorous way to view benchmarking is as an approach to identify the respective strengths and weaknesses of a given methodology in contrast with others [2]. Comparisons could be made over a range of evaluation metrics, e.g., power to detect signal, prediction accuracy, computational complexity, and model interpretability. Benchmarking in this sense is important both for demonstrating new methodological abilities and for guiding the selection of an appropriate ML method for a given problem.
Benchmark datasets typically take one of three forms. The first is accessible, well-studied real-world data, taken from different real-world problem domains of interest. The second is simulated data, or data that has been artificially generated, often to ‘look’ like real-world data, but with known, underlying patterns. For example, the GAMETES genetic-data simulation software generates epistatic patterns of association in ‘mock’ single nucleotide polymorphism (SNP) data [3, 4]. The third form is toy data, which we will define here as data that is also artificially generated with a known embedded pattern but without an emphasis on representing real-world data, e.g., the parity or multiplexer problems [5, 6]. It is worth noting that the term ‘toy dataset’ has often been used to describe a small and simple dataset such as the examples included with algorithm software.
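To make the notion of a toy dataset concrete, the snippet below is a minimal sketch of how a parity-style toy dataset with a known embedded pattern might be generated; the function name and parameters are purely illustrative and are not taken from any cited software.

```python
import numpy as np

def make_parity_data(n_bits=6, n_samples=200, random_state=0):
    """Toy 'parity' dataset: the label is 1 when an odd number of the
    binary input features are set, and 0 otherwise (illustrative only)."""
    rng = np.random.default_rng(random_state)
    X = rng.integers(0, 2, size=(n_samples, n_bits))
    y = X.sum(axis=1) % 2  # embedded pattern: parity of the input bits
    return X, y

X, y = make_parity_data()
print(X.shape, y.mean())  # classes are roughly balanced on average
```

Because the generating pattern is known exactly, such data make it easy to check whether a method can recover a specific kind of structure, without any claim of resembling real-world data.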
While some benchmark repositories and datasets have emerged as more popular than others, ML still lacks a central, comprehensive, and concise set of benchmark datasets that accentuate the strengths and weaknesses of established ML methods. Individual studies often restrict their benchmarking efforts for various reasons, for example by comparing only variants of the ML algorithm of interest. The genetic programming (GP) community has likewise discussed what constitutes appropriate benchmarking when comparing GP methodologies [7–9]. Benchmarking efforts may focus on a specific application of interest, e.g. traffic sign detection [10], or on a more narrowly defined ML problem type, e.g. classification of 2-way epistatic interactions [11, 12]. The scope of benchmarking may also be limited by practical computational requirements.
There are currently a number of challenges that make it difficult to benchmark ML methods in a useful and globally accepted manner. For one, an overwhelming number of publications reference the use of benchmark datasets, yet surprisingly few discuss what constitutes appropriate ML benchmarking in general. Additionally, collecting and curating real-world benchmark datasets remains a challenge for many researchers [13]. Although repositories such as the UCI ML repository [14] and Kaggle [15] provide dozens of real-world datasets to download for free, these datasets come in myriad formats and require considerable preprocessing before ML methods can be applied to them. As a result, many benchmark datasets go unused simply because they are too difficult to preprocess. In addition, repositories such as Kaggle and OpenML [16] focus on solving data science problems through collaboration and are not designed with comprehensive ML benchmarking in mind. Further, although real-world benchmarks can be drawn from many different problem domains, from a strict data science perspective many of the datasets in these repositories share very similar meta-features (e.g. number of instances, number of features, number of classes, presence of missing data, and signal-to-noise ratio), so while they represent different real-world problems, they may not represent a diverse assembly of data science problems. This issue has been raised previously: when UCI datasets were applied as benchmarks, it was noted that the limited scope of the included datasets constrained method evaluation, and it was suggested that repositories such as UCI should be expanded [13, 17, 18].
Another challenge in benchmarking is that researchers often use only a handful of datasets when evaluating their methods, which can make it difficult to properly compare one ML method to state-of-the-art ML methods [13]. For example, these datasets may be handpicked to highlight the strengths of the proposed method while failing to demonstrate its potential weaknesses. As a result, although an ML method may perform well on a handful of datasets, it may fail to generalize to a broader range of problems. We submit that clearly identifying the limitations of an algorithm is just as important a part of benchmarking, yet it is often overlooked. While there will always be a need to identify and generate custom benchmarks for new or specialized problem domains, e.g. physical activity monitoring data [19] or dynamical systems simulation [20], it is vital for the bioinformatics and ML communities to have a comprehensive benchmark suite with which to compare and contrast ML methods. Toward this goal, the present study introduces the Penn Machine Learning Benchmark (PMLB), a publicly available dataset suite (hosted on GitHub) initialized with 165 real-world, simulated, and toy benchmark datasets for evaluating supervised classification methods. PMLB includes datasets from many of the most widely used ML benchmark suites, such as KEEL [21] and the UCI ML repository [14]. In addition to collecting data from these resources, PMLB standardizes the format of these data and provides useful interfaces for fetching datasets directly from the web.
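As an illustration of this fetching interface, the sketch below assumes the companion pmlb Python package is installed (e.g., via pip) and uses its fetch_data helper to download a dataset by name; the specific dataset name and keyword argument shown are examples rather than a definitive specification of the interface.

```python
# Sketch: pulling a PMLB dataset directly from the web, assuming the
# companion `pmlb` Python package is available (e.g. `pip install pmlb`).
from pmlb import fetch_data

# With return_X_y=True the helper returns feature and label arrays that
# can be passed straight to scikit-learn estimators.
X, y = fetch_data('mushroom', return_X_y=True)
print(X.shape, set(y))
```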
This initial PMLB repository is not meant to be comprehensive; it includes mainly real-world datasets and excludes regression datasets (i.e. those with a continuous-valued dependent variable), as well as any datasets with missing values. We have chosen to focus our initial assessment on available classification datasets. This paper includes a high-level analysis of the properties (i.e. meta-features) of the founding PMLB datasets, such as feature counts and class imbalance. Further, we evaluate the performance of 13 standard statistical ML methods from scikit-learn [22] over the full set of PMLB datasets. We then assess the diversity of these benchmark datasets from the perspective of their meta-features as well as the predictive performance of the ML methods applied to them. Beyond introducing a new, simplified resource for ML benchmarks, this study was designed to provide insight into the limitations of currently utilized benchmarks and to direct the expansion and curation of a future, improved PMLB dataset suite that more efficiently and comprehensively allows for the comparison of ML methods. This work provides another important step toward the assembly of an effective and diverse set of benchmarking standards integrating real-world, simulated, and toy datasets for generalized ML evaluation and comparison.
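To illustrate the style of evaluation described above, the following is a minimal sketch of a benchmarking loop over PMLB datasets; the two scikit-learn classifiers, the balanced-accuracy metric, and the dataset names are stand-ins for the full 13-method study rather than a reproduction of its exact protocol or hyperparameter settings.

```python
# Minimal sketch of a benchmarking loop: score each classifier on each
# benchmark dataset with cross-validation. Estimators, datasets, and the
# scoring metric are illustrative placeholders for the full study.
from pmlb import fetch_data
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

classifiers = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

for dataset in ['mushroom', 'waveform-21']:  # stand-ins for the full suite
    X, y = fetch_data(dataset, return_X_y=True)
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=5, scoring='balanced_accuracy')
        print(f'{dataset:15s} {name:20s} {scores.mean():.3f}')
```

Aggregating such per-dataset scores across methods is what allows the diversity of the benchmark suite to be assessed in terms of both meta-features and predictive performance.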