Benchmarking AutoML frameworks for disease prediction using medical claims

Objectives Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets. Materials and Methods We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics. Results The AutoML tools showed improvement from the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use-case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications. Discussion Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available features. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance. Conclusion Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application. Supplementary Information The online version contains supplementary material available at (10.1186/s13040-022-00300-2).


Background and significance
Leveraging big data growth in biomedicine and healthcare, machine learning (ML) has helped improve health outcomes, cut healthcare costs, and advance clinical research [1][2][3][4]. Studies applying ML to healthcare data range from models for disease prediction or for improving quality of care to applications such as detection of claim fraud [2][5][6][7][8]. Clinical big data used in these studies include electronic health records, medical records, and claims data. Many studies are limited to a single healthcare or hospital system [9][10][11][12].
Despite the demonstrated benefits of machine learning, different models need to be trained in the context of the problem to achieve good performance [13]. For each model, domain experts such as clinicians need to collaborate with data scientists to design ML pipelines [14]. Automated machine learning (AutoML) is an emerging field [15] that aims to simplify this labor-intensive process [16], which can accelerate the integration of ML in healthcare scenarios [1]. State-of-the-art AutoML platforms allow domain experts to design decently performing ML pipelines without deep knowledge of ML or statistics, while at the same time easing the burden of tedious manual tasks such as model selection and hyperparameter optimization for data scientists [14].
With ML being adopted across industries, standardized benchmarks and datasets are needed to compare competing systems [17]. These benchmark suites need to have datasets that highlight strengths and weaknesses of established ML methods [18]. Despite the emergence of numerous AutoML tools, there is still a need for standardized benchmarks in the field. Multiple studies have benchmarked various AutoML tools [14][19][20][21]. Notably, Gijsbers et al. [22] presented an open-source AutoML benchmark framework to provide objective feedback on the performance of different AutoML tools. Gijsbers et al. compared four AutoML tools across 39 public datasets, twenty-two of which are binary classification problems, with a mixture of balanced and imbalanced data. Of these, only two have very low prevalence for one class, at around 1.8% each. Most of these benchmark studies tested public datasets with sample sizes on the order of 10^3 and between 10 and 100 features. In contrast, our study uses a population of over 12 million and over 3,500 features.
Although different AutoML tools will perform differently depending on the problem, there is a need for benchmarks on datasets that have similar characteristics to healthcare data. Highly imbalanced and large datasets are common in healthcare, and thus these benchmarks will prove useful for accelerating the model-building process by identifying a good baseline model.
A review of published papers on AutoML showed that despite the potential applications and demonstrated need [23], little work has been done in applying AutoML to the field of healthcare [7]. Waring et al. determined the primary reasons for the lack of AutoML solutions for healthcare to be: (1) the lack of high-quality, representative, and diverse datasets, and (2) the inefficiency of current AutoML approaches for large datasets common in the biomedical environment. In particular, disease prediction problems often involve highly imbalanced datasets [24], which do not lend themselves well to predictive modelling. Disease prevalences are much lower than those of the public datasets used by Gijsbers et al.; the datasets we consider in this paper have positive prevalence ranging from 0.053% to 0.63%. Such extremely low prevalence does not give the models enough samples from the positive class to train on.

Objectives
To advance the use of AutoML tools in healthcare, there is a need to first assess their performance on representative datasets. Doing so brings to light the challenges and limitations of using these tools on healthcare data and serves as the basis for future improvements to better address problems in healthcare. In this study, we generated a dataset using claims data with 12.4M rows and 3.5k features. Using this, we compared the performance of different AutoML tools for predicting outcomes for different diseases of interest on datasets with high class imbalance.

Population
The population used in this analysis consisted of 12,425,832 people who were continuously enrolled in Medicare or Commercial plans from January 1, 2018 to December 31, 2019. Continuous enrollment in this period was required since the identification of the disease cohorts and the creation of features are heavily reliant on historic claims data. While it would have been ideal to ensure the completeness of each person's claims history, imposing a longer continuous enrollment criterion would have made fewer people eligible. Although features were created based on claims data from 2016 to 2018, completeness can only be guaranteed for data in 2018.
We aimed to predict whether a person will have the first occurrence of a specific disease at any point from January 1, 2019 to December 31, 2019. Patients who had prior diagnoses of the target disease before the prediction time were excluded. For example, those who had a diabetes diagnosis prior to 2019 would be excluded from the cohort for which we are predicting diabetes.

Target diseases
We aimed to predict the occurrence of six diseases in the prediction year: lung cancer, prostate cancer, rheumatoid arthritis (RA), type 2 diabetes (T2D), inflammatory bowel disease (IBD), and chronic kidney disease (CKD). Claims-based definitions were created for each target disease. Table 1 gives definitions for each disease, along with the corresponding prevalence and cohort size, presented in order of increasing prevalence. Disease flags are based on the International Classification of Diseases, Tenth Revision (ICD-10). Since the presence of a given ICD-10 code in a claim may simply be due to an event such as a screening test being ordered rather than truly indicative of a diagnosis, we required the presence of that disease code in at least two claims within a specified time period for most of the diseases under consideration. The second occurrence of the ICD-10 code is considered the confirmatory diagnosis for most diseases.

Data creation
Features were derived from the administrative claims history of members from 2016 to 2018. Each claim corresponds to a patient visit and contains information that describes the healthcare services rendered, such as diagnosis codes, procedure codes, medical supplies and equipment, and costs incurred. In this study, only the diagnosis codes were used as features. One claim can be associated with up to 12 diagnoses, which correspond to 12 unique ICD-10 codes, sequenced based on the severity of the illness. Only the first three diagnoses in each claim were considered, to ensure that only the diagnoses most clinically relevant to the health service being availed were used. Other ICD-10 codes are coded primarily for billing purposes, and typically have little to no relevance to the procedure or service.
Each diagnosis corresponds to an ICD-10 code, which can be up to 7 characters long. For each of the first three diagnoses, only the first three characters of the ICD-10 codes were used. The first three characters correspond to a broader classification of the diagnosis. For example, E10.2 corresponds to Type 1 diabetes with kidney complications while E10.65 is for Type 1 diabetes with hyperglycemia. Taking only the first three characters, these two ICD-10 codes would both fall under "Type 1 diabetes" (E10). Using only the first three characters of the ICD-10 codes allows us to create adequately sized groups of patients that have the same disease.
For each claim in the patient's entire history from 2016 to 2018, the first three characters of the first three ICD-10 codes were taken. From these three-character codes, indicator flags were created based on the presence or absence of each code in four time periods of varying lengths. Thus, each code corresponds to four flags in our dataset. Table 2 shows the time windows considered.
Binning diagnosis flags in different time windows was done to introduce a temporal component to the predictors. Older diagnoses were generally less relevant to the prediction of a disease. The presence of a particular diagnosis code in an earlier window does not guarantee that it will be present in the succeeding periods. Each flag is determined only by the presence of a claim with the relevant ICD-10 code within the corresponding time window, independently of the other windows. Demographic information such as gender, state-level socioeconomic index, and age in 2018 were also used as features in the analysis. In total, 3,511 features were created.
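The truncation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `icd10_category` and `claim_categories` are hypothetical, and the handling of the decimal point and letter case is our own choice rather than a detail given in the study.

```python
def icd10_category(code: str) -> str:
    """Return the three-character ICD-10 category of a full code."""
    # Assumption: codes may arrive with or without the decimal point.
    cleaned = code.strip().upper().replace(".", "")
    if len(cleaned) < 3:
        raise ValueError(f"not a valid ICD-10 code: {code!r}")
    return cleaned[:3]


def claim_categories(diagnoses: list[str], max_positions: int = 3) -> list[str]:
    """Categories of the first `max_positions` diagnoses on one claim, in order."""
    return [icd10_category(code) for code in diagnoses[:max_positions]]


# Both Type 1 diabetes codes from the text collapse to the same category.
print(icd10_category("E10.2"))
print(icd10_category("E10.65"))
# Only the first three diagnosis positions of a claim are kept.
print(claim_categories(["E10.65", "I10", "N18.3", "Z00.00"]))
```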
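As a rough sketch of how one flag per code and time window could be derived, consider the following. The window boundaries in `WINDOWS` are hypothetical placeholders (the actual periods are those listed in Table 2), and `window_flags` and the `<category>_<window>` naming are illustrative names, not taken from the study.

```python
from datetime import date

# Hypothetical windows, in days before the index date (exclusive start, inclusive end).
# These are placeholders for illustration, not the periods of Table 2.
WINDOWS = {
    "w1": (0, 90),
    "w2": (90, 180),
    "w3": (180, 365),
    "w4": (365, 1096),
}
INDEX_DATE = date(2019, 1, 1)  # start of the prediction year


def window_flags(claims, index_date=INDEX_DATE, windows=WINDOWS):
    """Map (claim_date, category) pairs to binary flags named '<category>_<window>'.

    Flags that are absent from the returned dict are implicitly 0.
    """
    flags = {}
    for claim_date, category in claims:
        days_before = (index_date - claim_date).days
        for name, (start, end) in windows.items():
            if start < days_before <= end:
                flags[f"{category}_{name}"] = 1
    return flags


claims = [
    (date(2018, 12, 1), "E11"),
    (date(2017, 5, 10), "E11"),
    (date(2018, 8, 15), "I10"),
]
flags = window_flags(claims)
print(flags)
```

Each window is evaluated independently, so a code seen in an early window sets only that window's flag.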

Benchmark framework
The flowchart in Fig. 1 shows the framework used to benchmark the different AutoML systems, adapted from [22] and modified to include a bootstrapping procedure to obtain 95% confidence intervals for each of the metrics considered. The features used for each model depended on the target outcome; flags corresponding to the ICD-10 code of the disease being predicted were excluded. For example, for lung cancer, all four features across different windows for the ICD code C34 were dropped. For each target disease, we generated a training set of 300,000 samples taken from the population of 12 million, maintaining the disease prevalence. The three AutoML tools (AutoSklearn [25], H2O [26] and TPOT [27]) and a random forest model were trained on the same training set for each disease. Random forests were used as a baseline in this study primarily because they were also used as the baseline model in the framework of Gijsbers et al. [22]. In addition, random forests are good baseline models because they generate reasonable predictions without much parameter tuning, and can handle large numbers of inputs and features. Another difference between our framework and the reference framework is that for each AutoML tool, we optimized for different metrics: average precision (an approximation of the area under the precision-recall curve, AUCPR), balanced accuracy, and area under the receiver operating characteristic curve (ROC AUC). H2O was optimized for AUCPR and AUC, the latter corresponding to ROC AUC. We did not optimize H2O for balanced accuracy because this metric was not included in its built-in scorers. This resulted in multiple models per target disease per tool instead of a single model optimized for ROC AUC. The random forest model was considered the baseline for comparison. The default settings were used for each tool, except for the maximum run-time, which we set at 48 hours for each model. All models were trained on identical 16-CPU 8-core Intel Xeon (2.3 GHz) machines with 256GB RAM.
The trained models were then used to predict outcomes for the remaining holdout dataset consisting of 11.7 million samples. For each model and target disease, bootstrapping was performed on the predictions to obtain 95% confidence intervals for each model metric. Samples were taken with replacement (both stratified and not stratified) from the holdout validation set to obtain 500 sets of 150,000 observations each. Metrics were then computed for the predictions of each model on each resampled dataset, yielding 500 values per metric per model, which were used to derive the 95% confidence intervals. We note that, due to the large dataset size and consequent time and resource requirements, we ran each AutoML tool once for each choice of optimization metric, so these are confidence intervals for the performance on the holdout data for each of these specific AutoML runs.
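The stratified bootstrap can be illustrated with a minimal standard-library sketch. It is not the study's implementation: the synthetic labels and scores, the helper names `roc_auc` and `stratified_bootstrap_ci`, and the simple index-based percentile rule are all assumptions made for illustration, and the demo uses far smaller sizes than the 500 resamples of 150,000 observations described above.

```python
import random


def roc_auc(y_true, scores):
    """ROC AUC via the rank-sum (Mann-Whitney) formula, with average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def stratified_bootstrap_ci(y_true, scores, metric, n_boot=500,
                            sample_size=None, alpha=0.05, seed=0):
    """Return (lower, median, upper) of `metric` over stratified bootstrap resamples."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    n = sample_size or len(y_true)
    # Stratified: each resample keeps the original class proportions.
    n_pos = max(1, round(n * len(pos) / len(y_true)))
    n_neg = n - n_pos
    values = []
    for _ in range(n_boot):
        idx = rng.choices(pos, k=n_pos) + rng.choices(neg, k=n_neg)
        values.append(metric([y_true[i] for i in idx], [scores[i] for i in idx]))
    values.sort()
    # Approximate percentiles by index into the sorted bootstrap values.
    lower = values[int((alpha / 2) * (n_boot - 1))]
    upper = values[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, values[n_boot // 2], upper


# Synthetic holdout set with 2% prevalence (illustrative only).
rng = random.Random(42)
y_true = [1] * 40 + [0] * 1960
scores = [rng.gauss(1.0, 1.0) if y == 1 else rng.gauss(0.0, 1.0) for y in y_true]
lo, med, hi = stratified_bootstrap_ci(y_true, scores, roc_auc, n_boot=200, sample_size=1000)
print(f"ROC AUC median {med:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

A non-stratified variant would draw all indices from the pooled set with a single `rng.choices` call; with very rare outcomes, some resamples may then contain few or no positives.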

Results
The bootstrapped metrics for the performance of the different models on the holdout set are shown in Figs. 2 and 3 for ROC AUC and AUCPR (the latter as approximated by the average precision), respectively. The same results can be seen in tabular form in Supplementary Tables 1 and 2, Additional File 1.
In both figures, diamond markers indicate the median metric scores for each model, while circle markers denote the lower and upper limits of the 95% confidence intervals calculated through bootstrapping. These figures show metrics computed using stratified bootstrap samples. There is minimal difference between the metrics obtained from stratified and non-stratified bootstrap samples. The results for non-stratified samples can be seen in Figs. 1 and 2 in Supplementary File 1. For ROC AUC, we observe varying performances across different diseases. In general, no single AutoML framework outperforms the rest consistently and by a wide margin. We also observe that disease prevalence is not directly correlated with model performance; the models with the highest ROC AUC scores were those for prostate cancer, which is the second least prevalent disease (0.12% prevalence). We also observe narrow confidence intervals for the models trained for predicting CKD, which has the highest prevalence. Wider confidence intervals generally correspond to lower disease prevalence, with the widest intervals observed for lung cancer (0.053% prevalence). Note that this is not always the case; for prostate cancer, all AutoSklearn and H2O models, and the TPOT model optimized for ROC AUC, have relatively narrow confidence intervals.
Since model scores and performance varied across diseases, we normalize the median ROC AUC scores based on the median random forest model performance, as done by Gijsbers et al. The results are shown in Table 3. The best performing models across diseases are either H2O models or the AutoSklearn model optimized for ROC AUC. However, for each disease the differences between the best model and the other models are small. In terms of ROC AUC improvements relative to the random forest models, greater improvements are observed for the less prevalent diseases. The median improvements for all AutoML models per disease are 1.136, 1.100, 1.083, 1.041, 1.078, and 1.036 for lung cancer, prostate cancer, rheumatoid arthritis, IBD, Type 2 Diabetes, and CKD, respectively. Due to the imbalance of the datasets, we also measure model performance on AUCPR. Low AUCPR scores are observed for all models, as seen in Fig. 3. The models for prostate cancer, which had narrow confidence intervals in terms of their ROC AUC scores, have wider confidence intervals for their bootstrapped AUCPR scores. Generally, H2O models had the highest median AUCPR scores. Taking note of the range of AUCPR values, however, there is no single model that outperforms the rest significantly across different diseases.
Table 4 shows the performance increases of the models relative to the median baseline scores of the random forest model. Despite the low AUCPR scores, we generally observe improvements in AUCPR compared to the baseline models, except for TPOT models optimized for balanced accuracy, especially those trained for predicting prostate and lung cancer. The median AUCPR improvements for all AutoML models per disease are 2.000, 1.567, 1.515, 1.224, 1.650, and 1.319 for lung cancer, prostate cancer, rheumatoid arthritis, IBD, Type 2 Diabetes, and CKD, respectively.
Beyond ROC AUC values, selecting the thresholds for each model is an essential step in evaluating a model for practical purposes. This is especially true when working with imbalanced data [28]. Although the AutoML output models are ready to generate hard predictions, in practice one must still consider the threshold that will give the best balance between true positive rate and false positive rate depending on the problem being solved. The actual ROC curves generated using the full validation set are shown in Fig. 4.
To illustrate, consider the case of predicting lung cancer, which has the lowest prevalence among the six diseases explored in this study. Lung cancer is often detected at an advanced stage when prognosis is poor and survival rates are low, making it one of the leading causes of cancer-related deaths in the United States. Several strategies that aim to detect the disease at an early stage, when intervention is most effective, are in place, chief of which are the rule-based screening guidelines provided by the National Comprehensive Cancer Network (NCCN) and the United States Preventive Services Task Force (USPSTF). However, even with these methods in place, only about 2% of annual [...] Though low-dose computed tomography (LDCT) can detect lung cancer at a treatable stage, it also poses several health risks, especially to those who are otherwise clear of the disease. These include unnecessary treatment, complications, and a theoretical risk of developing cancer from exposure to low-dose radiation. Thus, in building a predictive model for lung cancer, these associated costs must be considered together with the objective of identifying as many positive cases as possible. In other words, for this kind of problem, there is a need to minimize the number of false positives while trying to achieve a high true positive rate (TPR). After training an AutoML model using any tool, caution should still be exercised when deploying the model. Models typically provide predictive probabilities, and selecting the correct threshold for the application is necessary. Identifying the correct thresholds depending on the trade-offs between TPR and false positive rate (FPR) can be done by looking at the respective ROC curves, as seen in Fig. 4. We show different confusion matrices for the best performing model for predicting lung cancer in terms of ROC AUC in Supplementary Table 3, Additional File 1.
Thresholds are chosen based on deciles of the actual predicted probability values for the full validation dataset. Identifying the optimal threshold will depend on the costs of true positives, false positives and false negatives. We consider hypothetical dollar costs for the same model, noting that costs in terms of medical risks and quality of life are not included. We assume the per-person cost of the disease is $300,000 annually if not detected early (equivalent to the cost of a false negative), while if detected early, the cost will be $84,000 annually (equivalent to the cost of a true positive). For this situation, we also consider two hypothetical tests, one priced at $100 per test and LDCT, which costs about $500 on average. We compute savings relative to the baseline situation where no tests are administered (each person with lung cancer is associated with the cost of a false negative). Figure 5 plots the savings per person for each hypothetical test cost at different decile probability thresholds. The optimal thresholds for the model depend on the situation where the model will be used. For the $100 test, the optimal cut-off is at the 70th percentile, while for the $500 test, it is at the 90th percentile. For the $500 test, this cut-off is the only one that leads to positive savings. These cut-offs correspond to FPR = 0.3 and TPR = 0.9 for the $100 test, and FPR = 0.1 and TPR = 0.52 for the $500 test.
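The scaling behind Tables 3 and 4 can be sketched as follows, assuming it is a simple ratio of each model's median bootstrapped score to the random forest median (the reported values above 1 are consistent with this reading, but the exact formula is our assumption). The model names and score values below are made up for illustration.

```python
from statistics import median


def scaled_scores(boot_scores, baseline="random_forest"):
    """Scale each model's median bootstrapped score by the baseline model's median."""
    base = median(boot_scores[baseline])
    return {name: median(values) / base for name, values in boot_scores.items()}


# Hypothetical bootstrapped ROC AUC values for one disease (not results from the study).
boot_scores = {
    "random_forest": [0.60, 0.62, 0.64],
    "automl_a": [0.66, 0.68, 0.70],
    "automl_b": [0.61, 0.63, 0.65],
}
relative = scaled_scores(boot_scores)
for name, value in relative.items():
    print(f"{name}: {value:.3f}")
```

Under this reading, a value above 1 means the model improves on the random forest baseline for that disease, and the baseline itself scores exactly 1.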
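The savings calculation can be sketched as below. The dollar figures, the two operating points, and the 0.053% prevalence come from the text; the cost model itself is an assumption on our part: every flagged person is tested, true positives incur the early-detection cost plus the test, false positives incur only the test cost, and missed cases incur the late-detection cost. The function name `savings_per_person` is hypothetical.

```python
COST_FN = 300_000  # annual cost if the disease is not detected early
COST_TP = 84_000   # annual cost if the disease is detected early
PREVALENCE = 0.00053  # lung cancer prevalence (0.053%)


def savings_per_person(tpr, fpr, test_cost, prevalence=PREVALENCE):
    """Average savings per person relative to administering no tests at all."""
    tp = prevalence * tpr          # population share that is flagged and has the disease
    fp = (1 - prevalence) * fpr    # population share that is flagged but disease-free
    fn = prevalence - tp           # population share with the disease that is missed
    baseline = prevalence * COST_FN  # no tests: every case costs as a false negative
    with_model = tp * COST_TP + fn * COST_FN + (tp + fp) * test_cost
    return baseline - with_model


# Operating points reported for the 70th and 90th percentile cut-offs.
operating_points = {"70th percentile": (0.90, 0.30), "90th percentile": (0.52, 0.10)}
for test_cost in (100, 500):
    for label, (tpr, fpr) in operating_points.items():
        value = savings_per_person(tpr, fpr, test_cost)
        print(f"${test_cost} test, {label} cut-off: {value:+.2f} per person")
```

Under these assumptions, the cheaper test favors the more permissive 70th percentile cut-off, while the $500 test yields positive savings only at the stricter 90th percentile cut-off, in line with the pattern described above.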

Discussion
Since AutoML software packages are attractive out-of-the-box tools to build predictive models in the context of healthcare data, we examined and compared the performance of three of these tools (AutoSklearn, H2O, and TPOT) on a large medical claims dataset for six different disease outcomes. However, these datasets present several challenges. First, the sample size (~12.4M) is much larger than the typical size of datasets analyzed with AutoML. In this work we used a stratified sample of 300k for training, which is still quite large for AutoML given that these approaches are computationally intensive because they iterate over many different algorithms. For example, the number of generations completed by TPOT within the 48-hour time limit varied greatly for each target disease and scoring metric. The number of generations completed ranged from 7 to 38, with an average of 18.88 across 18 models. Running time for the different AutoML models varied depending on the initial conditions and target conditions. However, for most methods, the running time hit the 48-hour limit. Improvements in the scalability of these AutoML methods are certainly desirable in the context of medical claims data. Once training several AutoML models, each on a different and relatively large subsample of the dataset, becomes computationally feasible, combining the resulting models into an ensemble may provide further performance improvements.

Fig. 5 Average savings per person for different cut-off thresholds for the H2O (AUROC) model for different test costs. True positive costs are set at $84,000, while false negative costs are set at $300,000. False positive costs are only from the test costs

A second challenge is the extremely low case prevalence characteristic of healthcare data; in our examples, this varied from 0.053% to 0.63%. This may be the main culprit for the low AUCPR scores we observed across the methods and diseases. Improvements in handling highly imbalanced datasets are crucial for healthcare applications. One direction for future work is to explore combinations of over- and under-sampling techniques with ensemble approaches in the spirit of [28].
Another challenge, which may partly account for the poor performances observed among the models, stems from the limitations inherent to the features available in healthcare databases. Since claims are coded for billing purposes, some healthcare services are tied to a certain ICD-10 code which may not necessarily be indicative of the presence of a certain disease. For example, individuals who are eligible for cancer screening will have the screening procedure billed under a cancer ICD-10 code regardless of the result. Hence, individuals who do not have cancer will still have cancer codes in their claims history. This means that simply flagging the presence of these ICD-10 codes is not an accurate representation of the person's medical history. Using fewer selected features may help improve model performance. For example, retaining only features corresponding to ICD-10 codes clinically related to the disease being predicted can reduce the size of the feature set and allow the models to more easily establish relationships between the features and the target.

Conclusion
AutoML tools generally fast-track the ML pipeline, and the models they generate can serve as starting points for building predictors. However, the performance of these tools on the medical claims datasets used in this study suggests that there may be room for improvement in how AutoML tools handle data of this scale and with such high imbalance. To address the limitations of the data, further feature selection, resampling and imbalance-learning ensembles are possible next steps.
Despite the advantages of using AutoML tools for model selection and optimization, care must still be taken in identifying the optimal output thresholds depending on the research question.

Fig. 1
Fig. 1 Flowchart of framework for benchmarking AutoML tools, adapted from Gijsbers et al.

Fig. 2
Fig. 2 ROC AUC performance of different AutoML models trained for various disease outcomes from stratified bootstrap samples. Median values are indicated by diamond markers and 95% CIs are indicated by lines

Fig. 3
Fig. 3 AUCPR performance of different AutoML models trained for various disease outcomes from stratified bootstrap samples. Median values are indicated by diamond markers and 95% CIs are indicated by lines

Fig. 4
Fig. 4 Receiver operating characteristic (ROC) curves of models trained for predicting different diseases. ROC curves are generated using prediction scores on the full validation set (N = 12,125,832)

Table 2
Time periods for creating feature flags

Table 3
Median ROC AUC scores for different AutoML models scaled according to median random forest performance. Models with the best performance for each disease are indicated in bold

Table 4
Median AUCPR scores for different AutoML models scaled according to median random forest performance. Models with the best performance for each disease are indicated in bold