Research | Open Access
Probability calibration-based prediction of recurrence rate in patients with diffuse large B-cell lymphoma
BioData Mining volume 14, Article number: 38 (2021)
Abstract
Background
Although many patients receive good prognoses with standard therapy, 30–50% of diffuse large B-cell lymphoma (DLBCL) cases may relapse after treatment. Statistical or computational intelligence models are powerful tools for assessing prognosis; however, many cannot generate accurate risk (probability) estimates. Thus, probability calibration-based versions of traditional machine learning algorithms are developed in this paper to predict the risk of relapse in patients with DLBCL.
Methods
Five machine learning algorithms were assessed, namely, naïve Bayes (NB), logistic regression (LR), random forest (RF), support vector machine (SVM) and feedforward neural network (FFNN), and three methods were used to develop probability calibration-based versions of each of these algorithms, namely, Platt scaling (Platt), isotonic regression (IsoReg) and shape-restricted polynomial regression (RPR). Performance comparisons were based on the average results of a stratified holdout test repeated 500 times. We used the area under the ROC curve (AUC) to evaluate the discrimination ability (i.e., classification ability) of each model and assessed model calibration (i.e., risk prediction accuracy) using the Hosmer–Lemeshow (HL) goodness-of-fit test, the expected calibration error (ECE), the maximum calibration error (MCE) and the Brier score (BS).
Results
Sex, stage, IPI, KPS, GCB, CD10 and rituximab were significant factors predicting the 3-year recurrence rate of patients with DLBCL. Among the 5 uncalibrated algorithms, the LR (ECE = 8.517, MCE = 20.100, BS = 0.188) and FFNN (ECE = 8.238, MCE = 20.150, BS = 0.184) models were well-calibrated. The errors of the initial risk estimates of the NB (ECE = 15.711, MCE = 34.350, BS = 0.212), RF (ECE = 12.740, MCE = 27.200, BS = 0.201) and SVM (ECE = 9.872, MCE = 23.800, BS = 0.194) models were large. With probability calibration, the biased NB, RF and SVM models were well corrected. The calibration errors of the LR and FFNN models were not further improved regardless of the probability calibration method. Among the 3 calibration methods, RPR achieved the best calibration for both the RF and SVM models. The benefit of IsoReg was not obvious for the NB, RF or SVM models.
Conclusions
Although these algorithms all have good classification ability, several cannot generate accurate risk estimates. Probability calibration is an effective method of improving the accuracy of these poorly calibrated algorithms. Our risk model of DLBCL demonstrates good discrimination and calibration ability and has the potential to help clinicians make optimal therapeutic decisions to achieve precision medicine.
Background
Diffuse large B-cell lymphoma (DLBCL) remains a clinical challenge due to its heterogeneous manifestations and prognosis [1, 2]. Although durable remission can be obtained in more than 50% of cases, relapse still occurs in 30–50% of patients receiving standard therapy, which dramatically reduces their survival rates [3, 4]. Autologous hematopoietic stem cell transplantation (AHSCT), second-line therapy or clinical trials are recommended for these patients with poor responses [5, 6]. Accurate prediction of the risk of recurrence in DLBCL patients is crucial to clinical decision-making and is part of a growing trend toward precision medicine [7]. If patients at high risk of recurrence can be identified as early as possible, their prognoses could be effectively improved by taking appropriate measures, e.g., AHSCT. Given that many cases relapse within 3 years, a model that can predict the 3-year recurrence rate of DLBCL patients is urgently required.
Statistical or computational models are powerful tools for assessing patient prognosis by simultaneously considering a number of individual features, such as demographic characteristics, disease symptoms and laboratory results. Although many studies have applied statistical models for clinical prediction, most have focused only on whether an event of interest will occur and have ignored the estimation of the absolute risk of this event. In many scenarios, we need both to recognize whether an event will occur and to obtain the membership probability, which is critical for further decision-making. For example, rather than providing a vague prognosis of survival, if we are able to predict that a patient's 3-year survival rate with a given therapy is 50.1%, we may switch regimens early and choose a more effective one. Accurate risk prediction is critical for achieving precision medicine and can help clinicians make optimal therapeutic determinations. Given accurate information, appropriate therapies may be initiated sooner, thereby preventing unnecessary exposure to ineffective drugs, ultimately improving the clinical outcomes of individual patients and extending their survival times [7,8,9].
Such a clinical prediction model should be characterized by correctly distinguishing patients who will have an event from those who will not (i.e., discrimination) and by accurately estimating the absolute risk of the event (i.e., calibration) [10]. Discrimination and calibration are both necessary components of the accuracy of a risk prediction model. However, in practice, a model with good classification ability may not necessarily generate precise probability estimates; random forest and support vector machine models are typical examples. Fortunately, these biased algorithms can be corrected by probability calibration methods. Probability calibration attempts to find a mapping function that transforms the initial risk estimates into more accurate posterior probabilities. With probability calibration, it is possible to accurately estimate the risk of recurrence of DLBCL patients even for a poorly calibrated algorithm.
Many approaches have been proposed for the probability calibration problem. Among them, Platt scaling (Platt) is a popular parametric method originally proposed for SVM models [11]. Platt transforms the initial prediction into an accurate posterior probability by using a sigmoid function. This method performs well when the distribution of the original probabilities is sigmoid-shaped. Isotonic regression (IsoReg), the monotone extension of histogram binning (HistBin), is a popular nonparametric method [12, 13]. Since the only restriction is that the calibration function be isotonic (i.e., nondecreasing), IsoReg can calibrate any classifier. Subsequently, Jiang [14] proposed smooth isotonic regression (SmoIsoReg), a continuous extension of IsoReg. SmoIsoReg first trains an IsoReg model and selects a set of representative points based on the piecewise-constant solution generated by IsoReg; the calibration function is then estimated by applying the PCHIP interpolation algorithm [15] to fit these points. In addition, state-of-the-art approaches such as Bayesian binning in quantiles (BBQ), GUESS and shape-restricted polynomial regression (RPR) have also been proposed to calibrate predictive models. BBQ [16, 17] integrates multiple HistBin models with different numbers of bins to generate calibrated probabilities. GUESS [18] first fits the distribution of the original scores of each class and then uses Bayes' theorem to compute the probability (i.e., the calibrated probability) that a certain score belongs to the class of interest. RPR [19] uses a polynomial function as the calibration function and can theoretically calibrate initial predictions of any distribution as the polynomial degree increases. In this article, the popular parametric method Platt, the popular nonparametric method IsoReg and the flexible RPR were used to calibrate the risk prediction model for accurately predicting the 3-year recurrence rate of DLBCL patients.
Overall, we will use 5 traditional machine learning algorithms to predict the 3-year recurrence rate of patients with DLBCL: naïve Bayes (NB), logistic regression (LR), random forest (RF), support vector machine (SVM) and feedforward neural network (FFNN) models. Previous studies have shown that all of these algorithms have good classification ability; however, to our knowledge, they are rarely used for risk estimation. Thus, we will explore their calibration performance using our real-world data. Moreover, three methods (i.e., Platt, IsoReg and RPR) will be applied to develop probability calibration-based versions of each of the above algorithms. We will use the Hosmer–Lemeshow (HL) goodness-of-fit test, the expected calibration error (ECE), the maximum calibration error (MCE) and the Brier score (BS) to comprehensively assess the accuracy of the risk predictions. We will also explore the performance of all models over different probability intervals.
This research has three objectives. First, unlike other studies that focused only on the prediction of categories, we aim to generate accurate probability estimates. Second, instead of using traditional methods, we will develop probability calibration-based machine learning algorithms for risk prediction. Third, both discrimination and calibration will be considered in the performance evaluation.
Methods
Study populations and predictors
The dataset used in this study was provided by Shanxi Cancer Hospital, China. A total of 510 patients diagnosed with DLBCL between 2011 and 2017 were included in the model construction. Of these, 181 cases experienced relapse within 3 years. We collected 15 features for each patient from their electronic medical records. Table 1 shows the name and grouping of each feature.
We employed an LR model and the RF algorithm to analyze these variables. The LR model can detect possible causal relationships between variables and identify important variables related to the outcome [20]. Table 2 shows the variables selected by the LR model with a significance threshold of 0.1. Sex, stage, IPI, KPS, GCB, CD10 and rituximab were significant factors for recurrence in DLBCL patients within 3 years. Except for stage II, the P values of the other variables were all less than 0.05.
The RF algorithm can perform feature selection by analyzing the importance of variables [20, 21]. In this research, the mean decrease of accuracy and the mean decrease of the Gini index were selected to measure variable importance. The former calculates the average reduction in the prediction accuracy of the model on the out-of-bag (OOB) samples after a certain variable is randomly permuted. The larger the mean decrease of accuracy, the more important the variable is to the model. The Gini index, which reflects the likelihood that two samples taken at random from a dataset will have different labels, is used to measure the impurity of the data. The mean decrease of the Gini index calculates the average reduction in node impurity across all decision trees after a certain variable is used as the partition attribute. The larger the value, the more important the variable is to the model.
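As a minimal sketch (on toy data, not our cohort), the two importance measures can be obtained with scikit-learn: the impurity-based `feature_importances_` corresponds to the mean decrease of the Gini index, while permutation importance is an analogue of the OOB-based mean decrease of accuracy, not an exact reimplementation of it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 3)).astype(float)   # 3 toy coded features
y = (X[:, 0] + rng.normal(0, 1, 200) > 1.5).astype(int)  # outcome driven by feature 0

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)

# Mean decrease of the Gini index (impurity-based importance)
gini_importance = rf.feature_importances_

# Analogue of the mean decrease of accuracy via permutation importance
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
acc_importance = perm.importances_mean
```

Both rankings should place the informative feature first; in practice the two measures can disagree for correlated or many-level categorical predictors.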
Figure 1 shows the ranking of variable importance. For comparison with the result of the LR model, we focused only on the top 7 variables of each ranking. The union of the two rankings contained 10 variables, including the 7 variables selected by the LR model as well as WBC, Ki67 and β_{2}MG. Regardless of which importance measure was used, IPI and stage were ranked in the top 2, and both rankings contained WBC and KPS.
Based on the results of these two methods, we first used the variables selected by the LR model (sex, stage, IPI, KPS, GCB, CD10 and rituximab) as the predictors of the risk model. With these 7 variables, we pre-trained the 5 machine learning algorithms 100 times. Then, we further incorporated the WBC, Ki67 and β_{2}MG variables into each algorithm to observe changes in performance. Since the predictive performance of the models was not significantly improved after including these 3 variables, we excluded them for the sake of model simplicity. Finally, sex, stage, IPI, KPS, GCB, CD10 and rituximab were used as the predictors of the 3-year recurrence rate of patients with DLBCL.
Five machine learning algorithms
Five common machine learning algorithms that showed good classification ability in previous reports were explored, namely, the NB, LR, RF, SVM and FFNN models.
The NB classifier [22] calculates the posterior probability that an example belongs to each class according to Bayes' theorem and assigns the example to the class with the largest posterior probability. Despite the "regression" in its name, the LR model [23] belongs to a class of generalized linear models that solve classification tasks. Since it uses the logistic function as the link function, LR can generate the posterior probability that an observation belongs to a certain class.
The RF algorithm [24] generates a series of "bootstrap" datasets of the same size as the original data by sampling with replacement and develops a decision tree on each bootstrapped dataset. The results of all trees are combined by voting (classification) or averaging (regression) to obtain the final prediction. In this research, the voting ratio of all decision trees was used as the probability estimate of the RF algorithm.
The SVM model [25], a generalization of the maximal margin classifier, attempts to find a separating hyperplane that partitions samples into different classes. The SVM classifies examples according to their scores s(x), which are proportional to the distance from x to the separating hyperplane. The sign of the score determines the category, and its magnitude can also be used as a measure of predictive confidence, since an example far from the separating hyperplane is more likely to be classified correctly [13]. Although s(x) ∈ R, the scores can be scaled into the interval between 0 and 1 by using min-max normalization.
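The min-max rescaling of the raw SVM scores can be sketched as follows (toy one-dimensional data; the resulting values lie in [0, 1] but, as discussed later, are not yet calibrated probabilities):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear").fit(X, y)
s = svm.decision_function(X)             # raw scores, s(x) in R
p = (s - s.min()) / (s.max() - s.min())  # min-max normalized scores in [0, 1]
```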
An artificial neural network (ANN) [26] consists of a number of simple adaptive units arranged in a widely parallel interconnected network. The FFNN is a common network structure in which the units in each layer are fully connected to the units in the next layer and the structure contains no loops. In this study, we developed a 3-layer network comprising one input layer, one hidden layer and one output layer. The hidden layer contained 1000 units, and the output layer consisted of a single unit that used the sigmoid function as the activation function. Our FFNN had a large number of hidden units because a network with excess capacity generalizes better than a simple network when trained with backpropagation and early stopping [27,28,29]. Studies have shown that a multilayer feedforward network with a single hidden layer containing enough neurons can approximate a continuous function of arbitrary complexity [30].
Three probability calibration methods
We employed 3 methods (Platt, IsoReg, and RPR) to develop probability calibration-based versions of the above 5 machine learning algorithms. A total of 20 models were established in our research, including the 5 uncalibrated algorithms.
Probability calibration tries to find a mapping function that transforms the initial probability estimate or score of a classifier into a more accurate prediction, i.e., it seeks a calibration function f that satisfies the following objective [31]:

\( f(s)=P\left(Y=1|s\right) \)

where s is the initial probability estimate or score of an example x, and P(Y = 1 | s) is the true probability that this example belongs to the category of interest (i.e., Y = 1).
Platt maps the original prediction to an accurate posterior probability by using a sigmoid function [11]. The calibrated probability is generated by the following function:

\( f(s)=\frac{1}{1+\exp \left( As+B\right)} \)

The parameters A and B are estimated by maximum likelihood estimation (MLE) on the calibration training set \( {\left\{\left({s}_i,{y}_i\right)\right\}}_{i=1}^N \). To avoid overfitting, y_{i} = (N_{+} + 1)/(N_{+} + 2) if the example belongs to the positive class; otherwise, y_{i} = 1/(N_{−} + 2). Constants N_{+} and N_{−} are the numbers of positive and negative examples in the training data, respectively.
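The fitting step above can be sketched directly: maximize the likelihood of the smoothed targets over (A, B) with a generic optimizer. This is an illustrative implementation on toy scores, not the paper's code.

```python
import numpy as np
from scipy.optimize import minimize

def fit_platt(scores, labels):
    """Fit p = 1 / (1 + exp(A*s + B)) by MLE with Platt's target smoothing."""
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # smoothed targets to avoid overfitting, as in the text
    t = np.where(labels == 1, (n_pos + 1) / (n_pos + 2), 1 / (n_neg + 2))

    def nll(ab):
        a, b = ab
        p = 1 / (1 + np.exp(a * scores + b))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))

    a, b = minimize(nll, x0=[-1.0, 0.0]).x
    return lambda s: 1 / (1 + np.exp(a * s + b))

scores = np.array([-2.0, -1.5, -0.5, 0.3, 1.0, 2.2])   # toy raw SVM scores
labels = np.array([0, 0, 0, 1, 1, 1])
calib = fit_platt(scores, labels)
```

Because higher scores correspond to the positive class, the fitted A is negative and the calibrated probability increases with the score.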
IsoReg calibrates the initial prediction by using an isotonic (nondecreasing) function f that satisfies the following restriction [13]:

\( f=\arg {\min}_g\sum_{i=1}^N{\left({y}_i-g\left({s}_i\right)\right)}^2 \), where g is nondecreasing

The pair-adjacent violators (PAV) algorithm is often used to estimate the isotonic function [32]. With this algorithm, the examples are first sorted according to their initial predictions, and all positive examples are assigned a probability of 1 while all negative examples are assigned a probability of 0. This yields a sequence of assigned probabilities, i.e., y = [y_{1}, y_{2}, …, y_{N}]. Subsequently, each pair-adjacent violator is recursively replaced with the average of its assigned probabilities; e.g., if y_{n} > y_{n + 1} (a pair-adjacent violation), both are updated with their average. This replacement is executed recursively until the sequence is nondecreasing, i.e., f(s_{1}) ≤ f(s_{2}) ≤ … ≤ f(s_{N}). The result is a piecewise-constant solution over the interval of initial predictions. To predict a new example x, we find the ith interval in which s(x) is located and assign f(i) as the calibrated probability for this example.
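The PAV procedure described above can be written compactly as follows (scikit-learn's `IsotonicRegression` implements the same idea); this sketch uses weighted block merging:

```python
import numpy as np

def pav(scores, labels):
    """Pair-adjacent violators: returns calibrated values in sorted-score order."""
    order = np.argsort(scores)
    y = labels[order].astype(float)   # 0/1 assigned probabilities
    blocks = [(v, 1.0) for v in y]    # (block value, block weight)
    i = 0
    # merge adjacent violators until the sequence is nondecreasing
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:
            v1, w1 = blocks[i]
            v2, w2 = blocks.pop(i + 1)
            blocks[i] = ((v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2)
            i = max(i - 1, 0)         # a merge can create a new violation upstream
        else:
            i += 1
    # expand the piecewise-constant solution back to one value per example
    return np.concatenate([[v] * int(w) for v, w in blocks])

scores = np.array([0.1, 0.3, 0.4, 0.6, 0.8, 0.9])
labels = np.array([0, 1, 0, 1, 1, 1])
f = pav(scores, labels)
```

Note that PAV preserves the total number of positives (the fitted values sum to the label sum), which is one way to sanity-check an implementation.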
Compared with Platt and IsoReg, RPR is a more flexible and powerful method that uses a polynomial function to calibrate a classifier [19]:

\( f(s)=\sum_{j=0}^k{a}_j{s}^j \)

The polynomial coefficients a are solved by the following optimization problem:

\( {\min}_{\boldsymbol{a}}\sum_{i=1}^N{\left(f\left({s}_i\right)-{y}_i\right)}^2 \), subject to (a) 0 ≤ f(s) ≤ 1, (b) f′(s) ≥ 0, and (c) \( {\left\Vert \boldsymbol{a}\right\Vert}_1\le \lambda \)

All calibrated probabilities are forced to fall in the interval between 0 and 1 by restriction (a). Restriction (b), which follows from the differentiability of f(s), ensures the monotonicity of the calibration function. In restriction (c), an l_{1}-norm of the coefficients is used to avoid overfitting of the polynomial.
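The optimization above can be approximated with soft penalties rather than hard constraints: the range and monotonicity restrictions are enforced on a grid of points, and the l1 term is added as a penalty. This is a simplified sketch on toy data (the degree, λ and penalty weight are illustrative), not the paper's exact constrained solver.

```python
import numpy as np
from scipy.optimize import minimize

def fit_rpr(scores, labels, degree=5, lam=0.01, penalty=1e3):
    grid = np.linspace(0, 1, 101)
    P = np.vander(scores, degree + 1, increasing=True)  # f(s_i) design matrix
    G = np.vander(grid, degree + 1, increasing=True)    # f on the grid
    D = G[:, :degree] * np.arange(1, degree + 1)        # f'(s) on the grid

    def objective(a):
        loss = np.sum((P @ a - labels) ** 2) + lam * np.sum(np.abs(a))
        g, d = G @ a, D @ a[1:]
        # soft versions of restrictions (a) 0 <= f <= 1 and (b) f' >= 0
        viol = (np.sum(np.clip(-g, 0, None) ** 2)
                + np.sum(np.clip(g - 1, 0, None) ** 2)
                + np.sum(np.clip(-d, 0, None) ** 2))
        return loss + penalty * viol

    a0 = np.zeros(degree + 1)
    a0[0] = labels.mean()                               # start from a constant fit
    return minimize(objective, a0, method="Powell").x

scores = np.linspace(0.05, 0.95, 20)                    # toy initial predictions
labels = (scores > 0.5).astype(float)
coef = fit_rpr(scores, labels)
fitted = np.vander(scores, coef.size, increasing=True) @ coef
```

A dedicated constrained solver (as in [19]) enforces the restrictions exactly; the penalty form here only approximates them.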
Model construction
All models were constructed and evaluated by using a stratified holdout test. We randomly sampled two-thirds of the observations (340) as the training data and the remaining observations (170) as the testing data. To ensure the consistency of the data distribution, stratified sampling was used to partition the data. To reduce statistical variability, the partition and evaluation were repeated 500 times, and the performance comparison was based on the average results of the 500 holdout tests.
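The repeated stratified holdout scheme can be sketched as follows, using the study's class counts (181 relapses among 510 patients) but a placeholder feature matrix and only 10 of the 500 repetitions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 181 + [0] * 329)    # 181 relapses among 510 patients
X = np.arange(510).reshape(-1, 1)      # placeholder feature matrix

results = []
for seed in range(10):                 # 500 repetitions in the study
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=170, stratify=y, random_state=seed)
    results.append((len(y_tr), y_tr.mean(), y_te.mean()))
```

Stratification keeps the relapse rate nearly identical in the training and testing partitions on every round.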
We first developed traditional NB, LR, RF, SVM and FFNN models for risk prediction. Three-fold cross-validation was performed on the training data to determine the optimal hyperparameters of the RF, SVM and FFNN models. For the RF, the choices for the number of candidate attributes at each node partition were {2, 3}, and the number of decision trees was selected from {500, 600, 700, …, 1500}. For the SVM, the kernel was selected from the linear and Gaussian kernels, and the search space for the parameters C and gamma was \( {\left\{{10}^i\right\}}_{i=-4}^4 \). For the FFNN, the training epoch was determined by the validation sets. Subsequently, we used all training data to fit the NB and LR models and trained the RF, SVM and FFNN models with the determined hyperparameters. Finally, we assessed their performance on the testing data. To extract the predicted values of each model on the validation sets, we also performed 3-fold cross-validation on the training set for the NB and LR models, although they have no hyperparameters to determine.
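The 3-fold hyperparameter search can be sketched with scikit-learn's grid search; the toy data and the abbreviated search spaces below are illustrative (the study searches {500, 600, …, 1500} trees):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 7))          # 7 predictors, toy data
y = (X[:, 0] + X[:, 1] > 0).astype(int)

grid = {"max_features": [2, 3],        # candidate attributes per node partition
        "n_estimators": [500, 700]}    # subset of {500, 600, ..., 1500}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      grid, cv=3).fit(X, y)
```

The same pattern applies to the SVM (kernel, C, gamma); `search.best_params_` then parameterizes the model refit on the full training data.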
Then, we developed probability calibration-based versions of the above 5 algorithms. To avoid overfitting, we used the union of the predicted values on the 3 validation sets of each algorithm as the training set of the calibration function. We first employed 3-fold cross-validation on this calibration training set to determine the optimal hyperparameters of the RPR. The choices for the polynomial degree k were {4, 5, …, 20}, and the choices for the regularization constant λ were \( {\left\{{4}^i\right\}}_{i=0}^5 \). Subsequently, we used all calibration training data to fit Platt, IsoReg and the RPR with the determined k and λ. Finally, we calibrated the predicted values of the 5 algorithms on the testing set by using the trained Platt, IsoReg and RPR models and then assessed their performance.
Model evaluation
Although our purpose is to generate accurate risk estimates, classification ability is the foundation of a prediction model: when a model has poor discrimination ability, the accuracy of its predicted probabilities need not be further evaluated [10]. Thus, both the discrimination and the calibration of each model were considered in the performance evaluation. Discrimination is the ability to differentiate those at lower risk of an event of interest from those at higher risk. Calibration measures the similarity between the predicted risk and the true risk of patients in different risk strata. In our study, we used the AUC to assess discrimination and measured calibration by using the HL test, ECE, MCE and BS.
The HL test, ECE and MCE are metrics related to the calibration plot. To calculate these metrics, all examples are first sorted according to their predictions and then divided into k bins of similar size. In each bin, the predicted risk is the mean of the predictions of all examples in the bin, and the true or observed risk is the proportion of positive members in the bin. The HL test measures whether the difference between the predicted risk and the true risk can be attributed to sampling error [33]:

\( {C}_{H-L}=\sum_{i=1}^k\sum_{c=0}^1\frac{{\left({O}_i^c-{P}_i^c\right)}^2}{P_i^c} \)

\( {O}_i^c \) is the number of cases with c = 0 or c = 1 in the ith bin. \( {P}_i^c \) is the sum of the predicted probabilities for c = 0 or c = 1 in the ith bin. The statistic C_{H − L} is then compared to a chi-square distribution with k − 2 degrees of freedom. The ECE and MCE calculate the average and maximum prediction errors over these bins, respectively [17]:

\( ECE=\frac{1}{k}\sum_{i=1}^k\left|{o}_i-{p}_i\right|\times 100\% \)

\( MCE={\max}_{i\in \left\{1,\dots, k\right\}}\left|{o}_i-{p}_i\right|\times 100\% \)

Here, p_{i} and o_{i} are the predicted risk and the observed risk in the ith bin, respectively. The BS is another metric for assessing the calibration of a model:

\( BS=\frac{1}{M}\sum_{m=1}^M{\left({p}_m-{y}_m\right)}^2 \)

Here, p_{m} is the predicted risk of an example and y_{m} is the true label of that example. Lower ECE, MCE and BS values correspond to smaller prediction errors.
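The binned metrics above can be sketched compactly (toy example; this follows the unweighted-average ECE form used here and is illustrative, not the study's code):

```python
import numpy as np
from scipy.stats import chi2

def calibration_metrics(p, y, k=5):
    order = np.argsort(p)
    bins = np.array_split(order, k)        # k bins of similar size
    errors, hl = [], 0.0
    for b in bins:
        errors.append(abs(p[b].mean() - y[b].mean()))
        for c, q in ((1, p[b]), (0, 1 - p[b])):
            obs = np.sum(y[b] == c)        # O_i^c: observed count for class c
            exp = q.sum()                  # P_i^c: summed predicted probability
            hl += (obs - exp) ** 2 / exp
    ece = 100 * np.mean(errors)            # expected calibration error (%)
    mce = 100 * np.max(errors)             # maximum calibration error (%)
    bs = np.mean((p - y) ** 2)             # Brier score
    p_value = 1 - chi2.cdf(hl, df=k - 2)   # HL goodness-of-fit P value
    return ece, mce, bs, p_value

p = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
ece, mce, bs, hl_p = calibration_metrics(p, y)
```

An HL P value above 0.05 would be read as "well-calibrated" under the criterion used in the Results.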
Results
We first developed the NB, LR, RF, SVM and FFNN models and then used 3 methods (Platt, IsoReg, and RPR) to construct probability calibration-based versions of these algorithms. The performance comparison was based on the average results of the holdout test repeated over 500 rounds. A model whose HL test P value was greater than 0.05 was defined as well-calibrated.
Five traditional machine learning algorithms
As shown in Table 3, the AUCs of the 5 algorithms were approximately 0.75, suggesting that they achieved useful discrimination. Except for the SVM, the AUCs of the other 4 algorithms were all greater than 0.75. In terms of the AUC, the FFNN had the best classification capacity, followed by the NB model.
In terms of calibration, the LR and FFNN models were well calibrated. Between these two algorithms, both the ECE and BS values of the FFNN were lower than those of the LR model, whereas its MCE value was slightly higher. By comparison, the NB, RF and SVM models were poorly calibrated and had large errors in their probability estimates. Among them, the NB model had the lowest accuracy (ECE = 15.711, MCE = 34.350, BS = 0.212), followed by the RF model (ECE = 12.740, MCE = 27.200, BS = 0.201).
Probability calibrationbased models
Since the Platt, IsoReg and RPR methods do not change the order of the predictions of the examples, the AUCs of all calibrated models will not be discussed in this section. The results are shown in Table 4.
Through probability calibration, the errors of the NB, RF and SVM models decreased significantly, especially for the NB model. Except for the BS value of the LR model, the calibration errors of the LR and FFNN models were not further decreased, regardless of the probability calibration method. Of the 3 calibration methods, RPR achieved the best correction for the RF and SVM models in terms of the ECE, MCE and BS metrics. For the NB algorithm, NB-RPR had the lowest ECE, NB-Platt had the lowest MCE, and the BS values of the two models were identical. For the 3 poorly calibrated algorithms (NB, RF and SVM), the correction effects of IsoReg were not obvious: the ECEs of the NB-IsoReg, RF-IsoReg and SVM-IsoReg models decreased compared with those of the uncalibrated models, whereas their MCEs increased to different degrees. In addition, the BS value of SVM-IsoReg was higher than that of the uncalibrated model, while the BS values of NB-IsoReg and RF-IsoReg were lower than or equal to those of the uncalibrated models.
Improvement of the calibration
We further explored the improvement in model calibration after probability calibration. In terms of the HL test, if the result for a model was not statistically significant (P > 0.05), the model was defined as well-calibrated; otherwise, it was defined as poorly calibrated. Since the LR and FFNN models were already well-calibrated, their calibrated versions are not discussed in this section. The results are shown in Fig. 2.
Among the 5 uncalibrated models, the FFNN had the highest frequency (403) of achieving well-calibrated performance out of 500 evaluations, followed by the LR model (341). By comparison, the frequencies of the NB, RF and SVM models were 1, 0 and 190, respectively. For these poorly calibrated algorithms (NB, RF and SVM), probability calibration improved performance significantly. Compared with Platt and IsoReg, the RF-RPR and SVM-RPR models achieved the highest numbers of well-calibrated rounds, 395 and 391, respectively. For the NB model, NB-Platt had the highest frequency (383), followed by NB-RPR (375).
Distribution of probability estimates
Finally, we explored the distribution of all estimated probabilities. Using the fixed cut points 0.1, 0.2, …, 1, all examples were grouped according to their predictions. In each interval, we calculated the number of examples and report the median over the 500 holdout tests. Since the LR and FFNN models achieved good calibration, the results of their calibrated versions are not discussed in this section. The results are shown in Fig. 3.
For the two wellcalibrated models (LR and FFNN), the peaks clustered around the interval between 0.1 and 0.2. There was no example near the point where the predicted value was 1. Between 0.3 and 1, the numbers of examples decreased gradually as the probability increased.
For the uncalibrated NB model, the peaks were concentrated at approximately 0 and 1, with the former accounting for a larger proportion. Between 0.1 and 0.9, the count in each interval was roughly identical. For the 3 calibrated NB models, most estimated probabilities fell in the interval between 0.1 and 0.2. For the NB-Platt and NB-RPR models, no examples had predicted probabilities of approximately 0 or 0.9.
For the uncalibrated RF model, the peak was at approximately 0. Between 0 and 1, the count decreased gradually as the probability increased. For the 3 calibrated RF models, most estimated probabilities fell in the interval between 0.1 and 0.2. For the RF-Platt and RF-RPR models, no examples had predicted probabilities of approximately 0 or 1.
For the uncalibrated SVM model, the peak was at approximately 0.2. For the SVM-Platt and SVM-RPR models, most estimated probabilities fell in the interval between 0.2 and 0.3, while the peak of the SVM-IsoReg model appeared in the interval between 0.1 and 0.2. For the SVM-Platt and SVM-RPR models, no examples had predicted probabilities of approximately 1, and there were also no examples near a probability of 0 for the SVM-Platt model. In the interval between 0.3 and 1, the number of examples for the 4 models decreased steadily as the probability increased.
Discussion
We developed probability calibration-based versions of 5 traditional machine learning algorithms to predict the 3-year recurrence rate in patients with DLBCL and validated them in terms of both discrimination and calibration. Although the initial risk predictions of several algorithms had large errors, probability calibration improved their accuracy.
We used 7 variables, i.e., sex, stage, IPI, KPS, GCB, CD10 and rituximab, to predict the 3-year recurrence rate of patients with DLBCL. Most of these variables are associated with the clinical outcome of DLBCL. In almost all cancers, prognosis is highly correlated with tumor stage: the higher the stage, the more severe the disease and the more complex the treatment; thus, a poor prognosis is likely. This also holds in DLBCL [34]. The IPI is often used by clinicians to estimate a patient's prognosis and is a recognized prognostic indicator of DLBCL [34, 35]. The IPI value is between 1 and 5, and a higher value corresponds to a greater likelihood of a poor clinical outcome. DLBCL can be further classified into two categories (GCB and non-GCB) based on the expression of specific proteins. Significant differences in prognosis have been observed between these two types, with considerably inferior overall survival in non-GCB patients [36,37,38,39]. In addition, several studies have suggested that the expression of CD10 is closely associated with patient survival and has a favorable effect on clinical outcomes [40, 41]. The application of rituximab is a breakthrough in DLBCL, and current studies have shown that rituximab improves survival in almost all DLBCL subgroups [4, 42,43,44]. The KPS reflects the physical condition of a patient, with a higher score corresponding to a better condition. Although few studies have focused on the correlation between KPS and DLBCL, we speculate that performance status affects patient treatment, such as drug dosage, and thus indirectly affects prognosis.
The 5 machine learning algorithms discussed in this study are often used in classification tasks, and they all have good discrimination ability. In our research, although their discrimination performances were very similar, the differences in calibration were large. Both the LR and FFNN models were well calibrated, and their performances were not further improved by probability calibration. Their low calibration errors most likely result from the direct optimization of the log-loss of the probabilities [45]. By comparison, the NB, RF and SVM models were poorly calibrated, and the errors of their estimated probabilities were large. The NB model achieved good calibration only once in 500 evaluations. Studies have suggested that the predictions of the NB model are often pushed toward 0 or 1, since its basic assumption (i.e., that each variable affects the result independently) may not hold in reality [12, 13, 45]. In our study, the predictions of the NB model were concentrated at approximately 0 and 1, with the former accounting for a larger proportion. The RF model did not achieve good calibration even once in 500 evaluations. To increase the difference between decision trees, the RF algorithm introduces sample and attribute perturbations when constructing each tree. Several studies have suggested that it is difficult to obtain identical predictions from all trees; thus, the voting ratios of the RF are often pushed away from 0 and 1 [31, 45, 46]. However, most predictions of our RF model were concentrated at approximately 0, and the number of examples in the interval between 0.9 and 1 was not the lowest. We suggest three reasons for this difference. First, each decision tree of the RF model has good classification ability since our data are not complex; despite the diversity imposed on the trees, most of them generate the same output. Second, negative examples account for a large proportion of our data.
Third, the RF model achieves high discriminative power for these negative examples. Furthermore, the SVM model pushes its outputs away from 0 and 1, which is consistent with a previous study [45]. Our study also suggests that probability calibration is necessary for the SVM algorithm, since normalizing its scores is insufficient to obtain accurate probability estimates.
We selected 3 methods (Platt, IsoReg, and RPR) to develop probability calibration-based versions of 5 traditional machine learning algorithms. Platt is a popular parametric method that uses a sigmoid function to calibrate a classifier. If the distribution of the initial probability estimates is inconsistent with the assumed parametric form, however, Platt does not work well. In our study, the biased NB, RF and SVM models were well corrected by the Platt method. If a classifier can rank examples correctly, then the mapping from initial predictions to accurate probabilities should be non-decreasing. Based on this assumption, IsoReg uses an isotonic (i.e., non-decreasing) function to calibrate the biased predictions. Owing to this mild restriction, IsoReg has become a popular nonparametric probability calibration method with good generality. However, the NB-IsoReg, RF-IsoReg and SVM-IsoReg models in our study were still poorly calibrated: although the ECE values of these 3 models were all lower than those of the uncalibrated models, their MCEs all increased. On investigation, we found that the calibration error of IsoReg was large for examples with high predicted values. We speculate that overfitting occurred in these high-value intervals because there were insufficient positive examples in our study; when the calibration set is small, the risk of IsoReg overfitting is large. Niculescu-Mizil and Caruana [45] likewise found that IsoReg is not suitable for training sizes of less than 1000. By comparison, RPR is more powerful and flexible. Unlike Platt, RPR uses a polynomial function to calibrate a classifier and can, in theory, correct initial predictions of any distribution as the polynomial degree increases. Unlike IsoReg, the calibration function of RPR is continuous over the entire interval, so two examples with similar predicted values will not differ considerably after calibration.
In our study, RPR achieved the best correction for the RF and SVM models in terms of the ECE, MCE and BS values. For the NB model, NB-RPR was best in terms of the ECE, although its MCE was slightly higher than that of NB-Platt.
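For readers unfamiliar with the two classical methods, the following sketch shows how Platt scaling and isotonic regression can be fit on held-out classifier scores with scikit-learn. The scores and labels here are synthetic and deliberately miscalibrated; the paper's own implementations, and the RPR method of [19], may differ in detail.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Hypothetical uncalibrated scores on a held-out calibration set,
# with labels whose true event rate is score**2 (miscalibrated by design).
rng = np.random.default_rng(0)
scores = rng.uniform(0, 1, 200)
labels = (rng.uniform(0, 1, 200) < scores**2).astype(int)

# Platt scaling: fit a sigmoid (one-feature logistic regression) that
# maps raw scores to calibrated probabilities.
platt = LogisticRegression().fit(scores.reshape(-1, 1), labels)
platt_probs = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic regression: fit a non-decreasing step function instead;
# more flexible, but prone to overfitting on small calibration sets.
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, labels)
iso_probs = iso.predict(scores)
```

In scikit-learn, `CalibratedClassifierCV` with `method="sigmoid"` or `method="isotonic"` wraps the same two ideas around a full estimator with cross-validated calibration splits.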
This paper focused on calibration rather than discrimination and aimed to provide accurate membership probabilities (i.e., the 3-year recurrence rate of patients with DLBCL). In practice, the true membership probability is never known, and the empirical probability (i.e., the proportion of positive events at a given score or within a given score interval) is usually used to measure it. For a sample in which the event of interest has occurred, the true membership probability is not necessarily 100%; it may be 0.5, 0.6 or some other value, and it is precisely this probability that allowed the event to be observed. As shown in Section 3.4, some estimated probabilities fell in the middle of the [0, 1] interval even for a well-calibrated model. Moderate probabilities, such as those between 0.3 and 0.7, may be considered less confident for a classification task (assuming a classification cut-off of 0.5), since they lie near the threshold. However, these moderate predictions can be of enormous help in clinical practice if the focus is on calibration rather than discrimination. For example, predicted probabilities, including those with moderate values, can be used as the basis of patient risk stratification: patients with a predicted value below 0.3 can be regarded as low-risk individuals, those with a predicted value of 0.3 to 0.7 as medium-risk individuals, and those with a predicted value above 0.7 as high-risk individuals. Personalized treatments or interventions can then be applied to the different groups to improve the clinical outcomes of patients with distinct prognostic characteristics. With the advent of the precision medicine era, estimating membership probability has received increasing attention and has critical clinical significance [7].
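The stratification rule described above can be written down directly. This is a toy illustration using the cut-offs mentioned in the text (0.3 and 0.7), not code from the study.

```python
def stratify(prob):
    """Assign a risk group from a calibrated recurrence probability,
    using the illustrative cut-offs from the text (0.3 and 0.7)."""
    if prob < 0.3:
        return "low"
    elif prob <= 0.7:
        return "medium"
    return "high"

# Three hypothetical calibrated predictions.
groups = [stratify(p) for p in (0.12, 0.45, 0.83)]
# groups == ["low", "medium", "high"]
```

Such a rule is only meaningful when the predicted probabilities are well calibrated; applied to the raw scores of a biased model, the same cut-offs would assign patients to the wrong groups.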
Accurate risk estimates based on personalized characteristics can help improve individual risk counseling, the stratification of patients for clinical trials, and the timing of clinical intervention [7, 47]. Moreover, excluding patients who are unlikely to respond to a standard treatment can minimize their exposure to costly therapies that are unlikely to help them [7]. The risk model developed in our study achieved good performance in both discrimination and calibration and has the potential to improve the clinical outcomes of patients with DLBCL.
This research has limitations. First, the calibration performance can be further improved. Since the calibration function must remain monotonic over the entire interval of initial predicted values, the calibrated probability of an example may not change substantially; the calibration error is therefore largely driven by misclassified examples. We will collect more patient information to improve the discriminative ability of the model and thus indirectly increase the accuracy of the estimated probabilities. Second, only 5 machine learning algorithms were discussed in this study; other algorithms and their probability calibration-based versions can be explored further. Third, the data used in this study were provided by a single hospital; therefore, external validation is needed to evaluate the generalizability of the model.
Conclusions
To accurately predict the 3-year recurrence rate of patients with DLBCL, we developed probability calibration-based versions of 5 traditional machine learning algorithms. The current study showed that (i) some algorithms (i.e., the NB, RF and SVM models) cannot generate accurate risk estimates when predicting the 3-year recurrence rate of DLBCL patients, although they have good discrimination capacity. Evaluation via the ECE, MCE and BS values showed that probability calibration effectively improves the calibration performance of these algorithms. For the NB model in particular, probability calibration reduced the ECE value from 15.711 to 8.743, the MCE value from 34.350 to 21.550, and the BS value from 0.212 to 0.189. These improvements are helpful in clinical practice; for example, DLBCL patients at high risk of recurrence can be identified more accurately. (ii) Probability calibration did not further reduce the probabilistic error of the FFNN model in this research, regardless of which calibration method was used. Among the 20 models developed, the uncalibrated FFNN model performed best in terms of the ECE and BS values. This result may indicate that accurate risk estimates can be obtained directly by selecting a well-calibrated model in advance, without additional probability calibration.
Availability of data and materials
The dataset generated and analyzed during the current study is not publicly available because subsequent studies have not been completed, but it is available from the corresponding author on reasonable request.
Abbreviations
DLBCL: Diffuse Large B-cell Lymphoma
NB: Naïve Bayes
LR: Logistic Regression
RF: Random Forest
SVM: Support Vector Machine
FFNN: Feed-forward Neural Network
Platt: Platt Scaling
IsoReg: Isotonic Regression
RPR: Shape-restricted Polynomial Regression
AUC: Area Under the Receiver-operating Characteristic Curve
HL: Hosmer-Lemeshow
ECE: Expected Calibration Error
MCE: Maximum Calibration Error
BS: Brier Score
IPI: International Prognostic Index
KPS: Karnofsky Performance Status
WBC: White Blood Cell
LDH: Lactate Dehydrogenase
β_{2}-MG: β_{2}-Microglobulin
ESR: Erythrocyte Sedimentation Rate
GCB: Germinal Center B-cell-like Lymphoma
References
 1.
Pasqualucci L, Dalla-Favera R. Genetics of diffuse large B-cell lymphoma. Blood. 2018;131(21):2307–19. https://doi.org/10.1182/blood-2017-11-764332.
 2.
Nijland M, Boslooper K, Imhoff GV, et al. Relapse in stage I(E) diffuse large B-cell lymphoma. Hematol Oncol. 2017;36(2):416–21. https://doi.org/10.1002/hon.2487.
 3.
Roschewski M, Staudt LM, Wilson WH. Diffuse large B-cell lymphoma—treatment approaches in the molecular era. Nat Rev Clin Oncol. 2014;11(1):12–23. https://doi.org/10.1038/nrclinonc.2013.197.
 4.
Coiffier B, Lepage E, Brière J, Herbrecht R, Tilly H, Bouabdallah R, et al. CHOP chemotherapy plus rituximab compared with CHOP alone in elderly patients with diffuse large-B-cell lymphoma. N Engl J Med. 2002;346(4):235–42. https://doi.org/10.1056/NEJMoa011795.
 5.
Zelenetz A, Gordon L, Abramson J. NCCN clinical practice guidelines in oncology: B-cell lymphomas. Version 5. Plymouth, USA: BCELC; 2019.
 6.
Gisselbrecht C, Glass B, Mounier N, Singh Gill D, Linch DC, Trneny M, et al. Salvage regimens with autologous transplantation for relapsed large B-cell lymphoma in the rituximab era. J Clin Oncol. 2010;28(27):4184–90. https://doi.org/10.1200/JCO.2010.28.1618.
 7.
Jameson JL, Longo DL. Precision medicine — personalized, problematic, and promising. N Engl J Med. 2015;372(23):2229–34. https://doi.org/10.1056/NEJMsb1503104.
 8.
Stenberg E, Cao Y, Szabo E, Näslund E, Näslund I, Ottosson J. Risk prediction model for severe postoperative complication in bariatric surgery. Obes Surg. 2018;28(7):1869–75. https://doi.org/10.1007/s11695-017-3099-2.
 9.
Degnim AC, Winham SJ, Frank RD, Pankratz VS, Dupont WD, Vierkant RA, et al. Model for predicting breast cancer risk in women with atypical hyperplasia. J Clin Oncol. 2018;36(18):1840–6. https://doi.org/10.1200/JCO.2017.75.9480.
 10.
Alba AC, Agoritsas T, Walsh M, Hanna S, Iorio A, Devereaux PJ, et al. Discrimination and calibration of clinical prediction models: users’ guides to the medical literature. JAMA. 2017;318(14):1377–84. https://doi.org/10.1001/jama.2017.12126.
 11.
Platt J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classifiers. 1999;10(3):61–74.
 12.
Zadrozny B, Elkan C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. ICML. 2001;1:609–16.
 13.
Zadrozny B, Elkan C. Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the Eighth Acm Sigkdd International Conference on Knowledge Discovery and Data Mining; 2002. p. 694–9.
 14.
Jiang X, Osl M, Kim J, et al. Smooth isotonic regression: A new method to calibrate predictive models. In: AMIA Summits on Translational Science Proceedings, vol. 2011; 2011. p. 16.
 15.
Fritsch FN, Carlson RE. Monotone piecewise cubic interpolation. SIAM J Numer Anal. 1980;17(2):238–46. https://doi.org/10.1137/0717021.
 16.
Naeini MP, Cooper G, Hauskrecht M. Obtaining well calibrated probabilities using bayesian binning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29; 2015. p. 2901–7.
 17.
Naeini MP, Cooper G, Hauskrecht M. Binary classifier calibration using a Bayesian nonparametric approach. In: Proceedings of the 2015 SIAM International Conference on Data Mining; 2015. p. 208–16.
 18.
Schwarz J, Heider D. GUESS: projecting machine learning scores to well-calibrated probability estimates for clinical decision-making. Bioinformatics. 2019;35(14):2458–65. https://doi.org/10.1093/bioinformatics/bty984.
 19.
Wang Y, Li L, Dang C. Calibrating classification probabilities with shape-restricted polynomial regression. IEEE Trans Pattern Anal Mach Intell. 2019;41(8):1813–27. https://doi.org/10.1109/TPAMI.2019.2895794.
 20.
Neumann U, Riemenschneider M, Sowa JP, Baars T, Kälsch J, Canbay A, et al. Compensation of feature selection biases accompanied with improved predictive performance for binary classification by using a novel ensemble feature selection approach. BioData Mining. 2016;9(1):36. https://doi.org/10.1186/s13040-016-0114-4.
 21.
James G, Witten D, Hastie T, et al. Tree-Based Methods. In: An introduction to statistical learning with applications in R. Berlin: Springer; 2013. p. 303–32.
 22.
Zhou Z. Naive Bayes Classifier. In: Machine Learning. Beijing: Tsinghua University Press; 2016. p. 150–4.
 23.
McCulloch CE, Searle SR. Generalized Linear Models (GLMs). In: Generalized, Linear, and Mixed Models. USA: Wiley; 2008. p. 135–56.
 24.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32. https://doi.org/10.1023/A:1010933404324.
 25.
James G, Witten D, Hastie T, et al. Support Vector Machines. In: An introduction to statistical learning. Berlin: Springer; 2013. p. 337–68.
 26.
Kohonen T. An introduction to neural computing. Neural Netw. 1988;1(1):3–16. https://doi.org/10.1016/0893-6080(88)90020-2.
 27.
Weigend A. On overfitting and the effective number of hidden units. Proc Connect Models Summer School. 1993;1:335–42.
 28.
Caruana R, Lawrence S, Giles CL. Overfitting in neural nets: backpropagation, conjugate gradient, and early stopping. Neural Inf Process Syst. 2000:402–8.
 29.
Lawrence S, Giles CL, Tsoi AC. Lessons in neural network training: overfitting may be harder than expected. In: National Conference On Artificial Intelligence; 1997. p. 540–5.
 30.
Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989;2(5):359–66. https://doi.org/10.1016/0893-6080(89)90020-8.
 31.
Boström H. Calibrating random forests. In: 2008 Seventh International Conference on Machine Learning and Applications; 2008. p. 121–6.
 32.
Ayer M, Brunk HD, Ewing GM, Reid WT, Silverman E. An empirical distribution function for sampling with incomplete information. Ann Math Stat. 1955;26(4):641–7. https://doi.org/10.1214/aoms/1177728423.
 33.
Hosmer DW, Hosmer T, Le Cessie S, et al. A comparison of goodness-of-fit tests for the logistic regression model. Stat Med. 1997;16(9):965–80. https://doi.org/10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-O.
 34.
Zhang A, Ohshima K, Sato K, et al. Prognostic clinicopathologic factors, including immunologic expression in diffuse large B-cell lymphomas. Pathol Int. 2010;49(12):1043–52.
 35.
Chinese Society of Hematology. Guidelines for the diagnosis and treatment of diffuse large B-cell lymphoma in China (2013 edition). Chin J Hematol. 2013;34(9):816–9.
 36.
Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403(6769):503–11. https://doi.org/10.1038/35000501.
 37.
Nedomova R, Papajik T, Prochazka V, Indrak K, Jarosova M. Cytogenetics and molecular cytogenetics in diffuse large B-cell lymphoma (DLBCL). Biomed Papers Med Faculty Palacky Univ Olomouc. 2013;157(3):239–47. https://doi.org/10.5507/bp.2012.085.
 38.
Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002;346(25):1937–47. https://doi.org/10.1056/NEJMoa012914.
 39.
Bea S, Zettl A, Wright G, Salaverria I, Jehn P, Moreno V, et al. Diffuse large B-cell lymphoma subgroups have distinct genetic profiles that influence tumor biology and improve gene-expression-based survival prediction. Blood. 2005;106(9):3183–90. https://doi.org/10.1182/blood-2005-04-1399.
 40.
Ohshima K, Kawasaki C, Muta H, Muta K, Deyev V, Haraoka S, et al. CD10 and Bcl10 expression in diffuse large B-cell lymphoma: CD10 is a marker of improved prognosis. Histopathology. 2001;39(2):156–62. https://doi.org/10.1046/j.1365-2559.2001.01196.x.
 41.
Bai M, Agnantis N, Skyrlas A, et al. Increased expression of the bcl6 and CD10 proteins is associated with increased apoptosis and proliferation in diffuse large B-cell lymphomas. Mod Pathol. 2003;16(5):471–80. https://doi.org/10.1097/01.MP.0000067684.78221.6E.
 42.
Fu K, Weisenburger DD, Choi WWL, Perry KD, Smith LM, Shi X, et al. Addition of rituximab to standard chemotherapy improves the survival of both the germinal center B-cell-like and non-germinal center B-cell-like subtypes of diffuse large B-cell lymphoma. J Clin Oncol. 2008;26(28):4587–94. https://doi.org/10.1200/JCO.2007.15.9277.
 43.
Coiffier B, Thieblemont C, Van Den Neste E, et al. Long-term outcome of patients in the LNH-98.5 trial, the first randomized study comparing rituximab-CHOP to standard CHOP chemotherapy in DLBCL patients: a study by the Groupe d'Etudes des Lymphomes de l'Adulte. Blood. 2010;116(12):2040–5. https://doi.org/10.1182/blood-2010-03-276246.
 44.
Pfreundschuh M, Trümper L, Osterborg A, Pettengell R, Trneny M, Imrie K, et al. CHOP-like chemotherapy plus rituximab versus CHOP-like chemotherapy alone in young patients with good-prognosis diffuse large-B-cell lymphoma: a randomised controlled trial by the MabThera International Trial (MInT) group. Lancet Oncol. 2006;7(5):379–91. https://doi.org/10.1016/S1470-2045(06)70664-7.
 45.
Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. In: Proceedings of the 22nd International Conference on Machine Learning. Bonn: Association for Computing Machinery; 2005. p. 625–32.
 46.
Boström H. Estimating class probabilities in random forests. In: International Conference on Machine Learning and Applications; 2007. p. 211–6.
 47.
Westeneng HJ, Debray TPA, Visser AE, van Eijk RPA, Rooney JPK, Calvo A, et al. Prognosis for patients with amyotrophic lateral sclerosis: development and validation of a personalised prediction model. Lancet Neurol. 2018;17(5):423–33. https://doi.org/10.1016/S1474-4422(18)30089-9.
Acknowledgements
Not applicable.
Funding
This work was supported by the National Natural Science Foundation of China [Grant Numbers 81502897 and 81973154] and the PhD Fund of Shanxi Medical University [Grant Number BS2017029]. The funders, Hongmei Yu and Yanhong Luo, provided many valuable suggestions on the design, analysis and writing of the study.
Author information
Affiliations
Contributions
Shuanglong Fan analyzed and interpreted the data and drafted the manuscript. Zhiqiang Zhao, Yanbo Zhang and Hongmei Yu were responsible for preprocessing the data and checking the results. Chuchu Zheng, Xueqian Huang, Zhenhuan Yang and Meng Xing participated in the collection of the data. Qing Lu and Yanhong Luo provided the methods and reviewed the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
This study was approved by the Shanxi Tumor Hospital Ethics Committee (reference number 201835). All participants were informed about the study and agreed to take part; informed oral consent was obtained from each participant. The entire research process was approved by the ethics committee, and all methods were carried out in accordance with relevant ethical guidelines and regulations.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Fan, S., Zhao, Z., Zhang, Y. et al. Probability calibration-based prediction of recurrence rate in patients with diffuse large B-cell lymphoma. BioData Mining 14, 38 (2021). https://doi.org/10.1186/s13040-021-00272-9
DOI: https://doi.org/10.1186/s13040-021-00272-9
Keywords
 DLBCL
 Risk prediction
 Probability calibration
 Discrimination and calibration