### Data description

The study uses simulated breast cancer datasets, each with 693 observations and varying percentages of missing data confined to the independent variables. The variables used in this article were extracted from several previous breast cancer-related studies [10, 13, 16, 17]. The dependent (outcome) variable, ‘cancer recurrence’, has two response categories, ‘yes’ and ‘no’: ‘yes’ means the breast cancer came back after the recommended therapy, while ‘no’ indicates that it did not. The independent (predictor) variables are age, heart rate, respiratory rate, body mass index, and body surface area.

### Classification techniques

Classification techniques, also known as classifiers, are used to predict a categorical response for a case by assigning that case to a certain category. In this article, binary logistic regression and linear and quadratic discriminant classifiers [18, 19] were used to predict group membership for breast cancer cases, classifying each case as either ‘recurrence’ or ‘non-recurrence’ based on patients’ demographic and clinical factors: age, body mass index, body surface area, heart rate, and respiratory rate.

#### Binary logistic regression

The binary logistic regression (BLR) model considers the probability that a response variable *Y* belongs to a particular group [18, 20]. In this article, the BLR model was used to predict and classify breast cancer cases as either recurrence or non-recurrence events given the predictors. The model uses a *logit* or logistic function (a sigmoid function applied in binary classification). A sigmoid function takes the predictors’ data (real numbers) and maps them to a probability value (*ρ*_{i}) between 0 and 1.

$$logit\left({\rho}_i\right)=\log \left(\frac{\rho_i}{1-{\rho}_i}\right)={\beta}_0+{\beta}_1{X}_{1i}+{\beta}_2{X}_{2i}+{\beta}_3{X}_{3i}+{\beta}_4{X}_{4i}+{\beta}_5{X}_{5i}$$

for *i* = 1, 2, 3, …, 693, where *β*_{0} is the constant term of the model, *β*_{j} for *j* = 1, 2, …, 5 are the model parameters, *ρ*_{i} is the probability that the *i*^{th} patient has cancer recurrence, and *X*_{j} for *j* = 1, 2, …, 5 are the predictors in the model (age, heart rate, respiratory rate, body mass index, and body surface area).

The patients’ probabilities (*ρ*_{i}) of having breast cancer recurrence are estimated by the fitted logistic regression model. The *ρ*_{i} are then used to classify patients into one of the two categories of the breast cancer response variable (recurrence or non-recurrence) using a cut-off point of 0.5: a value greater than 0.5 is classified as a ‘recurrence’ event, and otherwise as a ‘non-recurrence’ event, for patient *i* (*i* = 1, 2, …, 693). Mathematically, the predicted probability for any patient *i* is given by *ρ*_{i}(*x*):

$${\rho}_i(x)=\frac{\exp \left({\beta}_0+{\beta}_1{X}_{1i}+{\beta}_2{X}_{2i}+{\beta}_3{X}_{3i}+{\beta}_4{X}_{4i}+{\beta}_5{X}_{5i}\right)}{1+\exp \left({\beta}_0+{\beta}_1{X}_{1i}+{\beta}_2{X}_{2i}+{\beta}_3{X}_{3i}+{\beta}_4{X}_{4i}+{\beta}_5{X}_{5i}\right)}$$
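As an illustrative sketch only (the study itself used R, and the coefficients below are arbitrary placeholders rather than fitted estimates), the prediction-and-classification rule above can be written as:

```python
import numpy as np

def predict_recurrence(X, beta0, beta):
    """Classify patients as recurrence (1) or non-recurrence (0)
    from a fitted logistic model, using the 0.5 cut-off.
    X: (n, 5) array of predictors; beta: (5,) coefficient vector."""
    eta = beta0 + X @ beta               # linear predictor (logit scale)
    rho = 1.0 / (1.0 + np.exp(-eta))     # sigmoid maps to (0, 1)
    return (rho > 0.5).astype(int), rho

# Hypothetical coefficients, for illustration only
X = np.array([[51.0, 97.9, 20.8, 27.7, 1.7]])
labels, rho = predict_recurrence(X, beta0=-2.0,
                                 beta=np.array([0.02, 0.0, 0.0, 0.0, 0.0]))
```

Note that a case with *ρ*_{i} exactly 0.5 falls on the ‘non-recurrence’ side of the strict inequality used here.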

#### Linear discriminant analysis

Linear discriminant analysis (LDA) is used to separate different sets of cases and to allocate new cases to previously defined groups [21]. It assumes that the cases in each group are drawn from a multivariate normal distribution with a group-specific mean vector (*μ*_{k}) and covariance matrix (Σ) [19]. As a classifier, LDA allocates a case **x** to group *k* provided that **x** falls inside the group region *k* [22]. In this study we applied LDA to discriminate the cases with breast cancer recurrence from those without recurrence using the discriminant function *δ*_{k}(*x*). An equal covariance matrix Σ was assumed across the *k* = 2 groups, which makes the discriminant function linear in *x*. Bayes’ theorem with group prior probabilities (*π*_{k}) is used to allocate **x** to the group *k* for which the function

\({\delta}_k\left(\boldsymbol{x}\right)={\boldsymbol{x}}^T{\Sigma}^{-1}{\mu}_k-\frac{1}{2}{\mu}_k^T{\Sigma}^{-1}{\mu}_k+\log {\pi}_k\) is largest [19].
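To make the allocation rule concrete, here is a minimal sketch (in Python rather than the R used in the study; the means, shared covariance, and priors are made-up values) of assigning a case to the group with the largest linear discriminant score:

```python
import numpy as np

def lda_assign(x, means, cov, priors):
    """Return the index k maximising
    delta_k(x) = x^T Sigma^-1 mu_k - 0.5 mu_k^T Sigma^-1 mu_k + log pi_k."""
    cov_inv = np.linalg.inv(cov)
    scores = [x @ cov_inv @ mu - 0.5 * mu @ cov_inv @ mu + np.log(pi)
              for mu, pi in zip(means, priors)]
    return int(np.argmax(scores))

# Made-up two-group example with a shared identity covariance
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
cov = np.eye(2)
priors = [0.5, 0.5]
```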

#### Quadratic discriminant analysis (QDA)

As a classifier, QDA is like LDA in that it assumes multivariate normality for the observations within each group; however, it does not assume a common covariance matrix, i.e., each group is assumed to have its own covariance matrix Σ_{k}. Bayes’ theorem is utilized to make predictions and allocate cases into their respective groups [21, 23]. Under this assumption, a case **x** is assigned to the group *k* for which the function

\({\delta}_k\left(\boldsymbol{x}\right)=-\frac{1}{2}{\left(x-{\mu}_k\right)}^T{\Sigma}_k^{-1}\left(x-{\mu}_k\right)-\frac{1}{2}\log \mid {\Sigma}_k\mid +\log {\pi}_k\) is largest [19].

Since LDA and QDA require training and validation datasets, in this article we used 70 and 30% of the sample in each dataset as training and validation data, respectively. LDA is likely to perform better than QDA when the training cases are relatively few and reducing variance is therefore vital. QDA tends to work better when the number of training cases is very large, so that the classifier’s variance is no longer a concern; it is also the recommended classifier when a common covariance matrix among the *k* groups is clearly untenable [19].
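The 70/30 split together with the QDA allocation rule can be sketched end-to-end. This is an illustrative Python mock-up on synthetic two-group data, not the study's R code; group sizes, means, and covariances are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-group data standing in for the breast cancer sample
n = 300
X0 = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], n)
X1 = rng.multivariate_normal([3, 3], [[2, 0.5], [0.5, 2]], n)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)].astype(int)

# 70/30 train/validation split
idx = rng.permutation(len(y))
cut = int(0.7 * len(y))
tr, va = idx[:cut], idx[cut:]

def qda_fit(X, y):
    """Per-group mean, covariance, and prior (no pooling, unlike LDA)."""
    params = []
    for k in (0, 1):
        Xk = X[y == k]
        params.append((Xk.mean(0), np.cov(Xk.T), len(Xk) / len(X)))
    return params

def qda_predict(X, params):
    """Assign each row to the group with the largest quadratic score
    delta_k(x) = -0.5 (x-mu)^T S_k^-1 (x-mu) - 0.5 log|S_k| + log pi_k."""
    scores = []
    for mu, S, pi in params:
        Sinv = np.linalg.inv(S)
        d = X - mu
        scores.append(-0.5 * np.einsum('ij,jk,ik->i', d, Sinv, d)
                      - 0.5 * np.log(np.linalg.det(S)) + np.log(pi))
    return np.argmax(np.stack(scores), axis=0)

params = qda_fit(X[tr], y[tr])
acc = (qda_predict(X[va], params) == y[va]).mean()
```

On well-separated groups like these, validation accuracy should be high; with few training cases the per-group covariance estimates would be noisy, which is the variance argument for preferring LDA.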

### Simulation settings

The simulation experiment was conducted to compare the performance of imputation methods and classifiers in the presence of imputed missing values across different simulation conditions. The simulation was based on a real breast cancer dataset of size 693 containing both observed and missing values for the variables *X*_{1}, *X*_{2}, …, *X*_{6}, denoting, respectively, age, heart rate, respiratory rate, body mass index, body surface area, and recurrence of breast cancer. Data were sampled with a fixed mean vector and covariance matrix for all variables while varying the percentage of missing data (*i* = 1 : 4), the missingness mechanism (*j* = 1 : 2), and the imputation method (*k* = 1 : 6), producing a total of 4 × 2 × 6 = 48 simulation conditions. The experimental design was built in four steps: (1) generating a complete dataset, (2) amputation (creating incomplete datasets from the complete one), (3) imputation (filling in missing data values), and (4) evaluating the performance of each imputation technique.

In step (1) we generated a complete dataset of *N* = 693 observations from a multivariate normal distribution [24, 25] with the vector of means (*μ*) and positive definite covariance matrix (Σ) given below; this was achieved with the ‘*mvrnorm*’ function in the ‘*MASS*’ package [26] of the R statistical program.

$$\mu =\begin{bmatrix}51.00\\ 97.89\\ 20.84\\ 27.71\\ 1.69\\ 0.31\end{bmatrix}$$

$$\Sigma =\begin{bmatrix}169.3 & 1.8 & 3.0 & 2.0 & -0.2 & 0.1\\ 1.8 & 486.7 & 9.7 & -4.7 & -0.2 & 1.1\\ 3.0 & 9.7 & 37.5 & -0.5 & -0.01 & 0.2\\ 2.0 & -4.7 & -0.5 & 43.6 & 0.9 & -0.2\\ -0.2 & -0.2 & -0.01 & 0.9 & 0.1 & 0.01\\ 0.1 & 1.1 & 0.2 & -0.2 & 0.01 & 0.2\end{bmatrix}$$
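Step (1) can be mimicked outside R; the study used `mvrnorm` from MASS, but an equivalent sketch with NumPy, plugging in the *μ* and Σ above (the seed is an arbitrary choice), is:

```python
import numpy as np

mu = np.array([51.00, 97.89, 20.84, 27.71, 1.69, 0.31])
Sigma = np.array([
    [169.3,   1.8,   3.0,   2.0,  -0.2,   0.1],
    [  1.8, 486.7,   9.7,  -4.7,  -0.2,   1.1],
    [  3.0,   9.7,  37.5,  -0.5,  -0.01,  0.2],
    [  2.0,  -4.7,  -0.5,  43.6,   0.9,  -0.2],
    [ -0.2,  -0.2,  -0.01,  0.9,   0.1,   0.01],
    [  0.1,   1.1,   0.2,  -0.2,   0.01,  0.2],
])

rng = np.random.default_rng(2024)        # arbitrary seed
complete_data = rng.multivariate_normal(mu, Sigma, size=693)
```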

Step (2) performed the ‘amputation’, i.e., generating missing values from the complete dataset created in step (1). This involved creating incomplete datasets with varying percentages of missing data (15, 30, 45, and 60%) and two missing data mechanisms (MAR and MCAR). The R function ‘*ampute*’ [27] was applied for this purpose. The third step was to impute the resulting incomplete datasets using the following distinct imputation techniques:
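The study used the more general `ampute` routine in R; a minimal MCAR-only sketch of the amputation step (the column choice, proportion, and seed below are arbitrary) looks like:

```python
import numpy as np

def ampute_mcar(data, prop, cols, rng):
    """Return a copy of `data` with roughly `prop` of the entries in the
    given predictor columns set to NaN, missing completely at random."""
    out = data.astype(float).copy()
    for c in cols:
        mask = rng.random(len(out)) < prop   # each cell deleted independently
        out[mask, c] = np.nan
    return out

rng = np.random.default_rng(1)
complete = rng.normal(size=(693, 5))
incomplete = ampute_mcar(complete, 0.30, cols=range(5), rng=rng)
```

MAR amputation would instead make the deletion probability depend on the observed values of other variables, which `ampute` supports via weighted missingness patterns.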

#### Mean imputation

The idea behind this approach is to replace missing values with the variable’s average score computed from the observed data [3, 13, 28]. Theoretically, mean imputation is most appropriate when the amount of missing data is small and the sample size is large [7]. Mean imputation, also known as ‘series mean’ (SMEAN), is calculated as \(\sum_{i=1}^{n}{x}_{i}/n\), where *x*_{i} is a numerical variable and *n* is the number of patients with observed values of the covariate (*i* = 1, 2, …, *n*).
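A one-variable sketch of series-mean imputation, assuming missing values are encoded as NaN in a NumPy array:

```python
import numpy as np

def mean_impute(x):
    """Replace NaNs in a 1-D numeric array with the mean of the
    observed values (the 'series mean')."""
    x = x.astype(float).copy()
    x[np.isnan(x)] = np.nanmean(x)   # mean over observed entries only
    return x
```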

#### Hot deck imputation

This technique replaces each missing value in a variable with an observed value, in the same variable, from a patient with an ‘identical response’ or similar data [28]. The patient with missing data (the non-respondent) is known as the ‘recipient’, while the respondent with observed data is called the ‘donor’ [7, 29]. This method works better with MCAR or MAR data mechanisms [7]. Consider the values *x*_{i} = (*x*_{i1}, …, *x*_{ip}) of *p* covariates for subject *i*. For matching a recipient *i* with a donor *j*, the proximity of potential candidate donors to recipients is defined by the maximum deviation *D*_{(i, j)} = max_{k}|*x*_{ik} − *x*_{jk}|, for suitably scaled *x*_{k} so that the differences are comparable [29]. In this study, hot deck imputation was implemented using the function ‘hotdeck’ from the ‘VIM’ (Visualization and Imputation of Missing Values) package [30] in R statistical software (version 3.6.3).
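A simplified sketch of donor matching under the maximum-deviation distance (this is not VIM's `hotdeck` itself; it assumes missingness in a single column and complete covariates elsewhere, which are standardised so deviations are comparable):

```python
import numpy as np

def hotdeck_impute(X, target_col):
    """Fill missing values in `target_col` with the observed value from
    the donor whose other (scaled) covariates are closest under
    D(i, j) = max_k |x_ik - x_jk|."""
    X = X.astype(float).copy()
    others = [c for c in range(X.shape[1]) if c != target_col]
    # standardise covariates so the max-deviation comparison is fair
    Z = (X[:, others] - np.nanmean(X[:, others], 0)) / np.nanstd(X[:, others], 0)
    donors = np.where(~np.isnan(X[:, target_col]))[0]
    for i in np.where(np.isnan(X[:, target_col]))[0]:
        D = np.max(np.abs(Z[donors] - Z[i]), axis=1)
        X[i, target_col] = X[donors[np.argmin(D)], target_col]
    return X
```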

#### K-nearest neighbour (KNN)

A non-parametric approach that imputes missing data by averaging neighbouring observed data [14]. The approach is donor-based, with imputed values taken either from a single record in the dataset (1-NN) or as an average over *k* records (*k*-NN) [31]. The distance between two observations used to define the nearest neighbours is \({D}_{ij}=\frac{\sum_{k=1}^P{w}_k{\tau}_{i,j,k}}{\sum_{k=1}^P{w}_k}\), where *w*_{k} is the weight and *τ*_{i, j, k} is the contribution of the *k*^{th} variable. For continuous variables, *τ*_{i, j, k} is the ratio of the absolute distance to the range: \({\tau}_{i,j,k}=\frac{\mid {x}_{i,k}-{x}_{j,k}\mid }{r_k}\), where *x*_{i, k} is the value of the *k*^{th} variable for the *i*^{th} observation and *r*_{k} is the range of the *k*^{th} variable [30]. In this study, KNN imputation was performed using the function ‘kNN’ from the ‘VIM’ (Visualization and Imputation of Missing Values) package [30] in R statistical software (version 3.6.3).
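A small sketch of the *k*-NN rule with the range-scaled distance above, using equal weights *w*_{k} = 1 and assuming missingness in a single column (the study itself used VIM's `kNN` in R):

```python
import numpy as np

def knn_impute(X, target_col, k=2):
    """Impute NaNs in `target_col` as the mean of the k nearest donors,
    where the distance over the other variables averages the
    range-scaled absolute differences tau with equal weights."""
    X = X.astype(float).copy()
    others = [c for c in range(X.shape[1]) if c != target_col]
    span = np.nanmax(X[:, others], 0) - np.nanmin(X[:, others], 0)
    donors = np.where(~np.isnan(X[:, target_col]))[0]
    for i in np.where(np.isnan(X[:, target_col]))[0]:
        tau = np.abs(X[donors][:, others] - X[i, others]) / span
        D = tau.mean(axis=1)                 # equal weights w_k = 1
        nearest = donors[np.argsort(D)[:k]]
        X[i, target_col] = X[nearest, target_col].mean()
    return X
```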

#### Multiple imputation by chained equations (MICE)

MICE replaces each missing value with a set of *P* acceptable values [32]. The method works with MCAR or MAR missing data mechanisms. The technique helps to remove the potential selection bias that would result if observations with missing values were deleted from the dataset; the chance of obtaining biased standard errors is also reduced [7, 8]. The multiple imputation procedure involves three steps: ‘imputing the data *m* times, analysing the *m* imputed datasets, and pooling the results’ [6]. We used R statistical software to perform these three steps in MICE, storing the results from each step in three special classes: *mids* (multiply imputed datasets), *mira* (multiply imputed repeated analyses), and *mipo* (multiply imputed pooled results) [33].
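A deterministic, single-chain sketch of the chained-equations idea (real MICE draws imputations from predictive distributions and repeats the whole process *m* times; this simplification only iterates regression fills until they stabilise):

```python
import numpy as np

def chained_imputation(X, n_iter=10):
    """Start from mean imputation, then repeatedly regress each
    originally-incomplete column on all the others and refill its
    missing cells with the regression predictions."""
    X = X.astype(float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, 0), X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            A = np.c_[np.ones(len(X)), np.delete(filled, j, 1)]
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            filled[miss[:, j], j] = A[miss[:, j]] @ beta
    return filled
```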

#### Predictive mean matching (PMM)

PMM combines parametric and non-parametric approaches to impute missing data. The parametric stage establishes a predictive mean value for each observation in the data; these predictive means are then used to match complete and incomplete observations during the imputation process. The non-parametric stage applies the nearest-neighbour donor method, imputing each missing value with the observed value from the non-missing observation whose predictive mean is closest to that of the missing one [34, 35]. The ‘mice’ function and package in R statistical software [33] were used to perform the PMM imputation five times, storing the results from the five completed datasets and combining the results from the five analysed datasets.
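A minimal sketch of the PMM idea for one incomplete variable *y* with complete predictors *X*. Real PMM, as implemented in `mice`, draws randomly among several closest donors and adds parameter uncertainty; here the single closest donor is used:

```python
import numpy as np

def pmm_impute(y, X):
    """Fit a linear model on the observed rows, compute predictive
    means for all rows, and impute each missing y with the *observed*
    y of the donor whose predictive mean is closest."""
    y = y.astype(float).copy()
    miss = np.isnan(y)
    A = np.c_[np.ones(len(y)), X]
    beta, *_ = np.linalg.lstsq(A[~miss], y[~miss], rcond=None)
    yhat = A @ beta                      # predictive means (parametric stage)
    donors = np.where(~miss)[0]
    for i in np.where(miss)[0]:
        j = donors[np.argmin(np.abs(yhat[donors] - yhat[i]))]
        y[i] = y[j]   # donor's real observed value, not the model prediction
    return y
```

Because imputed values are always real observed values, PMM cannot produce implausible quantities (e.g. a negative body mass index) the way a raw regression prediction could.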

#### Expected maximization via bootstrapping (EMB)

The EMB is a bootstrap-based algorithm used to impute missing data multiple times. It uses the existing sample of size *n* to draw *M* new samples of size *n* with replacement [36]. Since EMB is a multiple imputation-based procedure, it performs better under the MAR assumption. It is based on the E-M algorithm, summarized as follows. First, the EM (Expectation–Maximisation) algorithm assumes a particular distribution and proposes starting values for the mean *μ* and covariance matrix Σ, which are used to calculate the expected value of the model’s likelihood. This likelihood is then maximised, and the model parameters are estimated and updated. The expectation and maximisation steps are repeated until the values converge [36, 37]. The implementation of EMB in the Amelia II program starts by bootstrapping the incomplete dataset to generate several bootstrapped datasets; the E-M process then turns these into imputed datasets, which are analysed separately by standard statistical methods, and the results are combined into a single final result.
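To illustrate the two ingredients (bootstrap plus EM), here is a deliberately simplified bivariate-normal sketch with *x* fully observed and *y* partly missing; Amelia II's actual implementation handles the general multivariate case, and all data below are synthetic:

```python
import numpy as np

def em_bivariate(x, y, n_iter=50):
    """EM for a bivariate normal: the E-step fills in E[y|x] and
    E[y^2|x] under the current (mu, Sigma); the M-step re-estimates
    mu and Sigma from the completed sufficient statistics."""
    miss = np.isnan(y)
    mu = np.array([x.mean(), np.nanmean(y)])
    Sxx, Syy, Sxy = x.var(), np.nanvar(y), 0.0
    for _ in range(n_iter):
        b = Sxy / Sxx                                   # regression slope
        Ey = np.where(miss, mu[1] + b * (x - mu[0]), y)
        Ey2 = np.where(miss, Ey**2 + (Syy - b * Sxy), y**2)
        mu = np.array([x.mean(), Ey.mean()])
        Sxy = (x * Ey).mean() - mu[0] * mu[1]
        Syy = Ey2.mean() - mu[1] ** 2
    return mu, np.array([[Sxx, Sxy], [Sxy, Syy]])

def emb_impute(x, y, M=5, seed=0):
    """EMB sketch: bootstrap the incomplete data M times, run EM on
    each bootstrap sample, and use each fitted model to impute the
    original missing y's, giving M completed datasets."""
    rng = np.random.default_rng(seed)
    completed = []
    for _ in range(M):
        idx = rng.integers(0, len(x), len(x))           # resample with replacement
        mu, S = em_bivariate(x[idx], y[idx])
        b = S[0, 1] / S[0, 0]
        completed.append(np.where(np.isnan(y), mu[1] + b * (x - mu[0]), y))
    return completed

# Synthetic demo: y is roughly 2x, with a quarter of the y's removed
rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 2.0 * x + 0.1 * rng.normal(size=200)
y[::4] = np.nan
completed = emb_impute(x, y, M=5, seed=0)
```

The bootstrap step is what gives the *M* imputed datasets different parameter draws, standing in for the parameter uncertainty that other multiple-imputation schemes capture with posterior draws.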

### Evaluation of imputation methods

The applied imputation techniques were evaluated by comparing the Root Mean Squared Errors (RMSE) obtained from each imputation technique under the various simulation conditions. For each imputation method, we also report the classification accuracy of the three classification methods via the estimated area under the Receiver Operating Characteristic (ROC) curve from logistic regression, using the imputed datasets at 15, 30, 45, and 60% missing data under the MAR and MCAR mechanisms.
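The RMSE criterion is straightforward to compute in a simulation, since the pre-amputation values are known; a sketch restricted to the cells that were made missing:

```python
import numpy as np

def imputation_rmse(true_vals, imputed_vals, miss_mask):
    """RMSE between original (pre-amputation) and imputed values,
    computed only over the cells that were amputed."""
    diff = true_vals[miss_mask] - imputed_vals[miss_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy check: one amputed cell imputed with error 2 -> RMSE 2.0
truth = np.array([1.0, 2.0, 3.0, 4.0])
imputed = np.array([1.0, 2.0, 5.0, 4.0])
mask = np.array([False, False, True, False])
rmse = imputation_rmse(truth, imputed, mask)
```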