Electronic medical records imputation by temporal Generative Adversarial Network
BioData Mining volume 17, Article number: 19 (2024)
Abstract
The loss of electronic medical records has seriously affected the practical application of biomedical data. Therefore, effectively filling in these lost data is a meaningful research effort. Currently, state-of-the-art methods focus on using Generative Adversarial Networks (GANs) to fill the missing values of electronic medical records, achieving breakthrough progress. However, when facing datasets with high missing rates, the imputation accuracy of these methods sharply decreases. This motivates us to explore the uncertainty of GANs and improve GAN-based imputation methods. In this paper, the GRU-D (Gate Recurrent Unit Decay) network and the UGAN (Uncertainty Generative Adversarial Network) are proposed and organically combined into a model called UGAN-GRUD. UGAN-GRUD uses the GAN to generate imputation values and then leverages GRU-D to compensate them. The former is employed to iteratively learn the distribution pattern and uncertainty of the data through the Generator and Discriminator. The latter compensates the former by leveraging a time decay factor, which can learn the specific temporal relations in electronic medical records. Experiments on publicly available biomedical datasets show that UGAN-GRUD outperforms the current state-of-the-art methods, with average improvements of 13% in RMSE (Root Mean Squared Error) and 24.5% in MAPE (Mean Absolute Percentage Error).
Introduction
Electronic medical records are often lost due to equipment failures, data transmission interruptions, and other reasons [1]. As a result, the final collections of electronic medical records are often sparse and irregular. To fill in the lost values, most state-of-the-art methods currently employ Generative Adversarial Networks (GANs) [2], which can learn the distribution of the original dataset and generate imputation values. However, when the missing rate of the dataset is high, there is a significant deviation between the data distribution learned by GANs and the actual data distribution, which leads to a sharp decrease in imputation accuracy. Figure 1 shows the missing situation of Healthcare [3], a publicly available electronic medical dataset.
In Fig. 1, there are a total of 42 physiological attributes and 196,000 records. In Fig. 1a, BUN, Bilirubin, Cholesterol, Creatinine, DiasABP, FiO2, GCS, Glucose, HCO3, HCT, HR, K, Lactate, and MAP are physiological attributes, and the ordinate represents the serial number of records in the dataset. Obviously, the values of Bilirubin, Cholesterol, and HCO3 are the most severely lost. For this high-missing-rate dataset, the imputation errors of the state-of-the-art GAN-based methods are quite high, as shown in Fig. 1b, where the abscissa is the missing rate and the ordinate is the error. Our method achieves good performance in Fig. 1c, with an average improvement of 13.0% in RMSE and 24.5% in MAPE.
From Fig. 1, it is apparent that the Healthcare dataset has massive missing values. If the "deletion" method [4] is employed when using and storing this dataset, almost all the records in the dataset will be deleted. If the "mean" or "zero-value" imputation method [5] is utilized, the filled dataset will differ significantly from the original dataset. If the time series relation is exploited for imputation, it is impossible to establish an effective time-series-based prediction model due to the massive missing values. If GANs are utilized to learn the distribution of the original data, although the imputation accuracy is somewhat improved, it still does not reach the level required for practical application.
Given these practical issues, the imputation of data with high missing rates based on GANs has been increasingly studied in recent years, where the multivariate time series GAN is a research hotspot. Early work attempted to employ GANs to learn the distribution patterns of multivariate electronic medical records [6, 7]. More recently, methods combining multivariate time series data mining and GANs for missing value imputation have emerged. For example, Miao et al. [8] explored time series classification and the GAN model, and proposed a semi-supervised GAN imputation approach. Cao et al. [9] investigated Recurrent Neural Networks (RNNs) and proposed a time series data imputation method based on bidirectional RNNs. Wang et al. introduced the attention mechanism [6] and proposed the STA-GAN model [10]. Building on these, Benchekroun et al. [11] studied the characteristics of heart rate variability physiological data with high missing rates, and applied several missing value imputation methods to fill these data.
Although proven to be effective, these methods have not considered the uncertainty of GANs, nor have they explored the role of GRUs (Gate Recurrent Units) based on time decay in missing value imputation, which could potentially provide another method for electronic medical records imputation. This has motivated us to explore time decay compensation and the UGAN (Uncertainty Generative Adversarial Network), which allow traditional GANs and GRUs to work together and form a new missing value imputation method, UGAN-GRUD (Uncertainty GAN-Gate Recurrent Unit Decay). In UGAN-GRUD, to overcome the challenge of capturing the distribution pattern of high-missing-rate datasets, we introduce the uncertainty matrix U into the GAN to form the UGAN, an improvement not considered in existing methods. To utilize the time interval information and U in the dataset, we introduce the time decay factor \({v}_{t}\) into GRUs to form the GRU-D (Gate Recurrent Unit Decay) network. We propose a dual-network collaborative training mechanism, where the uncertainty matrix U produced by the GAN is utilized to guide the training of GRU-D. Compared to the current state-of-the-art methods, our approach better captures the distribution of high-missing-rate datasets and performs more accurate imputation. Experimental results demonstrate that our method outperforms existing state-of-the-art methods.
In summary, the main contributions of this paper can be summarized as follows:

(1)
We propose the UGAN-GRUD model for the first time, where UGAN is employed to learn spatial distribution patterns and GRU-D is leveraged to learn time series patterns. The combination of the two improves the imputation accuracy on high-missing-rate datasets.

(2)
We propose an improved GAN, called UGAN, which includes Generator G, Discriminator D, and uncertainty matrix U, and can capture the distribution patterns and uncertainty of the dataset more accurately. We also propose an improved GRU based on a time decay factor, called GRU-D, which can further improve the imputation accuracy.

(3)
We theoretically and experimentally demonstrate that the proposed UGAN-GRUD achieves better performance, and also discuss the impact of dataset dimensions on UGAN-GRUD.
The organization of the paper is as follows. Related work is introduced in Sect. 2. The proposed UGAN-GRUD model is detailed in Sect. 3, including the architecture of the model and the design of its components. In Sect. 4, experiments are conducted on three publicly available electronic medical record datasets, and the results are compared and analyzed. Section 5 concludes the study and discusses future research directions.
Related work
In addition to the current state-of-the-art missing value imputation methods based on GANs, there are many other missing value imputation methods. In this section, we review missing value imputation methods from four aspects: statistics, machine learning, deep learning, and electronic medical records.
Imputation methods based on statistics
The imputation methods based on statistics fill missing values with statistical quantities, such as "constant", "mean", and "sampling". For example, Park et al. [4] adopted the missing value imputation method based on "constant" in analyzing sleep data. Robertson et al. [5] designed a missing value imputation method based on "mean". Further, Nickerson et al. [12] designed a missing value imputation method based on adjacent observations. Zhang et al. [13] modeled the probability distribution of data changes and used the probability distribution model to predict missing values. Later, Singh et al. [14] investigated statistical sampling and sample estimation, and proposed a method for filling missing values based on continuous "sampling". In general, imputation methods based on statistics are suitable for discrete data, and the imputation effect is better when the data follow a normal distribution.
Imputation methods based on machine learning
Imputation methods based on machine learning include the K-Nearest Neighbor (KNN) algorithm, shallow neural network methods, and Matrix Factorization (MF) methods, etc. For example, Ma et al. [15] proposed a missing value imputation method based on KNN clustering. The shallow neural network is an early form of neural network model, whose structure is simple and whose number of layers is small. Chen et al. [16] investigated the overfitting problem of neural networks and proposed a neural network method based on streams. Tang et al. [17] employed a fuzzy neural network to classify the data, followed by the KNN method to predict the number of missing values in each category, and finally utilized fuzzy rough sets to fill in the missing values. The MF algorithm attempts to reconstruct the original data by matrix factorization to find the correlations within the data. In recent years, MF-based methods have been introduced into time series data imputation. Generally, MF-based methods decompose a data matrix into two low-dimensional matrices and then attempt to reconstruct the original matrix; during the reconstruction, missing values are filled in. Fernandes et al. [18] proposed an MF-based method for filling missing values in multivariate time series data, and smoothed the filled values and observed values. Rios et al. [19] exploited machine learning methods for cardiovascular disease prediction and evaluated seven methods for filling in missing values. Imputation methods based on machine learning usually rely on prior knowledge of the data, which makes it difficult for them to capture the latent rules in the data. In addition, most machine-learning-based imputation methods emphasize the structure of the data, so they cannot handle unstructured data well.
Imputation methods based on deep learning
The imputation methods based on deep learning exploit the powerful learning capability of deep neural networks to learn latent rules from the dataset, and then predict the missing values. RNNs can process time series data through iterative and scalable neurons, which can remember the sequential relations of time series data well, and thus effectively fill the missing values. Ouyang et al. [20] used an RNN to learn the relation between data and time, and then utilized neural networks to predict the missing values. Cao et al. [9] proposed a supervised-learning-based time series imputation model named BRITS. BRITS assumes that all the labels of time series data are complete; therefore, data without labels are discarded during training. It is worth noting that on datasets with high missing rates, BRITS usually suffers severe overfitting due to the sharp reduction of training samples. Shukla et al. [21] improved the weights of BRITS and proposed the AUCOA model. A GAN can generate new data from the distribution of the original data. Considering that the missing data and the non-missing data in the dataset follow the same distribution law, data can be generated by the GAN to fill in the missing values. Yoon et al. [6] proposed GAIN, a model that fills missing values through a GAN. GAIN exploits the Generator to learn the distribution law of the original data with missing values, and leverages the Discriminator to judge the missing values produced by the Generator. Miao et al. [8] proposed a semi-supervised GAN model named SSGAN for missing value imputation. In SSGAN, a semi-supervised classifier is designed to iteratively classify unlabeled time series data and make the Generator produce predictions for missing values. Methods based on deep learning have greatly improved the accuracy of missing value imputation. However, for datasets with high missing rates, the accuracy is still not high.
Electronic medical record imputation
The problem of electronic medical record imputation is a clinical-application-oriented issue that has gone through three stages of development. Early electronic medical record imputation employed traditional zero-value imputation methods [4], followed by the adoption of machine learning methods [19]. Currently, most electronic medical record imputation employs deep learning methods; for example, the study by Zheng et al. [22] on predicting mortality risk utilized the LSTM-RUN model to fill missing values, where LSTM is a special type of recurrent neural network, and experimental results show that this method is effective. Shi et al. [23] applied the GRU, a simplified version of LSTM, to the learning of clinical time series data and found that the GRU-based method is a fast missing value imputation method. The latest electronic medical record imputation methods are based on GANs, but the GANs have only been applied in a simplistic manner [24, 25]. That is to say, these methods have not considered the specific needs of the electronic medical record field, and thus have not improved the GANs accordingly. As a result, the problem of high missing rates has not been adequately addressed.
To sum up, since the data are seriously missing: if the "deletion" method is utilized, almost all records in the dataset will be removed; if the "mean" or "zero" method is employed, the filled dataset will differ greatly from the original dataset; if a time series method is exploited, a time-based prediction model cannot be established due to the serious data missingness; and if GANs are employed, there will be a large deviation between the learned data distribution law and the real one. Therefore, the existing methods cannot effectively handle electronic medical record imputation with high missing rates.
Imputation based on temporal GAN
We propose an imputation method for electronic medical records based on GAN and temporal relations. The method first exploits the GAN to learn the true distribution of the original data and fills in the missing values with the generated data. Then, it leverages the temporal relations to rectify the filled values.
Problem descriptions

(1)
High-dimensional electronic medical record: refers to an electronic medical record that contains multiple medical features.
Let \(x = \{x_0, x_1, \dots, x_{n-1}\} \in \mathbb{R}^{d\times n}\) denote the electronic medical record dataset, where \(x_0\) represents the observation value of x at time \(t_0\), \(x_1\) represents the observation value at time \(t_1\), and so forth. Each observation value includes d features; for example, \(x_0^j\) represents the \(j\)-th feature value of \(x_0\). In general, when \(d \ge 3\), x is a high-dimensional electronic medical record dataset.

(2)
Missing value mask matrix: used to mark the missing status of high-dimensional electronic medical records.
Let \(m = (m_i^j) \in \mathbb{R}^{d\times n}\) mark the missing status of the electronic medical record dataset x; then
$$m_i^j = \begin{cases} 0, & \text{if } x_i^j \text{ is missing} \\ 1, & \text{otherwise,} \end{cases}$$
where \(m_i^j\) is a flag in the mask matrix: 0 means missing and 1 means observed.

(3)
Missing interval matrix: used to mark time intervals.
Let \(\delta = (\delta_i^j) \in \mathbb{R}^{d\times n}\) denote the time interval matrix of the electronic medical record dataset x; then
$$\delta_i^j = \begin{cases} 0, & i = 0 \\ t_i - t_{i-1}, & i > 0 \text{ and } m_{i-1}^j = 1 \\ t_i - t_{i-1} + \delta_{i-1}^j, & i > 0 \text{ and } m_{i-1}^j = 0, \end{cases}$$
where \(\delta_i^j\) is employed to compensate for time decay, \(m_{i-1}^j\) is the element in the mask matrix, and \(t_i\) and \(t_{i-1}\) are timestamps.
The task of missing value imputation can be described as follows: given the high-dimensional electronic medical record dataset x, the missing value mask matrix m, and the missing interval matrix δ, establish a missing value imputation model and predict the missing data.
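To make the definitions above concrete, the following sketch (our own illustration, not the authors' code) builds the mask matrix m and the time interval matrix δ from an observation array that uses NaN as the missing marker; the interval recurrence accumulates elapsed time across consecutive gaps, matching the definition of δ.

```python
import numpy as np

def mask_and_intervals(x, t):
    """x: (n, d) array with NaN marking missing entries; t: (n,) timestamps.
    Returns the mask m (1 = observed, 0 = missing) and interval matrix delta."""
    m = (~np.isnan(x)).astype(float)
    delta = np.zeros_like(m)
    for i in range(1, x.shape[0]):
        gap = t[i] - t[i - 1]
        # restart the interval after an observation; otherwise accumulate it
        delta[i] = np.where(m[i - 1] == 1, gap, gap + delta[i - 1])
    return m, delta

x = np.array([[1.0, np.nan],
              [np.nan, 2.0],
              [3.0, np.nan]])
t = np.array([0.0, 1.0, 3.0])
m, delta = mask_and_intervals(x, t)
# m[1] = [0, 1]; delta[2] = [3, 2] (the first feature was missing at step 1,
# so its interval accumulates 1 + 2 = 3)
```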
UGANGRUD model
To overcome the imputation problem faced by high-missing-rate electronic medical records, we propose a missing value imputation model based on an uncertainty matrix and a time decay factor, viz., UGAN-GRUD. In UGAN-GRUD, to alleviate the difficulty of learning the distribution law of electronic medical records with high missing rates, we propose a control network, UGAN, based on the uncertainty matrix U, where U is the difference between the generated data and the original data and represents the uncertainty of the GAN. Due to the high missing rates of the original dataset, U changes drastically and its values are uncertain. Considering the accuracy and diversity of imputation, we propose GRU-D based on a time decay factor, where the time decay factor is an operator that uses the time order and time intervals to correct the filled data, and is a function of \({u}_{t}\) (\({u}_{t}\in U\)). The illustration of UGAN-GRUD is shown in Fig. 2.
In the proposed method, the potential distribution of electronic medical records is captured by the Generator, and the output data of the Generator are judged and optimized by the Discriminator. The Generator and the Discriminator form two opposing sides, so they constantly optimize themselves and improve their ability to generate or discriminate; both networks thus become stronger during training. In the time decay compensation process on the right side of Fig. 2, we exploit the temporal dependencies between GRU units and the attenuation matrix to rectify the values filled by UGAN. Since the time intervals between missing values are not necessarily equal, it is necessary to obtain the time interval information. UGAN-GRUD not only considers the correlation of features and the uncertainty of GAN-generated data, but also exploits the temporal correlations.
UGAN
The data generated by ordinary GANs are not accurate enough for filling in missing values in electronic medical records with high missing rates. To alleviate this issue, we propose UGAN, an uncertainty-matrix-based control network that takes into account the dynamics of the data distribution.
Unlike ordinary GANs, UGAN consists of G, D, and U. The input of G is not only z, but a combination of z, x, and m, where x is the original input, z is a random matrix based on x, and m is a mask matrix based on x. In UGAN, to improve the optimization speed of the neural network, tanh() is selected as the activation function of G and D. The raw data are normalized and mapped into [-1.0, 1.0].
At a certain moment, the input of G is \(x_t\), \(m_t\), and \(z_t\), and the output is the data distribution matrix DDM, where the DDM consists of a series of estimated values \({\widehat{x}}_{t}\), as shown in Eq. (3), where \(\odot\) is element-wise multiplication and \({\widehat{x}}_{t}\) is the estimated value of the original input vector \({x}_{t}\). Regardless of whether there are missing values in \({x}_{t}\), G generates estimates in all corresponding dimensions; that is, the non-missing values in x also have corresponding estimates. It should be noted that zero is utilized as a placeholder for missing values in the dataset before the neural network is trained.
To rectify the values of the DDM, it is necessary to replace the corresponding values in the DDM with the non-missing values in x, as shown in Eq. (4):
$$\overline{x}_t = m_t \odot x_t + (1 - m_t) \odot \widehat{x}_t, \qquad (4)$$
where \({\overline{x} }_{t}\) is the corrected vector, \(m_t\) is the corresponding mask vector, \({x}_{t}\) is the corresponding original input vector, and \({\widehat{x}}_{t}\) is the output of Eq. (3). In order to measure the accuracy of the data generated by G, an uncertainty matrix U = \(\{{u}_{1},{u}_{2}, ...,{u}_{t}\}\) is introduced, where U is the difference between the generated vector \(\widehat{x}\) and the original data vector x; that is, at time t, the error between \({x}_{t}\) and \({\widehat{x}}_{t}\) can be calculated by Eq. (5).
In Eq. (5), d is the dimension of the multivariate time series data at time t, and k is the number of observations at time t. Since the values in some dimensions at time t may be missing, \(d\ge k\). \({u}_{t}\) represents the uncertainty of the filled data at time t, and it will be further exploited in the subsequent GRU-D network.
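As a minimal sketch of these two steps (our illustration; the exact form of Eq. (5) is assumed here to be a masked mean absolute error averaged over the k observed dimensions):

```python
import numpy as np

def correct_and_uncertainty(x_t, m_t, x_hat_t):
    """Eq. (4): keep observed values, take generated values elsewhere;
    then compute a per-step uncertainty u_t over the k observed dims."""
    x_bar_t = m_t * x_t + (1.0 - m_t) * x_hat_t
    k = m_t.sum()
    u_t = np.abs(m_t * (x_t - x_hat_t)).sum() / max(k, 1.0)
    return x_bar_t, u_t

x_t = np.array([1.0, 0.0])      # second entry is a zero placeholder (missing)
m_t = np.array([1.0, 0.0])
x_hat_t = np.array([0.5, 2.0])  # Generator output
x_bar_t, u_t = correct_and_uncertainty(x_t, m_t, x_hat_t)
# x_bar_t = [1.0, 2.0]; u_t = |1.0 - 0.5| / 1 = 0.5
```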
D is responsible for judging the accuracy of the generated data. The main task of D is to calculate a probability value between 0 and 1 based on the true label, the original input, and the generated data. UGAN calls the Discriminator twice, once for real data discrimination and once for fake data discrimination. The different outputs of the Discriminator are leveraged to calculate the loss values of the Generator and the Discriminator. Finally, the parameters of the neural network are updated using the backpropagation mechanism. Algorithm 1 describes UGAN in more detail.
During the training of UGAN, samples need to be extracted from the training dataset, and these samples are utilized to generate the mini-batches used in the iterations, denoted as \(\widetilde{x}\), \(\widetilde{m}\), and \(\widetilde{e}\). Briefly, the main steps of the UGAN algorithm are as follows.

(1)
Take the samples \(\widetilde{x}\), \(\widetilde{m}\), and \(\widetilde{e}\);

(2)
Generate the estimates \(\widehat{x}\) according to Eq. (3);

(3)
Calculate the loss function of D using Eq. (6);

(4)
Calculate the loss function of G using Eq. (7)

(5)
Repeat the training within a given number of iterations (n_iter);

(6)
Obtain the electronic medical record dataset with filled values after training UGAN.
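The training loop of steps (1)–(6) can be sketched as follows (a structural outline only; `d_step` and `g_step` stand in for the actual gradient updates of Eqs. (6) and (7), and the noise range is an assumption):

```python
import numpy as np

def train_ugan(x, m, d_step, g_step, n_iter=100, batch_size=4, seed=1024):
    """Skeleton of UGAN training: sample mini-batches, update D, update G."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    history = []
    for _ in range(n_iter):
        idx = rng.choice(n, size=batch_size, replace=False)   # step (1)
        xb, mb = x[idx], m[idx]
        zb = rng.uniform(-0.01, 0.01, size=xb.shape)          # noise input
        d_loss = d_step(xb, mb, zb)                           # step (3), Eq. (6)
        g_loss = g_step(xb, mb, zb)                           # step (4), Eq. (7)
        history.append((d_loss, g_loss))                      # step (5)
    return history

# dummy update steps just to exercise the loop structure
x = np.zeros((10, 3)); m = np.ones((10, 3))
hist = train_ugan(x, m, lambda *a: 0.0, lambda *a: 0.0, n_iter=5)
```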
GRUD
In order to further rectify the missing values filled by UGAN, we propose GRU-D, an iterative and scalable neural network structure. In GRU-D, by introducing a time decay factor, the missing data are filled differently according to the time of their missing, which increases the diversity of missing value imputation. GRU-D provides the corresponding information by memorizing the sequential relations and historical time information of time series data.
For electronic medical records, long stretches of consecutive missing values often arise [24, 25]. For such long-term missing, we attenuate the historical memory vector according to the length of the missing time: if the missing time is long, by the principle of forgetting, the historical information has little influence on the current status, so the historical memory vector should be attenuated greatly; otherwise, if the missing time is short, the historical memory vector should undergo only a small decay. In order to adapt to the missing time intervals of electronic medical records, we propose GRU-D based on a time decay factor, as shown in Fig. 3.
The time decay matrix is composed of time decay factors, which exploit the sequential and historical information between time steps to finely fill in the missing data. Specifically, the time decay matrix \(V({v}_{t})\) is calculated by Eq. (8).
where \({W}_{u}\) is the weight parameter, \({b}_{u}\) is the bias vector, \({u}_{t}\) is the error between \({x}_{t}\) and \({\widehat{x}}_{t}\) at time t, and the range of \({v}_{t}\) is [0, 1]. \({u}_{t}\) is the deviation between the vector generated by UGAN and the original data vector, which can be employed to further improve the diversity and accuracy of the filled values. Therefore, \({u}_{t}\) is introduced into GRU-D to further fill in the missing data by using the temporal associations.
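One plausible form of the decay factor is sketched below (the paper fixes only its inputs and its [0, 1] range; the exact combination of the interval \(\delta_t\) and the uncertainty \(u_t\), in the style of GRU-D's exponential decay, is our assumption):

```python
import numpy as np

def decay_factor(delta_t, u_t, w_d=1.0, w_u=1.0, b_u=0.0):
    """v_t in [0, 1]: decays toward 0 as the elapsed interval delta_t
    and the UGAN uncertainty u_t grow; equals 1 when both are zero."""
    z = np.maximum(0.0, w_d * np.asarray(delta_t) + w_u * u_t + b_u)
    return np.exp(-z)

v_fresh = decay_factor(0.0, 0.0)   # just observed, no uncertainty -> 1.0
v_stale = decay_factor(5.0, 0.3)   # long gap, high uncertainty -> near 0
```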
Obviously, \({v}_{t}\) is leveraged to reflect the reliability of the generated imputation values, which can rectify the attention given to highly biased data generated by G. The estimated value \({x}_{t}^{r}\) of the current sequence can be predicted from the hidden layer state \({\widehat{h}}_{t-1}\).
Based on \({v}_{t}\), \({x}_{t}^{r}\) and \({\overline{x} }_{t}\) are combined to obtain the estimated value \({c}_{t}\) of GRU-D, as shown in Eq. (10).
Finally, the missing values are replaced with the estimated values \({c}_{t}\) to get the complete vector \({x}_{t}^{c}\), as shown in Eq. (11):
$$x_t^c = m_t \odot x_t + (1 - m_t) \odot c_t. \qquad (11)$$
Additionally, the "∘" operator needs to be leveraged to concatenate the complete vector with the corresponding mask vector. For the hidden state \({h}_{t-1}\), \({v}_{t-1}\) is employed to obtain \({\widehat{h}}_{t-1}\). The update of the hidden state at time t, \({h}_{t}\), is shown in Eq. (12).
where \(\sigma\) represents the activation function, \({W}_{h}\) and \({P}_{h}\) are the weight parameters, and \({b}_{h}\) is the bias vector. The specific definition of the loss function is shown in Eq. (13).
In Eq. (13), \({\mathcal{L}}_{MAE}\) denotes the mean absolute error loss, and the meanings of \({x}_{t}\), \({m}_{t}\), and \({c}_{t}\) are the same as described above. Algorithm 2 describes the entire procedure of GRU-D in detail.
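The combination and replacement steps (Eqs. (10) and (11)) can be sketched as follows; the exact weighting in Eq. (10) is our assumption (history-based prediction weighted by \(v_t\), the UGAN estimate by \(1 - v_t\)):

```python
import numpy as np

def grud_fill(x_t, m_t, x_bar_t, x_r_t, v_t):
    """Blend the UGAN estimate x_bar_t with the history-based prediction
    x_r_t via the decay factor v_t, then keep true observations (Eq. 11)."""
    c_t = v_t * x_r_t + (1.0 - v_t) * x_bar_t   # assumed form of Eq. (10)
    return m_t * x_t + (1.0 - m_t) * c_t        # Eq. (11)

x_t = np.array([2.0, 0.0]); m_t = np.array([1.0, 0.0])
x_bar_t = np.array([2.0, 4.0]); x_r_t = np.array([0.0, 6.0])
x_c_t = grud_fill(x_t, m_t, x_bar_t, x_r_t, v_t=0.5)
# observed dim keeps 2.0; missing dim becomes 0.5*6 + 0.5*4 = 5.0
```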
The UGAN-GRUD model includes two parts: the deep neural network UGAN and the deep neural network GRU-D. The former learns the distribution law of the original dataset through the Generator, guides the Generator through the Discriminator, and records the deviation of the filled values through the uncertainty matrix. The latter memorizes the sequential relations and historical time information of the time series data, and then employs the learning capability of the deep neural network to discover the correlations between data. Together, they achieve the goal of improving the imputation accuracy on datasets with high missing rates.
Experiments and analysis
To validate the UGAN-GRUD model, we conducted three experimental studies: (1) a performance study, (2) an ablation study, and (3) an efficiency study. Following existing methods [4, 5, 8, 10, 12, 21], we used the same dataset selection and experimental parameter settings.
Experimental datasets and baseline models
To verify the effectiveness of the UGAN-GRUD model, three publicly available e-health datasets, Healthcare [3], PerfDS1 [26,27,28] and PerfDS2 [28], were used. These electronic medical records are data on human physiological indicators [3, 28]. The datasets are provided by intensive care units and community hospitals, and the indicators involved include body temperature, heart rate, blood sugar content, electrocardiogram, and so forth. The Healthcare dataset has a total of 4,000 records, each 24–36 h long, and is multivariate time series data. Most of the records of the Healthcare dataset are incomplete (components missing); it has an average missing rate of 80.67%, and its related main task is to classify patients. The PerfDS1 dataset has a total of 90,000 records with an average missing rate of 50%, and its continuous missing problem is serious. The PerfDS2 dataset has a total of 12,000 records with an average missing rate of 13%, and there is obvious periodicity in its data.
Following the experiments of the current state-of-the-art methods [8, 21], the training dataset and the test dataset are divided in a ratio of 7:3 and are used for training and testing, respectively. Since missing value imputation based on traditional statistical methods does not require training, it directly enters the testing phase. In order to simulate the massive missing phenomenon, secondary missing processing is required. The method of secondary missing processing [6] is to randomly select a record: if it is a complete record, delete it and mark it as missing data; if it is a record with missing values, select the next record to handle. We employ a normal distribution with a random seed of 1024 to randomly select the serial number/position of the record in the dataset.
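The secondary missing procedure can be sketched as below (our illustration; for simplicity the record index is drawn uniformly rather than from the paper's seeded normal distribution):

```python
import numpy as np

def secondary_missing(m, extra, seed=1024):
    """Delete `extra` complete records from mask m (1 = observed):
    pick a record at random; if complete, mark it fully missing,
    otherwise move on to another candidate."""
    rng = np.random.default_rng(seed)
    m = m.copy()
    removed = 0
    while removed < extra:
        i = int(rng.integers(0, m.shape[0]))
        if m[i].all():            # only complete records are deleted
            m[i] = 0.0
            removed += 1
    return m

m = np.ones((10, 3))
m2 = secondary_missing(m, extra=4)
# exactly 4 of the 10 records are now fully missing
```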
In this research, the models Zero [4], Mean [5], Last [12], KNN [29], STA-GAN [10], AUCOA [21], and SSGAN [8] were selected as the baseline methods for comparison.

∎ Zero [4] model: This is a classic model that uses 0 to fill in missing values.

∎ Mean [5] model: This is also a widely used classic model, characterized by using the global average to fill in missing values.

∎ Last [12] model: This is a widely used model in the field of behavioral data mining, which features the use of the last observations to fill in the missing values.

∎ KNN [29] model: Also called the K-Nearest Neighbor imputation algorithm, it uses the KNN algorithm to find "near neighbor" samples and then employs the weighted average of these samples to fill in missing values.

∎ STA-GAN model [10]: This is a missing value imputation model based on a GAN, which fills missing values through the Hint Matrix mechanism [18].

∎ AUCOA [21] model: This is a time-series neural network model characterized by bidirectional training of the data. One direction arranges the data and trains along increasing time, and the other direction arranges the data and trains along decreasing time. Experiments showed that this bidirectional training can improve the accuracy of missing value imputation for time-series data.

∎ SSGAN [8] model: This is an improved GAN model characterized by iteratively classifying unlabeled time series data through a semi-supervised classifier, which in turn assists the Generator in estimating missing values by using the classified data.

∎ UGAN-GRUD model: The method proposed in this paper.
Since some baseline methods address missing value imputation for general-purpose domains while we address missing value imputation for the biomedical field, we utilize biomedical datasets [3, 28] to re-compare these methods. In the experiments, based on the characteristics of the datasets, we utilized a normal distribution to initialize the parameters of the models. In addition, as in Refs. [8,9,10], the neural network models were set with a batch size of 128 and an iterative period (epoch) of 1000; the Adam optimizer was chosen for stochastic gradient descent training with a learning rate of 0.001, and the Sigmoid was chosen as the activation function to map variables between 0 and 1. To prevent the distribution of the dataset from adversely affecting the training process, all data were normalized so that their means were zero.
Evaluation criteria
To facilitate evaluation and comparison, the Root Mean Squared Error (RMSE) [30] and the Mean Absolute Percentage Error (MAPE) [31] between the ground-truth values and the predicted values are adopted as the evaluation criteria in this paper, as shown in Eqs. (14) and (15):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left(y_i - y_i^{\prime}\right)}^2}, \qquad (14)$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - y_i^{\prime}}{y_i}\right|, \qquad (15)$$
where n represents the number of samples, and \({y}_{i}\) and \({y}_{i}^{\prime}\) denote the ground-truth value and the predicted value at time i, respectively. RMSE and MAPE represent the gap between the original data and the filled data; the smaller the RMSE and MAPE, the better the performance.
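The two criteria can be computed directly (a straightforward implementation of the standard definitions; MAPE assumes no ground-truth value is zero):

```python
import numpy as np

def rmse(y, y_pred):
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

def mape(y, y_pred):
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y - y_pred) / y)) * 100.0)   # percent

# e.g. one error of 2 on the last of three points
err_rmse = rmse([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])   # sqrt(4/3)
err_mape = mape([1.0, 2.0, 4.0], [1.0, 2.0, 2.0])   # (0 + 0 + 50%) / 3
```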
To evaluate the classification effect of the filled data, the Area Under Curve (AUC) metric is adopted in this paper, as shown in Eq. (16). The metric AUC represents the area under the Receiver Operating Characteristic (ROC) curve. The metric AUC is not sensitive to the proportion of positive and negative samples, so the metric AUC can better distinguish the pros and cons of the binary classification models [9].
where \({D}^{+}\) represents the set of all positive samples, \({D}^{-}\) represents the set of all negative samples, and \(f({x}^{+})>f({x}^{-})\) indicates that the prediction score of positive sample \({x}^{+}\) is higher than that of negative sample \({x}^{-}\).
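This pairwise reading of AUC can be sketched directly in code (an illustrative implementation, not the authors'; counting ties as half correct is a common convention the text does not spell out):

```python
from itertools import product

def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) score pairs ranked
    correctly, i.e. f(x+) > f(x-); tied pairs count as half correct."""
    pairs = list(product(pos_scores, neg_scores))
    correct = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p, q in pairs)
    return correct / len(pairs)
```

A perfect ranker, whose every positive score exceeds every negative score, attains an AUC of 1.0 regardless of class proportions, which is the insensitivity property noted above.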
Performance study
Performance of imputation
In the experiments, all evaluation testbeds were implemented in PyTorch. To evaluate the missing value imputation performance of UGAN-GRUD, homogeneous and comparable methods must be selected; in this paper, Zero, Mean, Last, KNN, STA-GAN, AUCOA, and SSGAN were chosen as comparison methods. Meanwhile, to reflect the handling of high-missing-rate datasets, the Healthcare, PerfDS1, and PerfDS2 datasets were subjected to secondary missing, with the missing positions of records selected randomly according to the normal distribution. Following references [6, 8, 9], underlining marks the three best-performing models and bold marks the best-performing model in the experiments. Table 1 shows the imputation performance of different models on the dataset PerfDS1 with different missing rates.
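The secondary-missing step can be sketched as follows. This is a simplified reading: the function name, and the exact way a normal draw selects positions, are our assumptions, since the text only says positions are chosen randomly according to the normal distribution.

```python
import numpy as np

def secondary_missing(x: np.ndarray, extra_rate: float, seed: int = 0) -> np.ndarray:
    """Introduce additional missing entries into an already-sparse matrix.

    Sketch: rank currently observed positions by a standard-normal draw and
    remove the lowest-scoring extra_rate fraction of them."""
    rng = np.random.default_rng(seed)
    x = x.copy()
    observed = np.argwhere(~np.isnan(x))         # currently observed positions
    k = int(round(extra_rate * len(observed)))   # number of entries to drop
    scores = rng.standard_normal(len(observed))  # one normal draw per position
    drop = observed[np.argsort(scores)[:k]]      # k positions to remove
    x[drop[:, 0], drop[:, 1]] = np.nan
    return x
```

Working on a copy keeps the original (ground-truth) matrix intact, so the removed entries remain available for computing RMSE and MAPE against the imputed values.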
It is easy to see from Table 1 that the UGAN-GRUD model achieves the best performance. Compared with the Zero model, the performance of UGAN-GRUD improves greatly, by 50%. The AUCOA model ranks second, but its imputation performance drops drastically as the missing rate increases. UGAN-GRUD achieves an average improvement of 36.2% in RMSE and 39.4% in MAPE over AUCOA, and of 39.2% in RMSE and 41.8% in MAPE over SSGAN. Table 2 shows the imputation performance of different models on the dataset PerfDS2 with different missing rates.
It is easy to see from Table 2 that UGAN-GRUD performs best on the RMSE criterion. In terms of average performance, UGAN-GRUD improves on AUCOA by 19.7% and on SSGAN by 22.8%. On the MAPE criterion, UGAN-GRUD is slightly below AUCOA, because (1) the PerfDS2 dataset is strongly periodic, and UGAN and GRU-D likely disrupt the original temporal regularities of the data; and (2) the initial missing rate of the PerfDS2 dataset is relatively low, so the advantages of UGAN-GRUD cannot be fully exploited. This indicates that UGAN-GRUD is better suited to datasets with randomly distributed and high missing rates. Table 3 shows the imputation performance of different models on the dataset Healthcare with different missing rates.
It can be seen from Table 3 that the UGAN-GRUD model still achieves better performance under heavy data loss, the initial missing rate of the Healthcare dataset being 80.67%. The other models with better performance were STA-GAN and SSGAN, with SSGAN second and STA-GAN third. UGAN-GRUD improved RMSE by an average of 7.7% and MAPE by 8.1% over the STA-GAN model, and RMSE by an average of 4.3% and MAPE by 19.0% over the SSGAN model.
Analysis: The imputation performance experiments show that the UGAN-GRUD model performs well on the Healthcare, PerfDS1, and PerfDS2 datasets. It should be noted that the overall missing rates of Healthcare and PerfDS1 are relatively high, while that of PerfDS2 is relatively low. This indicates that UGAN-GRUD is not only suitable for high-missing-rate datasets but also of reference value for datasets with common missing rates.
Performance of classification and regression
Since the ultimate purpose of electronic medical record imputation is to support decision-making, the classification and regression performance of the filled data needs to be evaluated. As in references [6, 8, 9], we constructed an RNN classifier and an RNN regression predictor and trained them on the filled dataset. The number of training iterations is 30, the learning rate is 0.005, the dropout is 0.5, and the dimension of the hidden state in the RNN is 64. The evaluation criterion for classification is AUC, and that for regression prediction is RMSE.
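A downstream evaluator with the stated hyperparameters might look as follows. This is a hedged sketch in PyTorch (which the experiments use); the class name and the linear head are our assumptions, and the same module serves as classifier or regressor depending on out_dim.

```python
import torch
import torch.nn as nn

class DownstreamRNN(nn.Module):
    """Sketch of the downstream evaluator described in the text: an RNN with
    hidden size 64 and dropout 0.5. Set out_dim to the number of classes for
    classification, or to 1 for regression."""
    def __init__(self, in_dim: int, out_dim: int, hidden: int = 64, p_drop: float = 0.5):
        super().__init__()
        self.rnn = nn.RNN(in_dim, hidden, batch_first=True)
        self.drop = nn.Dropout(p_drop)
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features); h_n: (num_layers, batch, hidden)
        _, h_n = self.rnn(x)
        return self.head(self.drop(h_n[-1]))

# Training setup reported in the text: 30 iterations, learning rate 0.005.
# model = DownstreamRNN(in_dim=35, out_dim=30)  # in_dim/out_dim are placeholders
# optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
```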
The Healthcare dataset, with 30 classes, was used for training and testing the classification task. The PerfDS1 and PerfDS2 datasets were used for training and testing the regression task. Figure 4 shows the classification performance based on the Healthcare dataset.
It is easy to see from Fig. 4 that classification performance is best after the dataset is filled by the UGAN-GRUD method, which is 19.2% higher than the KNN method and 1.4% higher than the SSGAN method. Figure 5 shows the regression task performed on the filled PerfDS1 dataset, where a smaller RMSE indicates a better regression effect.
Figure 6 shows the regression task performed on the filled PerfDS2 dataset, where, again, a smaller RMSE indicates a better regression effect.
It is easy to see from Figs. 5 and 6 that the regression effect differs across filling methods: UGAN-GRUD yields the best effect, followed by SSGAN, with STA-GAN third. Downstream classification and regression quality is inseparable from imputation quality; for example, the methods with better imputation effects (UGAN-GRUD, SSGAN, AUCOA, STA-GAN) also yield better classification and regression results. Dataset imputation is therefore a meaningful endeavor.
Ablation study
Ablation experiments
To explore how the individual improvements in the UGAN-GRUD model affect performance, an ablation study is required: removing the improved parts of UGAN-GRUD and observing the changes in model performance. The key improvements in UGAN-GRUD are twofold, namely UGAN and GRU-D. We used a GAN-based model [6] as the "Base" model. We then added GRU-D to the Base, called "Base + GRU-D", and added UGAN to the Base, called "Base + UGAN". Finally, we added both key improvements together, called "Base + GRU-D + UGAN", which is the UGAN-GRUD model. All neural network parameters were initialized with the same values. Table 4 shows the ablation study results of the UGAN-GRUD model.
Analysis

(1)
From Table 4, it can be seen that adding GRU-D to the Base improves the model's performance. Since GRU-D can mine correlations from time series data, this indicates that GRU-D helps to improve imputation accuracy. Similarly, adding UGAN to the Base improves the model's performance significantly, showing that using an uncertainty matrix to capture the distribution of high-missing-rate datasets is effective.

(2)
Additionally, when both GRU-D and UGAN are added to the Base, the model's performance reaches its optimum. This indicates that the key improvements GRU-D and UGAN are effective not only individually but also in combination, where overall performance is best. In summary, all the improvements of the UGAN-GRUD model are effective, making it a competitive model.
Data dimension study
The impact of data dimension on the model refers to the impact of the number of features in the dataset. To explore this impact on UGAN-GRUD, a dataset with a larger number of features must be selected; because PerfDS1 contains many features, it was chosen to verify the impact of data dimension on UGAN-GRUD. RMSE and MAPE are again the evaluation criteria. Figure 7 shows the impact of data dimension on the UGAN-GRUD model.
Analysis: From Fig. 7, it is easy to see that both RMSE and MAPE remain relatively stable as the data dimension changes. Since the RMSE and MAPE of UGAN-GRUD do not change significantly with the data dimension, the data dimension has little impact on the model. The missing rate, however, moderates this: as Fig. 7 shows, when the missing rate exceeds 80%, the data dimension begins to noticeably affect UGAN-GRUD.
Efficiency study
To evaluate training efficiency, we compared the training times of four models: STA-GAN, AUCOA, SSGAN, and UGAN-GRUD. Table 5 shows their training times on the Healthcare, PerfDS1, and PerfDS2 datasets.
It is easy to see from Table 5 that the GAN family of imputation models trains efficiently. Among them, STA-GAN has the highest training efficiency but the worst performance, while the other improved models are slower to varying degrees. UGAN-GRUD ranks first in performance and second in training efficiency: because it adds the computation of the uncertainty matrix and the training of the GRU-D network, it is not as efficient as STA-GAN, but its performance far exceeds STA-GAN's. Considering both performance and efficiency, UGAN-GRUD is the best choice.
Discussion on scalability and limitations
Missing value imputation restores data in real-world domains and plays an important role in intelligent decision-making. Although the method proposed in this paper is shaped by the characteristics of electronic medical records, it can be tried in other scenarios with high missing rates. For example, in our experiments we applied it to the datasets involved in references [8, 10, 21], among others, and the results show that performance improved to some extent. Since the research task of this paper is missing value imputation for electronic medical records, we have not conducted further research or experimental comparisons on these other scenarios; this will be part of our future work.
Conclusion
The loss of electronic medical records is a common phenomenon with significant research value. In this paper, we propose a missing value imputation model called UGAN-GRUD, based on an uncertainty matrix and a time decay factor. UGAN-GRUD consists of two important components: UGAN, an improvement on the traditional GAN that comprises a generator G, a discriminator D, and an uncertainty matrix U; and GRU-D, an improvement on the traditional GRU that introduces the time decay factor. Our experimental studies show that UGAN-GRUD not only surpasses existing state-of-the-art methods in imputation performance but also performs well in supporting subsequent classification and regression tasks.
A future research direction is to explore the interaction of correlated features [32,33,34] and its impact on imputation performance. We believe this will motivate the discovery of new algorithms.
Availability of data and materials
The data and materials that support the findings of this study are available from the corresponding author upon reasonable request.
References
Mathura BB, Mangathayaru N, Padmaja RB, et al. Mathura (MBI): a novel imputation measure for imputation of missing values in medical datasets. Recent Adv Comput Sci Commun. 2021;14(5):1358–69.
Xie F, Yuan H, Ning YL, et al. Deep learning for temporal data representation in electronic health records: a systematic review of challenges and methodologies. J Biomed Inform. 2022;126:103980.
China Health and Nutrition Survey (CHNS). An open dataset of biomarker data. 2015. https://www.cpc.unc.edu/projects/china/en.
Park S, Li CT, Han S. Learning sleep quality from daily logs. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2019). 2019. p. 2421–9.
Robertson T, Beveridge G, Bromley C. Allostatic load as a predictor of all-cause and cause-specific mortality in the general population: evidence from the Scottish Health Survey. 2017;12(8):1–14.
Yoon J, Jordon J, Schaar M. GAIN: missing data imputation using generative adversarial nets. In: Proceedings of the International Conference on Machine Learning (ICML 2018). 2018. p. 5689–98.
Guo ZJ, Wan YM, Ye H. A data imputation method for multivariate time series based on generative adversarial network. Neurocomputing. 2019;360:185–97.
Miao X, Wu Y, Wang J, et al. Generative semi-supervised learning for multivariate time series imputation. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2021). 2021. p. 8983–91.
Cao W, Wang D, Li J, et al. BRITS: bidirectional recurrent imputation for time series. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS 2018). 2018. p. 6775–85.
Wang SY, Wengen HS, Guan JH, et al. STA-GAN: a spatio-temporal attention generative adversarial network for missing value imputation in satellite data. Remote Sens. 2023;15(1):1–20.
Benchekroun M, Chevallier B, Istrate D, et al. Preprocessing methods for ambulatory HRV analysis based on HRV distribution, variability and characteristics (DVC). Sensors. 2022;22(5):1984.
Nickerson P, Baharloo R, Davoudi A, et al. Comparison of gaussian processes methods to linear methods for imputation of sparse physiological time series. In: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. 2018. p. 4106–9.
Zhang A, Song S, Wang J. Sequential data cleaning: a statistical approach. In: Proceedings of the 2016 International Conference on Management of Data (ICMD 2016). 2016. p. 909–24.
Singh GN, Khalid M, Kim JM. Some imputation methods to deal with the problems of missing data in two-occasion successive sampling. Commun Stat Simul Comput. 2021;50(2):557–80.
Ma Z, Tian H, Liu Z, et al. A new incomplete pattern belief classification method with multiple estimations based on KNN. Appl Soft Comput. 2020;90:106175.
Chen M, Chen C. Optimize neural network algorithm of missing value imputation for clustering chocolate product type following “steams” methodology. In: Proceedings of 35th international conference on computers and their applications (CATA 2020). 2020. p. 230–41.
Tang J, Zhang X, Yin W, et al. Missing data imputation for traffic flow based on combination of fuzzy neural network and rough set theory. J Intel Transp Syst. 2021;25(5):439–54.
Fernandes S, Antunes M, Gomes D, et al. Misalignment problem in matrix decomposition with missing values. 2021 IEEE 8th International Conference on Data Science and Advanced Analytics (DSAA). Porto, Portugal: 2021. https://doi.org/10.1109/DSAA53316.2021.9564115.
Rios R, Miller RJH, Manral N, et al. Handling missing values in machine learning to predict patientspecific risk of adverse cardiac events: Insights from REFINE SPECT registry. Comput Biol Med. 2022;145:1–10.
Ouyang J, Zhang Y, Cai X, et al. ImputeRNN: imputing missing values in electronic medical records. In: Proceedings of 26th International Conference on Database Systems for Advanced Applications (DASFAA 2021). 2021. p. 413–28.
Shukla PK, Stalin S, Joshi S, et al. Optimization assisted bidirectional gated recurrent unit for healthcare monitoring system in bigdata. Appl Soft Comput. 2023;138:1–11.
Zheng H, Shi D. Using a LSTM-RNN based deep learning framework for ICU mortality prediction. In: Proceedings of the 15th International Conference on Web Information Systems and Applications (WISA 2018). 2018. p. 60–7.
Shi Z, Wang S, Yue L, et al. Deep dynamic imputation of clinical time series for mortality prediction. Inf Sci. 2021;579:607–22.
Wu ZJ, Ma C, Shi XH, et al. BRNN-GAN: generative adversarial networks with bidirectional recurrent neural networks for multivariate time series imputation. In: Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS 2021). 2021. p. 217–24.
Cheng CH, Huang SF. A novel clusteringbased purity and distance imputation for handling medical data with missing values. Soft Comput. 2021;25(17):11781–801.
Duhayyim MAI, Al-Wesabi FN, Marzouk R. Integration of fog computing for health record management using blockchain technology. CMC-Comput Mater Continua. 2022;71(2):4135–49.
Lee YK, Pae DS, Hong DK, et al. Emotion recognition with short-period physiological signals using bimodal sparse autoencoders. Intelligent Automation and Soft Computing. 2022;32(2):657–73.
China Health and Retirement Longitudinal Study (CHARLS). An open dataset of CHARLS. 2020. http://charls.pku.edu.cn/en/.
Ahn H, Sun K, Kim KP. Comparison of missing data imputation methods in time series forecasting. CMC-Comput Mater Continua. 2022;70(1):767–79.
Somappa L, Menon AG, Singh AK, et al. A portable system with 0.1-ppm RMSE resolution for 1–10 MHz resonant MEMS frequency measurement. IEEE Trans Instrum Meas. 2020;69(9):7146–57.
Jahan S, Riley I, Walter C, et al. MAPE-K/MAPE-SAC: an interaction framework for adaptive systems with security assurance cases. Futur Gener Comput Syst. 2020;109:197–209.
Long LJ, Yin YF, Huan FL. Hierarchical attention factorization machine for CTR prediction. In: Proceedings of the 27th International Conference on Database Systems for Advanced Applications (DASFAA 2022), vol. 13246 LNCS. 2022. p. 343–58.
Yin YF, Huang CH, Sun JQ. Multi-head self-attention recommendation model based on feature interaction enhancement. In: IEEE International Conference on Communications (IEEE ICC 2022). 2022. p. 1740–5.
Hu YL, Gao FL, Sun YF, et al. Feature interaction based graph convolutional networks for image-text retrieval. In: Proceedings of the 30th International Conference on Artificial Neural Networks (ICANN 2021), vol. 12893. 2021. p. 217–29.
Acknowledgements
We would like to thank the National Natural Science Foundation of China for its support, the study participants for their time and contributions, and our colleagues and collaborators for their feedback and insights.
Funding
This research work has been partially supported by the National Natural Science Foundation of China (61962038), the Fundamental Research Funds for the Central Universities (2023CDJYGRHYB11), and the Open Research Fund of the Guangxi Key Lab of Human–Machine Interaction and Intelligent Decision (gxhiid2208).
Author information
Authors and Affiliations
Contributions
Yunfei Yin designed the main model and wrote the main manuscript text; Zheng Yuan, Islam Tanvir implemented the model and verified it by experiments; Xianjian Bao checked and revised the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Yin, Y., Yuan, Z., Tanvir, I.M. et al. Electronic medical records imputation by temporal Generative Adversarial Network. BioData Mining 17, 19 (2024). https://doi.org/10.1186/s13040-024-00372-2