Here, we have reviewed five statistical methods available to impute missing data in genomic studies. We used coding and non-coding variants extracted from ClinVar to artificially generate three missing data scenarios (MCAR, MAR and MNAR). After testing these five imputation methods alongside a mean-value baseline, we found that kNN (and in most cases RF) best infers missing values.
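The three mechanisms can be illustrated with a small simulation. This is a sketch in Python/NumPy rather than the R used in this work, and the toy data and thresholds are our own: MCAR masks entries uniformly at random, MAR masks one column based on the observed values of another, and MNAR masks a column based on its own values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
scores = rng.normal(size=(n, 2))  # two toy annotation columns

# MCAR: each entry of column 0 is masked with a fixed probability
mcar = rng.random(n) < 0.2

# MAR: column 0 is masked when the *observed* column 1 is large
mar = scores[:, 1] > np.quantile(scores[:, 1], 0.8)

# MNAR: column 0 is masked when its *own* value is large
mnar = scores[:, 0] > np.quantile(scores[:, 0], 0.8)

for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    col = scores[:, 0].copy()
    col[mask] = np.nan
    print(name, "missing fraction:", np.isnan(col).mean())
```

All three masks remove roughly the same fraction of data; only the dependence structure of the missingness differs, which is what the imputation algorithms are sensitive to.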
This is supported by both the single-column (Fig. 1) and the multiple-column approach (Fig. 3). For the former, the algorithm-based approaches (kNN and RF) show similarly small RMSE values across all missing data scenarios, while the rest (Amelia, mice and MI) performed poorly. The mean RMSE difference between these two groups is 0.06, 0.05 and 0.12 for MCAR, MNAR and MAR respectively. The gap is particularly large in MAR, where even the difference between the two best performers, RF and kNN, reaches its maximum (0.02). MAR scenarios are the most complex ones, since their missingness depends on other variables in the matrix; it is therefore expected that predictive algorithms such as kNN and RF perform better at imputing them.
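As a minimal illustration of the single-column comparison, the following sketch uses Python/scikit-learn stand-ins (not the authors' R pipeline; the toy correlated data are hypothetical): one column is masked MCAR, then kNN imputation is compared to a mean baseline via RMSE on the held-out true values.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(1)
n = 500
# two correlated toy features, so that neighbours carry information
x = rng.normal(size=n)
data = np.column_stack([x + 0.1 * rng.normal(size=n),
                        x + 0.1 * rng.normal(size=n)])

# MCAR mask on column 0; keep the true values for scoring
mask = rng.random(n) < 0.2
truth = data[mask, 0].copy()
data_missing = data.copy()
data_missing[mask, 0] = np.nan

def rmse(imputed):
    return np.sqrt(np.mean((imputed[mask, 0] - truth) ** 2))

knn = KNNImputer(n_neighbors=5).fit_transform(data_missing)
mean = SimpleImputer(strategy="mean").fit_transform(data_missing)
print("kNN RMSE:", rmse(knn), "mean RMSE:", rmse(mean))
```

Because the second column is informative about the first, kNN reaches a much lower RMSE than the mean baseline, mirroring the gap between the two method groups in Fig. 1.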
When imputing multiple columns at once (Fig. 3), RF and kNN generate remarkably better imputations than Amelia, mice and MI in all three scenarios. Looking at RMSE distributions, the latter three partially overlap with an imputation-by-the-mean approach, while RF and kNN are clearly non-overlapping. This is a more realistic situation, in which the data structure can truly impact an algorithm's performance. Figure 3 shows how the advantage of algorithm-based methods over a non-algorithmic approach grows as the complexity of the missing data structure increases. These results imply that an algorithm-based approach is preferable to mean-value imputation, especially in complex missing data scenarios. As mentioned earlier, data MNAR is an increasing issue in genomics, particularly when working with non-coding variants' annotation. These results indicate that algorithmic approaches should be preferred to impute missing data in the context of genomic annotation. Furthermore, high-performing algorithms such as RF or kNN likely benefit from the underlying data structures inherent to MNAR and MAR scenarios. Of note, RF loses inference power when simulating an extreme structured missing data scenario (see Sec. 3.3), while kNN still shows good performance (supplementary figure S3).

When looking at columns independently, we notice a block of co-occurring missing values for 54 % of the variants in the following features: CADD, DANN, FATHMM, fitCons, phyloP7, phyloP20, phastCons20, phastCons7, SiPhy, GERP and MutationTaster, limiting the observed values to GWAVA and Kaviar. A principal component analysis shows that 41.7 % of the variance is explained by the first component and 9.0 % by the second. The features correlated with this first component are CADD, DANN, FATHMM, phyloP7, phyloP20, phastCons20, phastCons7, GERP and SiPhy (Supplementary figure S5); these are most features of the missing data block, except for fitCons and MutationTaster.
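The PCA diagnostic described above can be reproduced on toy data with a shared latent factor that mimics a correlated block of conservation-like scores. This is a hedged sketch: the matrix below is synthetic, not the ClinVar annotation matrix, and the group sizes are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
n = 1000
# a strong shared factor drives nine "block" features
# (the CADD/DANN/phyloP analogue); three features are independent
shared = rng.normal(size=(n, 1))
block = shared + 0.5 * rng.normal(size=(n, 9))
rest = rng.normal(size=(n, 3))
X = np.hstack([block, rest])

pca = PCA().fit(X)
print("variance explained by PC1, PC2:", pca.explained_variance_ratio_[:2])

# features loading heavily on PC1 recover the correlated block
loadings = pca.components_[0]
print("mean |loading| (block vs rest):",
      np.abs(loadings[:9]).mean(), np.abs(loadings[9:]).mean())
```

In this construction the first component absorbs the shared factor, so the block features dominate its loadings, which is the pattern reported for the conservation scores in supplementary figure S5.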
Features correlated with the second component are fitCons, MutationTaster, GWAVA and Kaviar. Often, the only observed features in a variant are GWAVA and Kaviar, both correlated with the second component. These two also show some correlation with the first eigenvector (-0.25 and 0.24 respectively), meaning that they carry some information about the group correlated with the first component as well. Having one data point in the first group and one in the second seems to provide enough information for a proper imputation with both algorithmic approaches. When this is not the case, RF fails to capture information from the first principal component, while kNN seems to do so by better estimating the neighbors in the 12-dimensional space.
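The structured scenario, where only two features remain observed for a block of rows, can be sketched as follows (synthetic data; k = 23 follows the kNN parameter used in this work, everything else is illustrative). scikit-learn's `KNNImputer` computes distances over the features that two rows share, which is how two observed columns can still guide neighbor selection in the full feature space.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(3)
n = 400
# a latent factor shared by all 12 toy features, echoing the
# correlated annotation block
z = rng.normal(size=(n, 1))
X = z + 0.3 * rng.normal(size=(n, 12))

# structured missingness: for about half the rows, only the last two
# features (the GWAVA/Kaviar analogues) remain observed
X_missing = X.copy()
block_rows = rng.random(n) < 0.5
X_missing[block_rows, :10] = np.nan

imputed = KNNImputer(n_neighbors=23).fit_transform(X_missing)
rmse = np.sqrt(np.mean((imputed[block_rows, :10] - X[block_rows, :10]) ** 2))
print("kNN RMSE on block-missing features:", rmse)
```

Because the two observed columns are informative about the latent factor, the neighbor-based estimates land well below the error a mean-value imputation would incur on the same entries.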
Considering the features themselves, some performed intrinsically worse than others; e.g. phastCons scores are poorly imputed by all five algorithms, even though phastCons20 is used for phastCons7 imputation. Not only do both features co-occur in the same missing data block in a real example (supplementary table S2), but they are also correlated (0.41) in the variants that do have values. Supplementary table S4 shows the correlations between all features. The poor performance seems to be driven by a U-like distribution in which most values lie at both extremes of the score range (see phastCons20 and phastCons7, Fig. 2).
In contrast, features with a more robust imputation (GERP, phyloP20 and phyloP7) show values concentrated towards the higher end of the distribution (Fig. 2). In these cases, both algorithmic and non-algorithmic approaches tend to perform well (Fig. 1). Again, phyloP20 and phyloP7 are highly correlated (supplementary tables S2 and S4), with a value of 0.7.
Moreover, FATHMM also shows a U-like distribution, with extreme values and heavier tails. In this case, the mean performs even worse than with other U-like distributions: the true FATHMM values have a mean of 0.67 and a variance of 0.13, three times higher than that of other features (see below).
The mean-value approach fits decently for imputations of fitCons, GWAVA and Kaviar. The latter is a frequency column biased towards lower values in this particular data set, since most variants uploaded to ClinVar are of clinical relevance and hence low in frequency. The mean of the Kaviar frequency is 0.018, with a variance of 0.009; good performance when imputing with the mean is therefore expected for this type of feature. This might not hold for more heterogeneous data sets with higher-frequency variants. Similarly, GWAVA has an approximately normal distribution (Fig. 2 C), with a mean of 0.39 and a standard deviation of 0.11; in this case the mean value approximates the vast majority of true values, which lie at the center of the distribution.
fitCons is a fitness score that estimates the probability that a point mutation at each position in the genome influences fitness. Its distribution is centered around 0.62 with a small variance of 0.016, which makes the mean a decent estimator.
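Under MCAR, the expected RMSE of mean imputation equals the feature's standard deviation, which is why tightly concentrated features (fitCons, Kaviar) are served well by the mean while U-like distributions (phastCons, FATHMM) are not. A sketch with synthetic stand-ins (the distribution parameters loosely echo the values reported above; the U-shaped beta is our own choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# U-shaped score (phastCons-like stand-in): mass at both extremes
u_shaped = rng.beta(0.2, 0.2, size=n)
# tightly concentrated score (fitCons-like stand-in): mean 0.62, variance 0.016
tight = rng.normal(loc=0.62, scale=np.sqrt(0.016), size=n)

# imputing with the column mean has expected RMSE equal to the
# column's standard deviation under MCAR
rmse_u = np.sqrt(np.mean((u_shaped - u_shaped.mean()) ** 2))
rmse_tight = np.sqrt(np.mean((tight - tight.mean()) ** 2))
print("U-shaped mean-imputation RMSE:", round(rmse_u, 3))
print("tight mean-imputation RMSE:", round(rmse_tight, 3))
```

Both scores live on the same [0, 1]-ish scale, yet the U-shaped stand-in incurs roughly three times the error, because its mean sits exactly where the fewest true values lie.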
Even though RF slightly outperforms kNN in both multiple-column and single-column imputations, the running time and complexity of that algorithm must be considered. One RF iteration (with parameters ntree = 13, mtry = 2 and parallelize = "forests") took approximately 10 h, while the same iteration for kNN (with k = 23) took approximately 8 h. Therefore, accounting for data size, computing power and time constraints, each user will have to pick the most suitable algorithm accordingly.
It is worth mentioning that the random approach was run with only 1,000 iterations, which might not be sufficiently representative of the whole sampling space.
Moreover, we have worked with around 30,000 variants and 12 features; current genomic data sets can be orders of magnitude larger. For these scenarios, interesting alternatives based on Spark (SparkR, Spark ML) can be used to scale out and improve R performance. Additionally, RF performs very poorly on extreme missing data structures, which are frequent in genomic contexts.
Altogether, we have reviewed several imputation methods and proposed a couple of suitable algorithms to impute genomic annotation. Additionally, we have developed an R package so that users can test their own data.