Machine Learning Algorithms for understanding the determinants of under-five Mortality
BioData Mining volume 15, Article number: 20 (2022)
Abstract
Background
Under-five mortality is a matter of serious concern for child health as well as the social development of any country. This paper aimed to assess the accuracy of machine learning models in predicting under-five mortality and to identify the most significant factors associated with it.
Method
The data were taken from the National Family Health Survey (NFHS-IV) of Uttar Pradesh. First, we used multivariate logistic regression because of its capability for identifying important factors; then we used machine learning techniques, namely decision tree, random forest, Naïve Bayes, K-nearest neighbor (KNN), logistic regression, support vector machine (SVM), neural network, and ridge classifier. Each model’s performance was checked with a confusion matrix, accuracy, precision, recall, F1 score, Cohen’s Kappa, and area under the receiver operating characteristic curve (AUROC). Information gain rank was used to find the important factors for under-five mortality. Data analysis was performed using STATA 16.0, Python 3.3, and IBM SPSS Statistics for Windows, Version 27.0.
Result
Among the machine learning models, the neural network was the best predictive model for under-five mortality, with accuracy (95.29% to 95.96%), recall (71.51% to 81.03%), precision (36.64% to 51.83%), F1 score (50.46% to 62.68%), Cohen’s Kappa (0.48 to 0.60), AUROC (93.51% to 96.22%), and precision-recall curve (99.52% to 99.73%). The neural network was the most efficient model, but logistic regression also performed well in predicting under-five mortality, with accuracy (94% to 95%), AUROC (93.4% to 94.8%), and precision-recall curve (99.5% to 99.6%). The number of living children, survival time, wealth index, child size at birth, birth in the last five years, the total number of children ever born, mother’s education level, and birth order were identified as important factors influencing under-five mortality.
Conclusion
The neural network model was a better predictive model than the other machine learning models in predicting under-five mortality, but logistic regression also showed good results. These models may be helpful for the analysis of high-dimensional data for health research.
Introduction
Under-five mortality is the most widely used indicator to measure the health status of children. It is also an index of the general development of any country. Under-five mortality is the probability of children dying before their fifth birthday. Worldwide, under-five mortality rates are higher in the South Asian and Sub-Saharan African countries. In India, the under-five mortality rate has reduced from 83 deaths per 1000 live births in 2000 to 42 deaths in 2017 [1]. State-wise reports have found that under-five mortality is highest in Uttar Pradesh, followed by Madhya Pradesh and Chhattisgarh [2], as shown in Fig. 1. Although there has been a significant reduction in under-five deaths in these states, it remains a major issue for child health in developing countries like India. Understanding the important factors in explaining childhood mortality is integral to reducing the death rate, but it is not enough.
Machine learning (ML) techniques are now widely used in public health research. Various machine learning models have been used to predict and classify health and biomedical data. These ML models can automatically identify interactions and find nonlinear relationships between the target variable and the independent variables. Machine learning approaches can be utilized to discover the exposures related to health outcomes of interest and the potential interactions between those exposures [3]. Various prediction and classification models, such as regression, logistic regression, principal component analysis (PCA), decision trees, and maximum likelihood methods, have been used to obtain accurate estimates from health data. These approaches could help to obtain early prediction and insight into the important factors for under-five mortality. A study from Ethiopia provides evidence for the use of J48 machine learning and artificial neural network (ANN) techniques to find the causes of child mortality [4]. Another study showed that a machine learning model effectively predicted the undernutrition status of under-five children in the Ethiopian administrative zones [5]. Other studies assessed the performance of machine learning techniques in predicting the risk of neonatal mortality and morbidity [6, 7]. A study used Iterative Dichotomiser 3 (ID3), random forest, and decision tree models to predict the nutritional status of under-five children [8]. Another Indian study predicted nutrient effects on human health using machine learning techniques [9]. In our literature search, we found no published study that used machine learning models to predict under-five mortality. Past studies have also noted the lack of a generic prediction framework for accurately estimating child mortality rates using machine learning techniques.
There is a need for prediction and classification models that provide highly accurate results and allow health researchers to experiment with various sets of factors. This study offers an opportunity to assess the accuracy of machine learning models in studying under-five mortality and to find the important factors with the help of the information gain method.
Methodology
This study’s methods are explained step by step through a framework for under-five mortality prediction. The data analysis was performed in several steps. First, multivariate logistic regression analysis was performed to find the important factors (p < 0.05); thereafter, machine learning models were applied to the dataset. The machine learning framework is portrayed in Fig. 2. All analyses were conducted using Python 3.3, STATA 16.0, and SPSS 27 software.
Importance of ML methods over traditional methods
A study has shown that a machine learning framework can be used to detect significant risk factors of under-five mortality and that deep learning techniques are superior to logistic regression for the classification of child survival [10]. Machine learning models can accurately predict neonatal, perinatal, and infant mortality [11,12,13]. Several studies done to predict the bankruptcy of banks have shown that intelligent techniques (specifically ANN) seem to work more effectively than statistical techniques. ANN and KNN methods perform more effectively than traditional methods [14].
Dataset
The National Family Health Survey (NFHS-IV) is a large-scale, multi-round, cross-sectional, nationally representative survey conducted in households throughout the Indian states and union territories and is one of the most extensive data collection efforts in India. The reports are summarized from the district level up to the state level. The survey collects extensive information on population, health, and nutrition, with an emphasis on women and young children. In this study, we used secondary data from the NFHS-IV survey of Uttar Pradesh, restricted to the target group of under-five children. This dataset has a record for every interviewed woman whose child was born in the five years preceding the survey. It contains information related to the mother’s pregnancy, postnatal care, and health. This file was used to obtain information on child health indicators such as immunization coverage, vitamin A supplementation, recent occurrences of diarrhoea, fever, and cough in young children, and treatment of childhood diseases. A total of 1377 variables were available in this dataset. There were 41,751 samples/individuals in total, of which 2830 were under-five deaths.
Study variables
Following an analytical framework for child survival in developing countries [15], we selected the 19 variables (out of 1377) most relevant to under-five mortality, as most of the variables were not useful for this study. Due to missing values, only 15 variables were used for the analysis, including the outcome/target variable. A missing value is defined as a variable that should have a response but does not, either because the question was not asked (due to interviewer error) or because the respondent did not want to answer. The outcome/target (dependent) variable was under-five mortality, defined as the death of a child before completing 59 months.
The predictor (independent) variables considered in this study were mother’s educational level, births in the last five years, any exposure, currently breastfeeding, total number of living children, wealth index, mass media exposure (MXP), survival time, the total number of children ever born, desire for more children, sex of the child, child size at birth, ANC visits, and birth order.
Data preprocessing
After assembling the final dataset, the next step was to preprocess the data. In this step, duplicates were removed and missing values were handled using the predictive mean matching method. Thereafter, all string and categorical variables were transformed into numerical values.
An important point in data preprocessing is the need to balance the target or outcome variable. In the dataset, the outcome was highly imbalanced: under-five deaths were far fewer than living children (38,921 live children vs 2830 under-five deaths). A random oversampling method was used to balance the target (dependent) variable, after which a ratio of 50:50 was obtained, compared with the original ratio of 93:7.
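Random oversampling, as described above, can be illustrated with a minimal sketch using only the Python standard library. This is not the authors' code: the function name `random_oversample`, the toy data, and the fixed seed are assumptions chosen to mirror the reported 93:7 imbalance.

```python
import random

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority-class rows to sample from")
    # Draw minority indices with replacement to fill the gap between the classes.
    gap = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(gap)]
    keep = majority + minority + extra
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Hypothetical toy data with a 93:7 class ratio (1 = death, 0 = alive).
labels = [0] * 93 + [1] * 7
rows = [[i] for i in range(100)]
balanced_rows, balanced_labels = random_oversample(rows, labels, minority_label=1)
print(balanced_labels.count(0), balanced_labels.count(1))
```

A common precaution in practice is to oversample only the training partition, so that duplicated rows cannot appear in both the training and the test data.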
Feature selection
Feature selection ranks the major risk factors in the dataset according to their importance. The ranking is based on the information gain value calculated for each of the selected variables. In this study, we used a random forest model to find the risk factors or important features that contribute most to child mortality. Higher information gain values indicate variables that are more strongly associated with the class variable. We selected the eight top-ranked variables by information gain and used them later in model building.
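To make the ranking criterion concrete, the sketch below computes information gain (the reduction in class entropy after splitting on a feature) for categorical features. It is an illustration under stated assumptions, not the random-forest procedure used in the study: the helper names and the toy features are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy within each feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def rank_features(features, labels, top_k):
    """Return the top_k feature names ordered by decreasing information gain."""
    gains = {name: information_gain(vals, labels) for name, vals in features.items()}
    return sorted(gains, key=gains.get, reverse=True)[:top_k]

# Hypothetical toy example: 'informative' separates the classes, 'noise' does not.
labels = [1, 1, 0, 0]
features = {
    "noise": ["x", "y", "x", "y"],
    "informative": ["a", "a", "b", "b"],
}
print(rank_features(features, labels, top_k=1))
```

Selecting the top eight variables then amounts to calling `rank_features(..., top_k=8)` on the candidate predictors.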
Model building
Data Splitting
In this step, we split the dataset into training and test data: 70% of the data were used for model training and 30% for model evaluation. We then repeated the analysis with a second split (80% training and 20% test) to get a clearer picture of classification performance. All the independent features were converted to one-hot encoding to build better predictive models. In this study, the dependent variable was binary, i.e., dead/alive. We then used various suitable machine learning models, namely decision tree, random forest, Naïve Bayes, KNN, logistic regression, SVM, neural network, and ridge classifier.
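The splitting and encoding steps can be sketched as follows. This is a minimal standard-library illustration, not the study's code: `train_test_split` and `one_hot` are hypothetical helpers written here, and the category names are invented.

```python
import random

def train_test_split(rows, labels, test_fraction, seed=0):
    """Shuffle row indices and hold out test_fraction of them for evaluation."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

# Hypothetical toy data: 100 rows with a binary outcome.
rows = [[i] for i in range(100)]
labels = [i % 2 for i in range(100)]
(train_x, train_y), (test_x, test_y) = train_test_split(rows, labels, test_fraction=0.30)
print(len(train_x), len(test_x))

# Hypothetical category names for a wealth-index-like variable.
encoded, categories = one_hot(["poorest", "richest", "middle"])
print(categories, encoded)
```

Calling the same function with `test_fraction=0.20` gives the second (80/20) split described above.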
Decision Tree (DT)
The decision tree is one of the most intuitive and straightforward techniques in machine learning and is based on the divide-and-conquer paradigm [16]. In a decision tree, tests (on input patterns) form the inner nodes and categories (of patterns) form the leaf nodes. The technique assigns a class to an input by filtering it down through the tests in the tree [12].
Random Forest (RF)
The random forest algorithm takes hyperparameters such as the number of trees and the maximum depth of each tree. The random forest is an ensemble learning approach for classification that uses a large collection of decorrelated decision trees [17].
Support Vector Machine (SVM)
The SVM is a supervised machine learning technique for analyzing and recognizing patterns in data [18]. It finds a hyperplane that divides the classes, and new observations are assigned to a class according to the side of the hyperplane on which they fall. The support vectors are the data points nearest to this hyperplane.
Logistic Regression (LR)
Logistic regression is a probabilistic statistical classification model that predicts the probability of occurrence of an event. It is used to model a categorical dependent variable, most often a dichotomous outcome. Binary (multinomial) logistic regression is used to predict binary (multi-category) responses [16]. The predictors need to be independent and significantly associated with the outcome variable [19].
Naive Bayes (NB)
Naive Bayes is a simple machine learning algorithm based on the Bayes theorem, and it has a necessary assumption that the attributes are conditionally independent for the given class. Naive Bayes gives competitive classification accuracy [20]. Naïve Bayes is widely applied because of its computational efficiency and desirable features [21].
K Nearest Neighbours (KNN)
KNN is a simple nonparametric classification method that is effective in many cases [22]. To classify a data record ‘t’, its ‘k’ nearest neighbours are retrieved, forming a neighbourhood of ‘t’. Majority voting among the data records in the neighbourhood is usually used to decide the classification of ‘t’, with or without distance-based weighting. When applying KNN, an appropriate value of ‘k’ must be chosen, and the success of the classification depends on this value. There are several methods of determining k, but the simplest is to run the algorithm many times with varying k values and choose the one with the best performance [23].
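The neighbourhood vote and the "try several k values" strategy described above can be sketched in a few lines. This is an illustrative standard-library version with unweighted majority voting and Euclidean distance (both assumptions); the function names and toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k):
    """Return the majority label among the k training rows closest to query."""
    ranked = sorted(
        zip(train_rows, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def choose_k(train_rows, train_labels, val_rows, val_labels, candidates):
    """Try each candidate k and keep the one with the best validation accuracy."""
    def accuracy(k):
        hits = sum(
            knn_predict(train_rows, train_labels, row, k) == label
            for row, label in zip(val_rows, val_labels)
        )
        return hits / len(val_labels)
    return max(candidates, key=accuracy)

# Hypothetical toy data: two well-separated clusters.
train_rows = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_rows, train_labels, (0.5, 0.5), k=3))
print(choose_k(train_rows, train_labels, [(0.5, 0.5), (5.5, 5.5)], [0, 1], [1, 3, 5]))
```

Using an odd k for a binary outcome avoids tied votes.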
Neural network
Neural networks reflect the behavior of the human brain and allow computer programs to find patterns and solve common problems in machine learning, artificial intelligence, and deep learning. An ANN comprises layers of nodes: an input layer, one or more hidden layers, and an output layer [24]. Each node connects to others and has an associated weight and threshold. If the output of an individual node exceeds the given threshold value, that node is activated and sends data to the next layer of the network.
Ridge regression
Ridge regression is a method for estimating the coefficients of multiple-regression models when the independent variables are highly correlated. This method was developed as a possible solution to the imprecision of least squares estimators when there is multicollinearity among the independent variables in the linear regression model [25]. Ridge parameter estimates are more precise because their mean square error and variance are smaller than those of the least squares estimators.
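For reference, the standard ridge estimator can be written as follows. The notation is an assumption introduced here and does not appear in the source: \(\mathbf{X}\) is the design matrix, \(\mathbf{y}\) the response vector, \(\boldsymbol{\beta}\) the coefficient vector, and \(\lambda \ge 0\) the penalty (ridge) parameter.

$$\hat{\boldsymbol{\beta}}_{\mathrm{ridge}}=\underset{\boldsymbol{\beta}}{\arg\min}\left\{\lVert \mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert_{2}^{2}+\lambda\lVert \boldsymbol{\beta}\rVert_{2}^{2}\right\}=\left(\mathbf{X}^{\top}\mathbf{X}+\lambda \mathbf{I}\right)^{-1}\mathbf{X}^{\top}\mathbf{y}$$

Setting \(\lambda = 0\) recovers the ordinary least squares estimator; a positive \(\lambda\) keeps \(\mathbf{X}^{\top}\mathbf{X}+\lambda\mathbf{I}\) well conditioned even when the predictors are highly correlated, which is the source of the reduced variance. A ridge classifier typically applies this penalized regression to a numerically coded class label and thresholds the prediction.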
Evaluation for predictive models
In this study, to identify the best model for under-five mortality, evaluation was conducted using various indices: confusion matrix, sensitivity, specificity, precision, accuracy, F1 score, negative predictive value, Cohen’s Kappa, and AUROC. The details are given below.
Confusion matrix
The confusion matrix visualizes the actual and predicted class accuracies [26]. To examine the performance of a classification algorithm, the confusion matrix compares the predicted classification with the actual classification through four measures: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Their definitions are given below.

True positive (TP) – The model correctly predicts the positive class in the outcome.

True negative (TN) – The model correctly predicts the negative class in the outcome.

False positive (FP) – The model incorrectly predicts the positive class in the outcome.

False negative (FN) – The model incorrectly predicts the negative class in the outcome.

Sensitivity – Sensitivity measures the proportion of actual positive events that the model correctly predicts as positive, i.e., how many positives are predicted out of all positive cases. It is also known as recall and can be calculated by the given formula:
$$\mathrm{Sensitivity/Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
Specificity – Specificity is the proportion of correctly predicted negative outcomes among all actual negative outcomes. It can be calculated by the given formula:
$$\mathrm{Specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}$$
Precision – Precision is the number of correctly predicted positive events divided by the total number of events that the classifier predicts as positive. This is also known as positive predictive value. In this study, it was calculated from the confusion matrix using the formula below:
$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
Negative predictive value – The negative predictive value is defined as the number of true negatives divided by the total number of cases predicted as negative:
$$\mathrm{Negative\ predictive\ value}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FN}}$$
Accuracy – Accuracy is the proportion of correctly classified cases among the total number of cases tested. In this study, it was calculated from the confusion matrix and used to determine model efficacy:
$$\mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$
F1 score – The F1 score (also called the F-measure) combines precision and recall into a single value. A higher F1 score indicates a better model. It is the harmonic mean of precision and recall:
$$\mathrm{F1}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
Cohen’s Kappa – Cohen’s Kappa is a coefficient used to assess the performance of a binary classification model [27]. It is a very useful evaluation statistic when working with imbalanced data. Cohen’s Kappa (\(\kappa\)) is calculated by the given formula:
$$\kappa =\frac{{p}_{o}-{p}_{e}}{1-{p}_{e}}$$
where \({p}_{o}\) is the overall accuracy of the model and \({p}_{e}\) is the agreement between the model predictions and the actual class values that would be expected by chance. A value of 0 represents no agreement beyond chance and 1 represents perfect agreement between classes; negative values, which are possible, indicate agreement worse than chance.
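The measures defined in this subsection can all be derived from the four confusion-matrix counts. The sketch below is a worked illustration using the standard definitions; the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures above from confusion-matrix counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn)          # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)       # positive predictive value
    npv = tn / (tn + fn)             # negative predictive value
    accuracy = (tp + tn) / total
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's Kappa: observed agreement p_o corrected for chance agreement p_e.
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
        "kappa": kappa,
    }

# Hypothetical counts chosen for easy arithmetic (not taken from the paper's tables).
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, chance agreement is 0.5 and accuracy is 0.85, so Kappa is (0.85 − 0.5) / (1 − 0.5) = 0.7, which illustrates how Kappa discounts the accuracy a model would obtain by chance.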
Area under Receiver Operating Characteristic (AUROC) Curve
The receiver operating characteristic (ROC) curve is a probability curve that shows the relationship between sensitivity and specificity by plotting sensitivity (the true positive rate) against 1 − specificity (the false positive rate) across classification thresholds. It is one of the most widely used tools for evaluating binary classification outcomes. The area under the ROC curve (AUC) shows how well the predicted probabilities separate the positive class from the negative class. An AUC value close to 1 indicates good model prediction, a value near 0.5 indicates performance no better than chance, and lower values indicate poor model efficiency. In this study, we used this measure to assess the models’ efficiency.
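The area under the ROC curve has an equivalent rank-based interpretation: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it that way; the scores and labels are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # A correctly ordered pair counts 1, a tied pair counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities for six cases (1 = death, 0 = alive).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(round(auroc(scores, labels), 4))
```

Here eight of the nine positive-negative pairs are ordered correctly, so the AUROC is 8/9; a classifier that separates the classes perfectly scores 1.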
Precision-recall curve
The precision-recall curve plots sensitivity, i.e., recall (x-axis), against precision (y-axis). It is used as an alternative to ROC curves [28]. High precision relates to a low false positive rate, while high recall relates to a low false negative rate. A large area under the curve denotes both high precision and high recall. High scores for both measures indicate that the classifier is returning most of the positive cases (high recall) and that the cases it labels positive are mostly correct (high precision).
Results
Multivariate logistic regression analysis was applied to identify the important factors in the under-five mortality data. Table 1 shows that births in the last five years, breastfeeding status, sex of the child, number of living children, child size at birth, birth order, survival time, children ever born, and desire for more children were important factors for under-five mortality.
The machine learning models, namely decision tree, random forest, Naïve Bayes, KNN, logistic regression, SVM, neural network, and ridge classifier, were applied to build a predictive model of under-five mortality. The eight models were compared under two splits (70% training with 30% validation, and 80% training with 20% validation) using the various evaluation measures, both with all factors and with only the important factors.
All predictive models of under-five mortality were first trained on 70% of the data with all factors and tested on the remaining 30%. The performance of the predictive models was evaluated and compared using the confusion matrix, sensitivity, specificity, precision, accuracy, F1 score, negative predictive value, Cohen’s Kappa, and AUROC. The model evaluation results for the 70% training split are shown in Table 2. The neural network model predicted under-five mortality with the highest accuracy (95.96%), with recall (81.03%), precision (51.83%), F1 score (62.68%), and Cohen’s Kappa (0.60), indicating that it was the best predictive model for under-five mortality compared to the other predictive models. The ROC curve is shown in Fig. 3, and the precision-recall curve is shown in Fig. 4. On both curves the neural network model shows the highest values, AUROC (96.4%) and precision-recall curve (99.7%), again indicating that it is the best predictive model among all models. The second-best model was logistic regression, with 94.5% AUROC and a 99.6% precision-recall curve value.
Next, all predictive models of under-five mortality were trained on 80% of the data with all factors to get a better idea of the accuracy of the models. The model evaluation results for the 80% training split are shown in Table 2. The results again indicated that the neural network model was the best predictive model for under-five mortality compared to the other predictive models. The neural network model predicted under-five mortality with the highest accuracy (95.96%), with recall (79.27%), precision (51.83%), F1 score (62.68%), and Cohen’s Kappa (0.60). The ROC curve is shown in Fig. 5, and the precision-recall curve is shown in Fig. 6. The curves of the neural network model show the highest AUC (93.87%) and the highest precision-recall curve value (99.7%), indicating it is the best predictive model among the models. The second-best model was logistic regression, with 94.8% AUROC and a 99.6% precision-recall curve value.
After that, we used a random forest model to find the risk factors or important features that contributed most to the mortality of under-five children. We used the information gain rank method of the random forest to assess the predictive power of each feature.
We selected only the top eight features for model building. The feature importance results are shown in Fig. 7. The most important determinants of under-five mortality were the number of living children, survival time, wealth index, child size at birth, birth in the last five years, total children ever born, mother’s education level, and birth order, as these ranked highest. We then repeated all procedures using only these important factors to assess the value of the information gain ranking.
All machine learning models, namely decision tree, random forest, Naïve Bayes, KNN, logistic regression, SVM, neural network, and ridge classifier, were trained on 70% of the data with the eight important factors to build a predictive model of under-five mortality.
The models were tested on the remaining 30% of the data. The model evaluation results for the 70% training split are shown in Table 3. The results indicate that the neural network model was the best predictive model for under-five mortality compared to the other predictive models. The neural network model predicted under-five mortality with the highest accuracy (95.31%), with recall (81.03%), precision (36.64%), F1 score (50.46%), and Cohen’s Kappa (0.48). The ROC curve is shown in Fig. 8, and the precision-recall curve is shown in Fig. 9. The curves of the neural network model showed the highest AUC (93.51%) and precision-recall curve value (99.5%), indicating it is the best predictive model among the models. Logistic regression was the second-best model, with 93.3% AUROC and a 99.5% precision-recall curve value.
Next, all predictive models of under-five mortality were trained on 80% of the data with the eight important factors and tested on the remaining 20%. The model evaluation results for the 80% training split are shown in Table 3. The neural network model predicted under-five mortality with the highest accuracy (95.29%), with recall (71.51%), precision (45.05%), F1 score (55.28%), and Cohen’s Kappa (0.53), indicating it is the best predictive model among the models. The ROC curve is shown in Fig. 10, and the precision-recall curve is shown in Fig. 11. The curves of the neural network model show the highest AUC (93.95%) and precision-recall curve value (99.5%), again marking it as the best predictive model among the models. The second-best model was logistic regression, with 94.8% AUROC and a 99.5% precision-recall curve value. Overall, the results indicate that the neural network classifier is the most accurate model for predicting under-five mortality in this predictive analytics framework. The results also suggest that machine learning models give better accuracy than the traditional statistical model and that the information gain ranking identifies the factors of under-five mortality.
Discussion
This study identified the important factors of under-five mortality using logistic regression analysis and machine learning models, and evaluated the usefulness of machine learning techniques for predicting those factors. This is the first study to use machine learning techniques to predict under-five mortality in data from Uttar Pradesh, an Indian state with high under-five mortality. To assess the accuracy of the machine learning models, we applied two different train/test ratios, 70/30 and 80/20, and observed that the 70/30 ratio was the more appropriate one for the models, a result supported by previous studies [29, 30]. This study showed that the neural network is a better predictive model than the other models for under-five mortality data. In the predictive analysis, the neural network model achieved accuracy (95.29% to 95.96%), recall (71.51% to 81.03%), precision (36.64% to 51.83%), F1 score (50.46% to 62.68%), Cohen’s Kappa (0.48 to 0.60), AUROC (93.4% to 96.5%), and precision-recall curve (99.5% to 99.7%). The study also shows that logistic regression is close to the neural network on these data, performing with nearly similar accuracy. However, we were unable to demonstrate that one technique is definitively better than the other. Several research articles found that neural networks were superior to logistic regression [31,32,33]. Other articles found no differences between LR and neural networks, and some found that logistic regression was better than neural networks [34, 35]. It may not be possible to determine which model is superior for every dataset, but the neural network has the ability to detect complex nonlinear relationships and all possible interactions between predictor variables.
A neural network, with its many free parameters, can give impressive results from an overfitted model, whereas logistic regression has less potential for overfitting. It is rare for all variables in a dataset to be useful for developing machine learning models. Adding too many variables to the analysis reduces the competence and accuracy of the models. Thus, feature selection is an important tool in machine learning for finding the factors that are useful in the models.
The information gain method showed that the number of living children, survival time, wealth index, child size at birth, birth in the last five years, total children ever born, mother’s education level, and birth order are the top eight predictors of under-five mortality.
Various studies also confirmed that these factors are crucial for under-five mortality [3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39]. From this study, we can confirm that the wealth index was one of the important factors for under-five mortality, which is in line with other studies [40,41,42]. This study found that survival time was a significant factor in under-five mortality, consistent with earlier work [43, 44]. This study also observed that the mother’s education level was one of the major under-five mortality factors. Previous studies confirmed that the mother’s education plays an important role in reducing the risk of under-five mortality [45,46,47,48]. This may be because educated mothers have better knowledge about health services, care, and immunization for their children. This study found that the number of births in the last five years and birth order are important factors related to under-five mortality.
Previous studies have shown that the likelihood of under-five mortality increases with the number of births in the last five years and the total number of children ever born [49,50,51]. These results are similar to those of studies using the ML approach [52] and traditional methods [53].
Child size at birth has been shown to play a significant role in under-five mortality, and a similar result was found in previous research [54, 55]. One study reported that the neural network has higher predictive accuracy for under-five mortality prediction [56]. The neural network model is more stable in forecasting infant mortality rates than the conventional logistic regression model and also performs more accurately in predicting five-year mortality [57, 58].
This approach can predict and simulate mortality rates in the human population and make accurate predictions of mortality risk for most preterm infants [59, 60]. Previous research also confirms that machine learning methods are better than traditional analysis methods [61, 62]. A previous study reported that machine learning models are more suitable for finding the factors of infant mortality and give a better goodness of fit in the most critical groups [63]. Moreover, machine learning models are very valuable for prediction in health studies and can lead to healthier and more suitable policy decisions.
Study limitation
This study has limitations that stem from using machine learning models rather than statistical models. Unlike a statistical model, a machine learning model produces results without coefficients or odds ratios, so it is difficult to understand how strongly and in which direction each factor affects the outcome. Another limitation is that the research hypothesis must be decided by the researcher, since machine learning models for prediction and classification cannot frame research hypotheses themselves. The results of the study are based on NFHS-IV questionnaire data; the survey was not designed specifically for under-five mortality, nor does it have precise objectives related to it. Several variables in the dataset had missing values, and those variables were not included in the study.
Conclusion
The objective of this study was to apply various machine learning models to under-five mortality data.
This study reports the accuracy of the ML models and identifies the important factors related to under-five mortality.
The neural network model performed best in predicting under-five mortality, with the highest accuracy among the machine learning models in this study. The study also indicates that logistic regression analysis can be useful in predicting under-five mortality, with some limitations. In addition, this study highlighted that some of the variables have an equally significant impact on under-five mortality in both LR and ML models. The number of children, survival time, child size at birth, birth in the last five years, the total number of children ever born, and birth order were found to be the most important factors for under-five mortality. The machine learning models identify important factors that may add to analysis capabilities compared to traditional statistical models. These models may be helpful for the analysis of high-dimensional data for health research.
Availability of data and materials
The National Family Health Survey data are available online. The International Institute for Population Sciences (IIPS), Mumbai, is the nodal agency for the NFHS-4 survey. The data are freely available for research to anyone after registration (http://rchiips.org/nfhs/nfhs4.shtml). The source code is provided as a supplementary file to this article.
References
IIPS, ICF. National Family Health Survey (NFHS-4), 2015–16: India. Mumbai: International Institute for Population Sciences; 2017.
http://rchiips.org/nfhs/NFHS4Reports/India.pdf (accessed on 23/07/2021 at 2.50 PM (IST)).
Patel CJ. Analytic complexity and challenges in identifying mixtures of exposures associated with phenotypes in the exposome era. Current epidemiology reports. 2017;4(1):22–30.
Tesfaye B, Atique S, Elias N, Dibaba L, Shabbir SA, Kebede M. Determinants and development of a web-based child mortality prediction model in resource-limited settings: a data mining approach. Comput Methods Programs Biomed. 2017;140:45–51.
Fenta HM, Zewotir T, Muluneh EK. A machine learning classifier approach for identifying the determinants of under-five child undernutrition in Ethiopian administrative zones. BMC Med Inform Decis Mak. 2021;21:291.
Alves LC, Beluzo CE, Arruda NM, Bressan R, Carvalho T. Assessing the Performance of Machine Learning Models to Predict Neonatal Mortality Risk in Brazil, 2000–2016. medRxiv. 2020.
Jaskari J, Myllärinen J, Leskinen M, Rad AB, Hollmén J, Andersson S, Särkkä S. Machine learning methods for neonatal mortality and morbidity classification. IEEE Access. 2020;8:123347–58.
Thangamani D, Sudha P. Identification of malnutrition with use of supervised data mining techniques–decision trees and artificial neural networks. Int J Eng Comput Sci. 2014; 3(09).
Kuttiyapillai D, Ramachandran R. Improved text analysis approach for predicting effects of nutrient on human health using machine learning techniques. IOSR J Comput Eng. 2014;16(3):86–91.
Adegbosin AE, Stantic B, Sun J. Efficacy of deep learning methods for predicting under-five mortality in 34 low-income and middle-income countries. BMJ Open. 2020;10(8):e034524.
Mangold C, Zoretic S, Thallapureddy K, Moreira A, Chorath K, Moreira A. Machine Learning Models for Predicting Neonatal Mortality: A Systematic Review. Neonatology. 2021;118(4):394–405.
Rahman A, Hossain Z, Kabir E, Rois R. Machine Learning Algorithm for Analysing Infant Mortality in Bangladesh. International Conference on Health Information Science 2021; 205–219.
Shukla VV, Eggleston B, Ambalavanan N, McClure EM, Mwenechanya M, Chomba E, Bose C, Bauserman M, Tshefu A, Goudar SS, Derman RJ. Predictive modeling for perinatal mortality in resource-limited settings. JAMA Netw Open. 2020;3(11):e2026750.
Le HH, Viviani JL. Predicting bank failure: An improvement by implementing a machine-learning approach to classical financial ratios. Res Int Bus Financ. 2018;44:16–25.
Mosley WH, Chen LC. An analytical framework for the study of child survival in developing countries. Popul Dev Rev. 1984;10:25–45.
Podgorski K. Introduction to Data Science Laura Igual and Santi Seguí Springer, 2017.
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Burges CJ. A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc. 1998;2(2):121–67.
Agresti A. Categorical data analysis. John Wiley & Sons; 2003.
Suresh K, Dillibabu R. Designing a machine learning-based software risk assessment model using Naïve Bayes algorithm. TAGA J. 2018;14:3141–7.
Webb GI, Keogh E, Miikkulainen R. Naïve Bayes. Encyclopedia of machine learning. 2010;15:713–4.
Guo G, Wang H, Bell D, Bi Y, Greer K. KNN model-based approach in classification. In: OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"; 2003 Nov 3 (pp. 986–996). Springer, Berlin, Heidelberg.
Muller KR, Mika S, Ratsch G, Tsuda K, Scholkopf B. An introduction to kernel-based learning algorithms. IEEE Trans Neural Networks. 2001;12(2):181–201.
Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE. A survey of deep neural network architectures and their applications. Neurocomputing. 2017;234:11–26.
Gruber MH. Improving efficiency by shrinkage: the James-Stein and ridge regression estimators. Routledge; 2017.
Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27(8):861–74.
Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measur. 1960;20(1):37–46.
Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PloS one. 2015;10(3):e0118432.
Goldstein BA, Navar AM, Carter RE. Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges. Eur Heart J. 2017;38(23):1805–14.
Kotsiantis SB, Zaharakis I, Pintelas P. Supervised machine learning: A review of classification techniques. Emerging artificial intelligence applications in computer engineering. 2007;160(1):3–24.
Zernikow B, Holtmannspoetter K, Michel E, Pielemeier W, Hornschuh F, Westermann A, Hennecke KH. Artificial neural network for risk assessment in preterm neonates. Archives of Disease in Childhood-Fetal and Neonatal Edition. 1998;79(2):F129–34.
Shi HY, Lee KT, Lee HH, Ho WH, Sun DP, Wang JJ, Chiu CC. Comparison of artificial neural network and logistic regression models for predicting in-hospital mortality after primary liver cancer surgery. PloS one. 2012;7(4).
Chen TJ, Hsu YH, Chen CH. Comparison of Neural Network and Logistic Regression Analysis to Predict the Probability of Urinary Tract Infection Caused by Cystoscopy. BioMed Research International. 2022;2022.
Tu JV. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. Journal of clinical epidemiology. 1996;49(11):1225–31.
Steering Committee of the Cardiac Care Network of Ontario*, Tu JV, Weinstein MC, McNeil BJ, Naylor CD. Predicting mortality after coronary artery bypass surgery: what do artificial neural networks learn?. Medical Decision Making. 1998;18(2).
Majumder AK, May M, Pant PD. Infant and child mortality determinants in Bangladesh: Are they changing? J Biosoc Sci. 1997;29(4):385–99.
Hong R, Hor D. Factors associated with the decline of under-five mortality in Cambodia, 2000–2010: Further analysis of the Cambodia Demographic and Health Surveys. Calverton: ICF International; 2013.
Dendup T, Zhao Y, Dema D. Factors associated with under-five mortality in Bhutan: an analysis of the Bhutan National Health Survey 2012. BMC Public Health. 2018;18(1):1–5.
Islam M, Usman M, Mahmood A, Abbasi AA, Song OY. Predictive analytics framework for accurate estimation of child mortality rates for Internet of Things enabled smart healthcare systems. Int J Distrib Sens Netw. 2020;16(5):1550147720928897.
Adegbosin AE, Stantic B, Sun J. Efficacy of deep learning methods for predicting under-five mortality in 34 low-income and middle-income countries. BMJ Open. 2020;10(8):e034524.
Van Malderen C, Amouzou A, Barros AJ, Masquelier B, Van Oyen H, Speybroeck N. Socioeconomic factors contributing to under-five mortality in sub-Saharan Africa: a decomposition analysis. BMC Public Health. 2019;19(1):1–9.
Bizzego A, Gabrieli G, Bornstein MH, Deater-Deckard K, Lansford JE, Bradley RH, Costa M, Esposito G. Predictors of contemporary under-5 child mortality in low- and middle-income countries: a machine learning approach. Int J Environ Res Public Health. 2021;18(3):1315.
Kandala NB, Ghilagaber G. A geo-additive Bayesian discrete-time survival model and its application to spatial analysis of childhood mortality in Malawi. Qual Quant. 2006;40(6):935–57.
Pedersen J, Liu J. Child mortality estimation: appropriate time periods for child mortality estimates from full birth histories. 2012.
Bitew FH, Nyarko SH, Potter L, Sparks CS. Machine learning approach for predicting under-five mortality determinants in Ethiopia: evidence from the 2016 Ethiopian Demographic and Health Survey. Genus. 2020;76(1):1–6.
Campbell AA, de Pee S, Sun K, Kraemer K, Thorne-Lyman A, Moench-Pfanner R, Sari M, Akhter N, Bloem MW, Semba RD. Relationship of household food insecurity to neonatal, infant, and under-five child mortality among families in rural Indonesia. Food Nutr Bull. 2009;30(2):112–9.
Kembo J, Van Ginneken JK. Determinants of infant and child mortality in Zimbabwe: Results of multivariate hazard analysis. Demogr Res. 2009;21:367–84.
Mandal S, Paul P, Chouhan P. Impact of maternal education on under-five mortality of children in India: insights from the National Family Health Survey, 2005–2006 and 2015–2016. Death Stud. 2021;45(10):788–94.
Abir T, Agho KE, Page AN, Milton AH, Dibley MJ. Risk factors for under-5 mortality: evidence from Bangladesh Demographic and Health Survey, 2004–2011. BMJ Open. 2015;5(8):e006722.
Amoroso CL, Nisingizwe MP, Rouleau D, Thomson DR, Kagabo DM, Bucyana T, Drobac P, Ngabo F. Next wave of interventions to reduce under-five mortality in Rwanda: a cross-sectional analysis of demographic and health survey data. BMC Pediatr. 2018;18(1):1–1.
Kayode GA, Adekanmbi VT, Uthman OA. Risk factors and a predictive model for under-five mortality in Nigeria: evidence from Nigeria demographic and health survey. BMC Pregnancy Childbirth. 2012;12(1):1–1.
Panesar SS, D’Souza RN, Yeh FC, Fernandez-Miranda JC. Machine learning versus logistic regression methods for 2-year mortality prognostication in a small, heterogeneous glioma database. World neurosurgery: X. 2019;2:100012.
Hemo SA, Rayhan MI. Classification tree and random forest model to predict under-five malnutrition in Bangladesh. Biom Biostat Int J. 2021;10(3):116–23.
Budu E, Ahinkorah BO, Ameyaw EK, Seidu AA, Zegeye B, Yaya S. Does birth interval matter in Under-Five mortality? Evidence from demographic and health surveys from eight countries in West Africa. BioMed Research International. 2021;2021.
Adeyinka DA, Muhajarine N. Time series prediction of under-five mortality rates for Nigeria: comparative analysis of artificial neural networks, Holt-Winters exponential smoothing and autoregressive integrated moving average models. BMC Med Res Methodol. 2020;20(1):1–1.
Nyoni SP, Nyoni T. Forecasting infant mortality rate in Gabon using artificial neural networks. International Research Journal of Innovations in Engineering and Technology. 2021;5(3):592.
Shi HY, Lee KT, Wang JJ, Sun DP, Lee HH, Chiu CC. An artificial neural network model for predicting 5-year mortality after surgery for hepatocellular carcinoma: a nationwide study. J Gastrointest Surg. 2012;16(11):2126–31.
Hainaut D. A neural-network analyzer for mortality forecast. ASTIN Bulletin: The Journal of the IAA. 2018;48(2):481–508.
Zernikow B, Holtmannspoetter K, Michel E, Pielemeier W, Hornschuh F, Westermann A, Hennecke KH. Artificial neural network for risk assessment in preterm neonates. Arch Dis Child Fetal Neonatal Ed. 1998;79(2):F129–34.
Bhattacharjee B. Child Health in India: An Application of Machine Learning. Turkish Journal of Computer and Mathematics Education (TURCOMAT).2021;12(8):2122–7.
Dwomoh D, Amuasi S, Agyabeng K, Incoom G, Alhassan Y, Yawson AE. Understanding the determinants of infant and under-five mortality rates: a multivariate decomposition analysis of demographic and health surveys in Ghana, 2003, 2008 and 2014. BMJ Glob Health. 2019;4(4):e001658.
Caluza LJB. Machine Learning Algorithm Application in Predicting Children Mortality: A Model Development. Int J Inf Sci Appl. 2018;1(1–6).
Ashrafian H, Darzi A. Transforming health policy through machine learning. PLoS Med. 2018;15(11): e1002692.
Author information
Contributions
RKS developed the concept of the paper, analysed the data, developed the algorithm for model evaluation and comparison, and wrote a major part of the manuscript. PKY helped write the manuscript, cleaned the data, and ran the appropriate analyses in the software. RS and ONC helped with the construction of the manuscript and the coding of the dataset, and provided comments for revising the manuscript. All authors read and approved the final paper.
Ethics declarations
Ethics approval and consent to participate
This study analyzed a secondary data set and had no identifiers of the survey participants. This dataset is easily available in the public domain for research purposes; hence no approval was required from any institutional review board as there is no question of human subject protection arising in this case.
Consent for publication
Not applicable.
Competing interests
The authors declared that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Saroj, R.K., Yadav, P.K., Singh, R. et al. Machine Learning Algorithms for understanding the determinants of underfive Mortality. BioData Mining 15, 20 (2022). https://doi.org/10.1186/s13040022003088
DOI: https://doi.org/10.1186/s13040022003088
Keywords
 Under-five mortality
 Machine learning
 Random Forest
 Neural Network
 Accuracy