Development of glaucoma predictive model and risk factors assessment based on supervised models

Objectives To develop and propose a machine learning model for predicting glaucoma and identifying its risk factors. Methods A data analysis pipeline was designed for this study based on the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology. The main steps of the pipeline are data sampling, preprocessing, classification, and evaluation and validation. Data sampling for building the training dataset was performed with balanced sampling based on over-sampling and under-sampling methods. The data preprocessing steps were missing value imputation and normalization. In the classification step, several machine learning models were designed for predicting glaucoma, including Decision Trees (DT), K-Nearest Neighbors (K-NN), Support Vector Machines (SVM), Random Forests (RF), Extra Trees (ET), and bagging ensemble methods. Moreover, in the classification step, a novel stacking ensemble model was designed and proposed using the superior classifiers. Results The data came from the Shahroud Eye Cohort Study, which collected demographic and ophthalmologic data for 5190 participants aged 40-64 living in Shahroud, northeast Iran. The dataset comprises 67 demographic, ophthalmologic, optometric, perimetry, and biometry features for 4561 people, including 4474 non-glaucoma participants and 87 glaucoma patients. Experimental results show that DT and RF trained on the under-sampled training dataset outperform the compared single classifiers and bagging ensemble methods, with average accuracies of 87.61 and 88.87, sensitivities of 73.80 and 72.35, specificities of 87.88 and 89.10, and areas under the curve (AUC) of 91.04 and 94.53, respectively. The proposed stacking ensemble has an average accuracy of 83.56, a sensitivity of 82.21, a specificity of 81.32, and an AUC of 88.54.
Conclusions In this study, a machine learning model was proposed and developed to predict glaucoma among persons aged 40-64. The top predictors identified for discriminating non-glaucoma persons from glaucoma patients were the number of visual field defects on perimetry, vertical cup-to-disc ratio, white-to-white diameter, systolic blood pressure, pupil barycenter on the Y coordinate, age, and axial length.

improve glaucoma diagnosis and management accuracy. In this study, the main aim is to design and develop machine learning models for glaucoma prediction. For this purpose, demographic characteristics, optometry, biometry, perimetry features, and ophthalmologic examination results are used as the input variables.
The novelty of this study is several-fold: (1) proposing and designing a two-step classification task, in which the first step trains the base (single) classifiers on the training dataset, evaluates them on a held-out subset of the training data (the validation dataset), and identifies the superior classifiers, while the second step designs a novel stacked ensemble built from those superior classifiers; (2) using comprehensive ophthalmology features such as perimetry and biometry to develop the glaucoma prediction model without any fundus features; (3) using features that come from non-interventional ophthalmologic examinations; (4) addressing a highly imbalanced dataset, with 87 instances in the glaucoma class (1.9 % of all instances), without generating artificial instances; and (5) analysing a cohort dataset.
Proposing and developing models that screen for and diagnose diseases using structured datasets and complex data, such as medical images and gene expression data, can assist physicians in the early diagnosis and management of glaucoma [10, 11, 31, 32]. Table 1 summarizes the related works on model development for screening and/or diagnosis of glaucoma.
As listed in Table 1, several previous studies have proposed models for diagnosing and screening glaucoma using different features and datasets. These models can assist physicians in early diagnosis and decision-making. In most previous studies, fundus images were the primary input data for developing glaucoma diagnosis and screening models. A few previous studies used genome data to improve such models, because genetics and race are two common glaucoma risk factors identified in previous research. Although fundus images and genome data yield excellent performance in diagnosing glaucoma, these data types increase the cost and complexity of the models. Combining these two data types with structured data from other ophthalmologic examinations and clinical data can improve model performance.
This study used different features, including demographic characteristics, optometry, biometry, perimetry, and ophthalmologic examination results, to support glaucoma prediction.

Materials and methods
In this study, the Shahroud Eye Cohort Study dataset [34] is analyzed for predicting glaucoma based on the Cross-Industry Standard Process for Data Mining (CRISP-DM) [35] methodology. Figure 1 shows the main steps of this study and the proposed method for predicting glaucoma.
All preprocessing, modeling, and evaluation were performed in Python; visualization was done in Python and Microsoft Excel.
The steps shown in Fig. 1 are described in more detail in the following subsections:

Data description
The Shahroud Eye Cohort Study was started in 2009 to detect and diagnose visual impairments and eye diseases in Shahroud city, northeast Iran [34].
In the first phase of the Shahroud Eye Cohort Study, 6311 people aged 40-64 years were selected by random cluster sampling. Among them, 5190 individuals, including 2990 females, participated in the study. Fundus imaging and perimetry examinations were performed for 4694 participants, and ophthalmologists assessed the fundus and perimetry images for glaucoma detection. According to this study, the prevalence of glaucoma was estimated at 1.92 % of the population [5]; among the examined persons, 89 participants were diagnosed as glaucoma patients. Table 2 lists the demographic characteristics and the optometry, ophthalmology, biometry, and perimetry examinations for the participants in the first phase of the Shahroud Eye Cohort Study.
However, participants with more than 30 % missing values were eliminated from this study. Therefore, 67 variables describing 4474 non-glaucoma persons and 87 glaucoma patients are considered in this study.

Data preprocessing
The dataset should be partitioned into two non-overlapping datasets: a training dataset and a test dataset. For this purpose, K-fold Cross-Validation (C.V.) was used with K = 5 and K = 10. The main data preprocessing and preparation steps performed in this study fall into two categories: data cleaning and balanced sampling. The data cleaning steps are outlier detection and removal, missing value handling, data normalization, and one-hot encoding. Three scenarios were evaluated and compared with respect to data balancing. The first scenario used the imbalanced data as-is; the second and third scenarios used over-sampling and under-sampling strategies, respectively, to balance the training dataset.
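As an illustration of the partitioning step, the following sketch shows how K-fold C.V. splits sample indices into non-overlapping training and test sets (a minimal pure-Python sketch; the study's own implementation is not shown in the text, and the sample count below simply mirrors the 4561 participants):

```python
import random

def k_fold_indices(n_samples, k=5, seed=42):
    """Partition sample indices into k non-overlapping folds.

    Each fold serves once as the test set while the remaining
    folds form the training set (the study used K=5 and K=10).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

splits = k_fold_indices(4561, k=5)
```

Every participant appears in exactly one test fold, so performance averaged over the folds uses each observation exactly once for testing.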
For outlier detection, the numerical variables were analyzed using the interquartile range (IQR), a commonly used outlier detection method [36]. According to this method, no outliers were detected in the numerical variables.
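The IQR rule can be sketched as follows (a minimal pure-Python version with the conventional 1.5 × IQR fences; the study does not state its exact fence factor, so 1.5 is an assumption):

```python
def iqr_outliers(values, factor=1.5):
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    ordered = sorted(values)
    n = len(ordered)

    def quantile(q):
        # Linear interpolation between order statistics.
        pos = q * (n - 1)
        lo = int(pos)
        frac = pos - lo
        hi = min(lo + 1, n - 1)
        return ordered[lo] * (1 - frac) + ordered[hi] * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

print(iqr_outliers([10, 12, 11, 13, 12, 95]))  # -> [95]
```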
Categorical nominal variables were converted to dummy binary variables. Missing value imputation was performed for variables with a missing value rate lower than 30 %, and the remaining variables were excluded from the study. Missing values were replaced with the mean for numerical variables and the mode for binary variables.
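The mean/mode imputation rule above can be sketched per column as follows (pure Python; `None` stands in for a missing entry):

```python
from collections import Counter

def impute_column(values, kind="numeric"):
    """Replace missing (None) entries with the column mean for
    numeric variables or the mode for binary/categorical ones."""
    observed = [v for v in values if v is not None]
    if kind == "numeric":
        fill = sum(observed) / len(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

print(impute_column([1.0, None, 3.0]))            # -> [1.0, 2.0, 3.0]
print(impute_column([0, 1, 1, None], kind="binary"))  # -> [0, 1, 1, 1]
```

In practice the fill statistics should be computed on the training folds only and then applied to the test fold, to avoid information leakage across the C.V. split.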
The features in the analysed dataset have different units, come from different examinations, and span different value ranges. Some learning algorithms, such as distance-based methods like K-Nearest Neighbors (K-NN) and kernel machines like Support Vector Machines (SVM), are sensitive to feature ranges; this sensitivity can bias models toward features with larger variations. Min-Max normalization was therefore applied to prevent variables with wide ranges from dominating those with narrow ranges and to improve the evaluation metrics.
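Min-Max normalization maps each feature column onto [0, 1], which can be sketched as:

```python
def min_max_scale(values, eps=1e-12):
    """Rescale a feature column to the [0, 1] range via
    (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span < eps:  # constant column: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

print(min_max_scale([40, 52, 64]))  # -> [0.0, 0.5, 1.0]
```

As with imputation, the minimum and maximum should be taken from the training folds and reused on the test fold.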
Since the glaucoma prevalence in the analyzed dataset was about 1.9 %, the class distribution was significantly imbalanced: more than 98 % of the considered persons belong to the non-glaucoma group (the majority class), and only 1.9 % are assigned to the glaucoma class (the minority class). Previous studies have shown that classifiers trained on imbalanced datasets can achieve high accuracy on the majority class, while the minority class cannot be learned with high accuracy [37].
To overcome the imbalanced class distribution in this study's dataset, three strategies applied in previous studies [37] were considered: under-sampling, over-sampling, and a combination of over-sampling and under-sampling.
In this study, three scenarios were designed and compared to address the challenges of the imbalanced dataset. The first sampled from the data without balancing the class distribution (Scenario 1). The second used over-sampling of the minority class (Scenario 2). The last used the balanced bagging ensemble method, which was proposed to overcome imbalanced-dataset challenges in a previous study (Scenario 3) [38]. A bagging ensemble with a large number of estimators can guarantee that the majority-class observations each contribute to training at least one of the estimators.
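The two resampling ideas can be sketched as follows (pure Python; `balance_by_sampling` is an illustrative helper, not the study's code, and over-sampling here only duplicates existing minority rows, consistent with the study's decision not to generate artificial instances):

```python
import random

def balance_by_sampling(X, y, strategy="under", seed=0):
    """Balance a binary dataset by random under- or over-sampling.

    'under' keeps a random majority-class subset the size of the
    minority class; 'over' duplicates randomly chosen existing
    minority rows until the class counts match.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if strategy == "under":
        keep = rng.sample(majority, len(minority)) + minority
    else:
        extra = [rng.choice(minority)
                 for _ in range(len(majority) - len(minority))]
        keep = majority + minority + extra
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data with a 5 % minority class, echoing the cohort's imbalance.
X = [[i] for i in range(100)]
y = [1] * 5 + [0] * 95
Xu, yu = balance_by_sampling(X, y, strategy="under")
```

The balanced bagging ensemble of Scenario 3 applies the under-sampling step independently inside each bootstrap, so many estimators together cover the whole majority class.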

Modeling
The classifiers considered in this study include the well-known Decision Trees (DT) [39] and Support Vector Machines, as well as ensemble classifiers with DT as their base classifier, such as Extra Trees (ET) [40] and Random Forests (RF) [41]. These classifiers have different advantages and were used as base classifiers in the three scenarios. Grid search was used to tune the classifiers' hyperparameters and to choose the resampling ratio. For DT and the DT-based ensembles RF and ET, the splitting criterion was 'Information Gain' and the number of trees was 200. The SVM kernel function was the 'Radial Basis Function (RBF)' in Scenarios 1 and 2 and 'Polynomial' in Scenario 3. The number of neighbours in K-NN was seven, and the distance metric was 'Euclidean'. In Scenario 2, the over-sampling ratio differed per classifier and was determined by grid search. In Scenario 3, the number of estimators of the balanced bagging ensemble method was 300; this number of base estimators guarantees that every sample in the training set contributes to training at least one of the estimators and helps avoid overfitting.
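Grid search itself is an exhaustive sweep over hyperparameter combinations. A minimal sketch (the `score_fn` callback and the toy grid below are illustrative stand-ins for the study's validation-set evaluation):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every hyperparameter combination and return the
    best-scoring one along with its score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy example: a scoring function that peaks at 200 trees.
grid = {"n_estimators": [100, 200, 300], "criterion": ["entropy", "gini"]}
best, _ = grid_search(grid, lambda p: -abs(p["n_estimators"] - 200))
print(best["n_estimators"])  # -> 200
```

In the study this sweep would be run per classifier and per scenario, scoring each combination on the validation dataset.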

Evaluation
As illustrated in Fig. 1, evaluation tasks include the K-fold C.V. strategy for sampling from data, choosing the best scenario for handling imbalanced data, choosing the best classifiers, identifying the important features, building the proposed stacking ensemble model, and error analysis.
As mentioned in the preprocessing subsection, the data were partitioned into training and test datasets based on the K-fold C.V. strategy for K = 5 and K = 10. The best scenario for handling the imbalanced data and the best classifiers were chosen by comparing different combinations of scenarios and classifiers based on their performance on the validation dataset. Then, the important features ranked by the best classifiers were identified. Afterward, the proposed two-layered stacking ensemble classifier was built, in which the first layer includes different classifiers and the second layer uses a logistic regression model.
For error analysis and for evaluating and comparing the classifiers and the scenarios used for balanced sampling, the standard performance measures calculated and reported were accuracy, sensitivity, specificity, F1-score, and the area under the receiver operating characteristic (ROC) curve (AUC), as defined in Eqs. (1)-(4).
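Eqs. (1)-(4) are not reproduced in this excerpt; the sketch below computes the four threshold-based measures from a binary confusion matrix using their standard definitions, which we assume match the study's equations (AUC is computed from the ROC curve rather than a single formula):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity (recall), specificity, and F1-score
    from a binary confusion matrix, glaucoma as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}

m = classification_metrics(tp=40, tn=50, fp=5, fn=5)
print(m["accuracy"])  # -> 0.9
```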
Experimental results
To evaluate the proposed scenarios, every classifier in each scenario was executed 30 times, and the performance measures, including accuracy, sensitivity, specificity, and F1-score, are reported in Table 3.
As illustrated in Table 3, the classifiers of the last scenario performed best for predicting both the non-glaucoma and glaucoma classes. Decision Trees and Random Forests, which outperformed the other classifiers, were used as the base classifiers of the balanced bagging ensemble.
The ROC curves are illustrated in Fig. 2 to compare different scenarios and classifiers.
Given the reasonable accuracy of DT and RF in the last scenario, the top-ranked features were selected based on the feature importance scores assigned to the variables by DT and RF. The 10 top-ranked variables in the first quartile for glaucoma prediction from the Shahroud Eye Cohort dataset, based on their average score over 300 executions of the classifiers, are listed in Table 4.
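Averaging importance scores over repeated executions can be sketched as follows (pure Python; the per-run dictionaries and feature names below are toy placeholders for the 300 DT/RF executions):

```python
from collections import defaultdict

def average_importance(runs, top_k=10):
    """Average per-run feature importance scores over all runs and
    return the top_k (feature, mean score) pairs."""
    totals = defaultdict(float)
    for run in runs:
        for feature, score in run.items():
            totals[feature] += score
    means = {f: s / len(runs) for f, s in totals.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

runs = [{"VCDR": 0.5, "Age": 0.3, "AL": 0.2},
        {"VCDR": 0.4, "Age": 0.4, "AL": 0.2}]
print(average_importance(runs, top_k=2))  # -> [('VCDR', 0.45), ('Age', 0.35...)]
```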
As shown in Table 4, eight of the 10 top-ranked features identified by DT and RF are common to both classifiers: NVD, VCDR, WTW, Systolic BP, PCY, AST, Age, and AL, which come from different examination sources.
Finally, a novel two-layered stacking ensemble classifier is proposed, in which the first layer combines the two superior classifiers of the last scenario and the second layer uses logistic regression. Table 5 shows the performance measures of the proposed stacking ensemble classifier for glaucoma prediction.
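The two-layer design can be sketched with scikit-learn's stacking API (a hedged sketch, not the study's exact code: the hyperparameters shown and the synthetic toy data are illustrative assumptions):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# First layer: the two superior classifiers (DT and RF);
# second layer: logistic regression on their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(criterion="entropy", random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=200, criterion="entropy",
                                      random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Toy imbalanced data standing in for the cohort features.
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
stack.fit(X, y)
proba = stack.predict_proba(X)
```

`cv=5` makes the meta-learner train on cross-validated first-layer predictions, which limits leakage from the base classifiers into the second layer.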

Discussion
To assess the performance of the proposed stacking ensemble classifier, 30 executions of 10-fold C.V. were analyzed. According to the experimental results, 3200 persons of the non-glaucoma class (71.5 %) and 59 glaucoma patients (67.8 %) were correctly predicted in all executions. On the other hand, 368 persons of the non-glaucoma class (8.2 %) and 8 % of the glaucoma class were misclassified in all executions. Figure 3 indicates in how many of the executions each instance was predicted correctly.
Based on Fig. 3, a novel confusion matrix (NCM) is proposed using thresholds of at least 0 %, 30 %, 60 %, 90 %, and 100 % of the executions in which an instance must be predicted correctly, as shown in Table 6. For example, with a threshold of 60 %, TP and TN count the instances correctly classified as positive (glaucoma) and negative (non-glaucoma), respectively, in at least 18 of the 30 executions.
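The NCM threshold rule can be sketched as follows (pure Python; `ncm_counts` is an illustrative helper assuming the standard convention that glaucoma is the positive class):

```python
def ncm_counts(correct_counts, labels, n_runs=30, threshold=0.6):
    """Count TP/TN/FP/FN under the NCM rule: an instance counts as
    correctly classified if it was predicted correctly in at least
    `threshold` of the `n_runs` executions."""
    need = threshold * n_runs
    tp = tn = fp = fn = 0
    for count, label in zip(correct_counts, labels):
        ok = count >= need
        if label == 1:          # glaucoma (positive)
            tp += ok
            fn += not ok
        else:                   # non-glaucoma (negative)
            tn += ok
            fp += not ok
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Four instances with their correct-prediction counts over 30 runs.
print(ncm_counts([30, 20, 10, 30], [1, 1, 0, 0], threshold=0.6))
```

With a 0.6 threshold the cutoff is 18 of 30 runs, matching the example in the text.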
The results shown in Table 5 are similar to those corresponding to the 60 % threshold in Table 6. According to the experimental results described in this section, data instances that no classifier can predict correctly negatively impact classifier performance. FN instances are glaucoma cases that the classifier has misclassified with the non-glaucoma class label.
The averages of the top-ranked features for the non-glaucoma and glaucoma classes are compared using Student's t-test, as shown in Table 7, to investigate whether each top-ranked feature differs significantly between the two classes.
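The two-sample t statistic can be sketched as follows (pure Python, pooled-variance form; we assume this matches the study's Student's t-test, and p-values would then come from the t distribution with len(a) + len(b) - 2 degrees of freedom):

```python
import math

def two_sample_t(a, b):
    """Student's two-sample t statistic with pooled variance, as
    used to compare a feature's means between two groups."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled * (1 / na + 1 / nb))

print(two_sample_t([10, 11, 12], [1, 2, 3]))  # large |t|: means differ
```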
As listed in Table 7, the averages of LT, Spherical Equivalent, AST, Systolic BP, AL, Age, VCDR, and NVD differ significantly between the glaucoma and non-glaucoma groups, indicating that these variables can distinguish the two classes well in our study. However, the averages of WTW, PCY, BMI, and Diastolic BP do not differ significantly between the two groups.

Conclusions
Early identification of persons at high risk of glaucoma can help begin the necessary treatment and disease monitoring early and prevent progression to the acute form of the disease. In this study, a novel stacking ensemble classifier composed of several machine learning classifiers is proposed, designed, and used for glaucoma prediction on the Shahroud Eye Cohort dataset. The input variables and predictors for glaucoma prediction in this study are demographic characteristics, optometry, biometry, and perimetry features, and ophthalmologic examination results.
Previous studies used three different data types to develop glaucoma prediction and diagnosis models: fundus images, genome data, and structured data. Those studies achieved high performance measures on fundus images or on combinations of different data types, as shown in Table 8. This study used extensive ophthalmologic examinations such as biometry and perimetry, together with some clinical data, to develop the predictive glaucoma model without fundus images or genome data. The model developed in this study has lower performance measures than those of other studies; still, it has lower complexity and cost and can be used as the basis of a decision support system in clinics to diagnose and screen glaucoma.
The top-ranked features for predicting glaucoma identified using DT and RF are listed in Table 4. These top features come from different eye examinations, such as perimetry and biometry, or are demographic features that can be measured in any clinic. As discussed in the introduction, some of these top features, such as age, BMI, blood pressure, and axial length, are among the main risk factors of glaucoma identified in many studies. Others, such as NVD and VCDR, are used by physicians to identify glaucoma directly. In a previous study [5], top-ranked features for glaucoma prediction were also determined using simple and multivariate logistic regression.
This study aims to discriminate between glaucoma patients and non-glaucoma persons; the models proposed and designed here are unable to diagnose the type of glaucoma. Discriminating, predicting, and diagnosing the different types of glaucoma using machine learning models is a future research direction, in which the top-ranked features and risk factors for each type of glaucoma could also be identified.
The first phase of the Shahroud Eye Cohort Study was used to predict glaucoma with extensive ophthalmologic examinations and demographic data, without any fundus images, achieving an average accuracy of 83.56. The Shahroud Eye Cohort Study was conducted in two further phases at five-year intervals. In future work, the glaucoma condition of participants in the second phase could be used as the label for the first-phase data.
This study has some main differences compared to the previous related works. Different ophthalmological features are used as the input variables of our models, such as optometric examination results, biometric and perimetric features, and ophthalmologic examinations. Moreover, the top-ranked features include variables describing ophthalmologic examination results. Using longitudinal data collected over five years allows us to assess future trends and changes in glaucoma among the people contributing to the cohort study.