Machine learning approaches to identify systemic lupus erythematosus in anti-nuclear antibody-positive patients using genomic data and electronic health records

Chung, Chih-Wei; Chou, Seng-Cho; Hsiao, Tzu-Hung; Zhang, Grace Joyce; Chung, Yu-Fang; Chen, Yi-Ming

doi:10.1186/s13040-023-00352-y

Research
Open access
Published: 05 January 2024

Machine learning approaches to identify systemic lupus erythematosus in anti-nuclear antibody-positive patients using genomic data and electronic health records

Chih-Wei Chung¹,
Seng-Cho Chou¹,
Tzu-Hung Hsiao^2,3,4,
Grace Joyce Zhang⁵,
Yu-Fang Chung⁶ &
…
Yi-Ming Chen^{2,7,8,9,10,11}

BioData Mining volume 17, Article number: 1 (2024) Cite this article

1620 Accesses
3 Citations
Metrics details

Abstract

Background

Although the 2019 EULAR/ACR classification criteria for systemic lupus erythematosus (SLE) has required at least a positive anti-nuclear antibody (ANA) titer (≥ 1:80), it remains challenging for clinicians to identify patients with SLE. This study aimed to develop a machine learning (ML) approach to assist in the detection of SLE patients using genomic data and electronic health records.

Methods

Participants with a positive ANA (≥ 1:80) were enrolled from the Taiwan Precision Medicine Initiative cohort. The Taiwan Biobank version 2 array was used to detect single nucleotide polymorphism (SNP) data. Six ML models, Logistic Regression, Random Forest (RF), Support Vector Machine, Light Gradient Boosting Machine, Gradient Tree Boosting, and Extreme Gradient Boosting (XGB), were used to identify SLE patients. The importance of the clinical and genetic features was determined by Shapley Additive Explanation (SHAP) values. A logistic regression model was applied to identify genetic variations associated with SLE in the subset of patients with an ANA equal to or exceeding 1:640.

Results

A total of 946 SLE and 1,892 non-SLE controls were included in this analysis. Among the six ML models, RF and XGB demonstrated superior performance in the differentiation of SLE from non-SLE. The leading features in the SHAP diagram were anti-double strand DNA antibodies, ANA titers, AC4 ANA pattern, polygenic risk scores, complement levels, and SNPs. Additionally, in the subgroup with a high ANA titer (≥ 1:640), six SNPs positively associated with SLE and five SNPs negatively correlated with SLE were discovered.

Conclusions

ML approaches offer the potential to assist in diagnosing SLE and uncovering novel SNPs in a group of patients with autoimmunity.

Peer Review reports

Introduction

Systemic Lupus Erythematosus (SLE) is a chronic autoimmune disease characterized by a wide range of clinical manifestations and diverse autoantibody profiles. Diagnosis of SLE is notoriously complex, often requiring a careful clinical evaluation and laborious serological testing [1]. The 2019 European League Against Rheumatism/American College of Rheumatology (EULAR/ACR) classification criteria for SLE stipulated the need for a positive anti-nuclear antibody (ANA) test with a titer equal to or exceeding 1:80 [2]. Despite this advancement, the precise identification of SLE patients continues to pose significant challenges for clinicians, mainly due to the heterogeneity of the disease and its overlap with other autoimmune diseases [3].

Over the past decade, there has been a substantial rise in the adoption of machine learning (ML) techniques in medical diagnosis, as they provide robust tools capable of deciphering intricate patterns and relationships within voluminous datasets [4, 5]. This computational approach has demonstrated significant utility in various clinical domains, from predicting disease outbreaks to personalizing treatment strategies [6, 7].

In the context of SLE, ML applications have shown promise in addressing several critical aspects of the disease, including refining diagnosis, identifying disease flares, predicting patient prognosis, and uncovering genetic and environmental risk factors. For instance, ML models have been utilized to identify potential biomarkers and create prediction models for disease activity, damage accrual, organ-specific involvement in SLE, and therapeutic response [8, 9]. Another study utilized ML algorithms to develop a risk probability index for SLE using clinical and demographic data [10]. Given the complexity and heterogeneity inherent in SLE diagnosis, applying ML techniques can significantly improve disease identification by utilizing both genomic data and electronic health records (EHRs). Consequently, these advancements can potentially revolutionize SLE management and research, contributing to improved patient outcomes [11].

Previously, we constructed an ML model to predict genomic susceptibility to SLE and rheumatoid arthritis (RA) [12]. That study also led to the discovery of genetic variations at the human leukocyte antigen (HLA) region crucial for differentiating RA from SLE. However, in clinical practice, rheumatologists generally diagnose SLE through the combination of an ANA test result and clinical feature assessment. Currently, no studies have reported the integration of ML models into the clinical diagnosis workflow of SLE. Therefore, this study aimed to fill this gap and explore the potential role of ML models in streamlining and improving the diagnosis of SLE.

This study aimed to employ six machine learning models—Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), Light Gradient Boosting Machine (LGBM), Gradient Tree Boosting (GTB), and Extreme Gradient Boosting (XGB)—to improve the identification of SLE patients using genomic data and EHRs from the Taiwan Precision Medicine Initiative (TPMI) cohort.

Materials and methods

Study population & study design

This study followed a retrospective case–control design utilizing data from the TPMI. The TPMI assembled EHRs and collected specimens from participants at the Taichung Veterans General Hospital (TCVGH), Taiwan, from June 2019 to June 2020, as previously described [12]. The analysis included participants who tested positive for ANA with a titer equal to or exceeding 1:80. Cases consisted of 946 patients diagnosed with SLE based on the 2012 Systemic Lupus International Collaborating Clinics classification criteria for SLE [13]. The control group, at a 1:2 ratio, was comprised of TPMI participants who tested positive for ANA but were not diagnosed with SLE. The study protocol received approval from the Ethics Committee of TCVGH (SF19153A), and all participants provided written informed consent.

Genotyping

At TCVGH, we extracted DNA by automated platforms. Genotyping for each participant was conducted using the Taiwan Biobank version 2 (TWBv2) array, provided by Thermo Fisher Scientific, Inc. (Santa Clara, CA, USA). This array is specifically tailored for Genome-Wide Association Studies (GWAS), targeting known risk alleles. It encompasses a comprehensive set of 714,431 single nucleotide polymorphisms (SNPs), as delineated by Wei et al. in the prior research [14]. For optimal accuracy and to counteract potential batch inconsistencies, genotype calls were centrally processed by Academia Sinica in batches of 3,000 samples. In both cases and controls, the integrity of each SNP genotyping was assessed by evaluating the overall call rate (indicative of the successful call rate) and the minor allele frequency (MAF). Samples with call rates exceeding 95% were included in subsequent analyses. SNPs were excluded if they met any of the following conditions: If only one allele was present in both cases and controls, if the total call rate was below 95% or if the total MAF was less than 0.01, or if there was a significant departure from the Hardy–Weinberg equilibrium (P < 1 × 10⁻⁴).

ANA test titers and patterns

ANA tests were conducted using an automated Indirect Immunofluorescence (IIF) NOVA View instrument and NOVA Lite HEp-2 ANA kit (Inova Diagnostics, Inc., San Diego, USA), as detailed by Wu et al. [15]. Titers and patterns of the ANA tests were reported in accordance with the expert-level reporting and interpretation principles stipulated by the International Consensus on ANA Patterns (ICAP) [16].

Data extraction and preprocessing

The clinical parameters included a variety of demographic factors, comorbidities, baseline laboratory profiles, ANA titer and pattern profiles, and medication history. The index date was defined as the date of the initial ANA test with a titer equal to or exceeding 1:80, and the primary outcome was defined as the occurrence of SLE within one year following the index date. Comorbid conditions, such as diabetes mellitus, hypertension, and hyperlipidemia, were ascertained based on the patients' ICD codes registered prior to the index date. Baseline laboratory profiles, encompassing parameters such as anti-dsDNA antibody (Anti-dsDNA ab), C3, C4, white blood cell count (WBC), neutrophils, basophils, monocytes, eosinophils, hemoglobin, erythrocyte count, hematocrit, mean corpuscular volume, mean corpuscular hemoglobin, mean corpuscular hemoglobin concentration, erythrocyte distribution width, platelet count, serum creatinine, and estimated glomerular filtration rate (eGFR) [17], were evaluated within the year preceding and following the index date, and prior to the outcome date. ANA titers were observed at levels of 1:80, 1:160, 1:320, and ≥ 1:640, and ANA pattern profiles, including AC1, AC4, AC5, AC19, and AC24, were identified within six months after the index date and prior to the outcome date. Medication profiles encompassing glucocorticoids, hydroxychloroquine, cyclophosphamide, cyclosporin, mycophenolate mofetil, and azathioprine were ascertained using historical data extracted from EHRs during the six-month windows preceding and following the index date, and prior to the outcome date. Laboratory data features with missing value percentages exceeding 30% were excluded from the analysis. To address the missing values within the clinical features, the missForest imputation method was employed [18]. Continuous features were normalized using the RobustScaler technique, centering them around the median and scaling according to the interquartile range (IQR), thus ensuring that the ML models remained resilient against outliers [19]. For preprocessing the GWAS data, SNP values were encoded as 0, 1, or 2, representing the number of minor alleles under an additive genetic model [20]. Missing SNP values were imputed using the most frequent value within the training set. Additionally, the polygenic risk score (PRS) was computed using the candidate SNP features, serving as an assessment of individual genetic risk for ANA-positive patients who subsequently developed SLE [21]. The PRS for each SNP was calculated as follows:

$${{\text{PRS}}}_{tj}={\beta }_{j}\times {{\text{SNP}}}_{tj}$$

where β_j represents the effect size of the j^th SNP generated from the logistic regression, and SNP_ij is the feature value of the j^th SNP on the i^th patient.

However, the concept of the PRS did not incorporate the significance of the p-values derived from the association tests for pivotal SNPs. To address this limitation, a refined approach involving the aggregation of the adjusted PRS, which is weighted by the p-values obtained from the additive logistic regression, was introduced. This was executed separately for the top 50% of SNPs exhibiting significant positive and negative effects, yielding the following expressions:

$${{\text{PRSw}}}_{i}^{+}=\sum_{q=1}^{p}{\beta }_{q}\times {{\text{SNP}}}_{iq}\times {-{\text{log}}}_{10}\left({p-value}_{q}\right)$$

$${{\text{PRSw}}}_{i}^{-}=\sum_{q=1}^{p}{\beta }_{q}\times {{\text{SNP}}}_{iq}\times {-{\text{log}}}_{10}\left({p-value}_{q}\right)$$

where p and n are the total number of SNPs selected from the top 50% of significant p-values from the positive and negative effects, respectively.

Feature selection

The initial step involved the application of the GWAS methodology to preselect SNPs exhibiting a strong association with ANA-positive patients diagnosed with SLE [20]. To discern the most pertinent SNP attributes, an association test was employed using the univariate logistic regression method [22]. Candidate SNPs were singled out based on a p-value threshold of less than 1 × 10^–3, a measure taken to mitigate the effects of the high-dimensional nature inherent to GWAS analyses. To mitigate concerns related to overfitting, the default RF algorithm was employed to identify the top 5% most impactful SNPs for subsequent utilization as candidate features in the ensuing ML model construction process.

Supervised Machine learning approaches

To forecast the likelihood of patients developing SLE within one year from the index date, six ML algorithms, LR, RF, SVM, LGBM, GTB, and XGB, were employed [11, 23]. The entire dataset (n = 2,838) was randomly divided into a training set (80%) and a testing set (20%), maintaining proportional representation across both sets. To optimize the ML algorithms' performance, hyperparameter optimization was employed. This optimization involved tuning the parameters using five-fold cross-validation and utilizing the GridSearchCV package within the training set. The validation set was used during the model training and optimization phases [24]. While fine-tuning the hyperparameters, the proposed models employ the training set to reach the optimized hyperparameters, without reference to the testing set. The optimized hyperparameters for each ML model are as described as Supplementary Table 1. Addressing the challenge of class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was employed to balance the occurrences of the minority class [25]. Additionally, the TomekLinks method was implemented to regulate unnecessary instances of the majority class in the training set [25]. For feature interpretation and the pursuit of explainable artificial intelligence (XAI), the SHapley Additive exPlanations (SHAP) method was harnessed. This method enabled the identification of features closely associated with ANA-positive patients afflicted by SLE [26]. SHAP summary plots facilitated the visual representation of the relationship between feature values and the probability of the outcome. In order to robustly evaluate the performance of the ML algorithms in the context of binary classification with class imbalance, a set of metrics was employed. These metrics included accuracy, precision, sensitivity (recall), specificity, F1 score, Area Under the Receiver Operating Characteristic curve (AUROC), and Area Under the Precision-Recall curve (AUPRC). These metrics collectively gauged the efficacy of each classifier model [12]. To evaluate the robustness of the proposed algorithms, the statistical technique of bootstrapping-based resampling is employed to reconstruct the training dataset. Subsequently, these reconstructed sets are repeatedly trained by six ML models within a total of 500 iterations proposed in the phase. The average of 500-iteration training and validation is as quantified as the AUROC. The whole procedure consists of the predictive process and ML methodology, as illustrated in Supplementary Fig. 1.

Statistical analysis

Continuous features are summarized as medians and their corresponding IQR, and their distributions were evaluated using a Wilcoxon rank-sum test. Binary features are represented as counts and percentages, and their associations were examined using either a Chi-square test or Fisher’s exact test, as appropriate. To uncover the relationships between ANA-positive patients and specific SNPs, as well as to elucidate the reasons for the absence of SLE development in patients with high-tier ANA (≥ 1:640), an association test was conducted. This test involved logistic regression analysis of SLE disease and SNPs in patients with an ANA titer equal to or exceeding 1:640. Univariate and multivariate logistic regression methods were employed to estimate the crude and adjusted odds ratios (aOR), accompanied by their corresponding 95% confidence intervals (CI). The data preprocessing and statistical analyses were executed using the R programming language (version 4.2.2), while the development of the ML models was carried out using Python (version 3.9.7). All statistical tests adhered to a two-sided configuration, with statistical significance set at p-values less than 0.05.

Results

Selection of candidate SNPs associated with SLE and non-SLE controls

Of the 686,438 imputed SNPs, the GWAS analysis identified a specific subset associated with SLE and non-SLE patients, as represented in the Manhattan plot (Fig. 1). Given the excessively stringent threshold of 1 × 10^–5 (red line), the p-value threshold for selecting candidate genetic variants was adjusted to 1 × 10^–3 (blue line), culminating in the selection of 684 SNPs. These SNPs were subsequently incorporated into the PRS calculation and ML model construction.

Baseline characteristics of the study population

A total of 2,838 adults with available clinical data, SNP information, and ANA titers equal to or exceeding 1:80 was recruited for this study. Among them, 946 patients were diagnosed with SLE within one year following the index date (Table 1). The median age for the non-SLE and SLE patients was 45.3 and 42.6 years, respectively. Comparing the SLE and non-SLE patients, notable distinctions were observed. The SLE group exhibited a significantly lower prevalence of diabetes mellitus (p < 0.001), hypertension (p < 0.001), and hyperlipidemia (p < 0.001). Analysis of laboratory profiles revealed that the SLE patients had elevated levels of anti-dsDNA antibody, monocytes, mean corpuscular hemoglobin concentration, erythrocyte distribution width, and eGFR. Conversely, levels of C3, C4, WBC, basophils, eosinophils, hemoglobin, erythrocyte count, hematocrit, and platelets were notably higher in the non-SLE group. Regarding medication profiles, SLE patients had significantly higher usage rates of glucocorticoids, hydroxychloroquine, and azathioprine. However, the proportion of patients using cyclosporin was significantly lower in the SLE group than in the non-SLE group. Assessing the ANA titer results at the index date. The non-SLE group had a significantly greater proportion of ANA titers at 1:80 (p < 0.001) and at 1:160 and 1:320 (p = 0.002) compared with the SLE group. Nonetheless, the contrast between the non-SLE and SLE groups was evident in patients with ANA titers equal to or exceeding 1:640, with respective counts of 306 (16.2%) and 463 (48.9%). Analyzing the ANA pattern profiles, the non-SLE patients demonstrated higher proportions for AC1, AC4, and AC24, whereas AC5 exhibited an inverse trend. Additionally, when examining the median values of the PRS, PRSw + , and PRSw-, all three were significantly higher in the SLE group compared with the non-SLE group: PRS (8.6, IQR: -7.6 to 22.7), PRSw + (85.3, IQR: 59.9 to 118.5), and PRSw- (-57.2, IQR: -90.2 to -42.5).

Table 1 Baseline demographic and clinical characteristics of the study population

Full size table

Comparison of model performance on the unseen testing set

Table 2 presents the evaluation of the performance of six distinct ML algorithms on the unseen testing set using various metrics. Given the class imbalance within this study, particular attention was directed toward the assessment metrics of F1 score, AUROC, and AUPRC. The computed AUROC values consistently exceeded 0.8 across all six models, indicating favorable discrimination ability. Notably, the XGB and RF methodologies exhibited superior performance in the F1 score and AUPRC metrics. For a comprehensive depiction, the ROC and PR curves are presented in Fig. 2. Impressively, the XGB model had the highest performance, with an AUROC of 0.8748 and an AUPRC of 0.8303. The RF model was also noteworthy and yielded commendable results with an AUROC of 0.8637 and an AUPRC of 0.8124. Both the five-fold cross-validation and bootstrapping validation methods reach similar results, with a 95% CI for the AUROC, as shown in Supplementary Table 2.

Table 2 Model Performance of the proposed ML models on the unseen testing set

Full size table

Interpretation of the risk factors associated with SLE in the ML models

To illuminate the contributions of features to the proposed ML models, the SHAP summary plot was employed, displaying the top 20 risk factors (Fig. 3). An upward trend in the SHAP value of a feature corresponds to an increased likelihood of developing SLE within one year. Notably, the features anti-dsDNA ab, AC4, PRS, and ANA titer 1:80 exhibited similar trends and had the highest importance within the XGB and RF models. Furthermore, the SHAP summary values provide novel insights into the significance of individual SNP features. For instance, ANA-positive patients had a comparatively elevated risk of developing SLE within one year when their SNP features, such as rs9547929, rs2243430, and rs16856933, carried genotypes of 0/1 or 1/1. This information offers valuable understanding regarding the genetic factors associated with the development of SLE in ANA-positive patients.

Association between SLE and SNPs among patients with high titer ANA

Clinically high-titer ANA (≥ 1:640) is associated with a diagnosis of SLE. Further logistic regression analyses were conducted on SNPs that showed positive and negative correlations with SLE in the subgroup of participants with high-titer ANA (Table 3). Six SNPs (rs13029062, rs71909377, rs62175694, rs13268741, rs6971633, and rs16905611) positively correlated with an SLE outcome were identified. In addition, five SNPs (rs6455889, rs9910586, rs712735, rs5020365, and rs4346053) were discovered as protective factors against SLE in the high-titer ANA group.

Table 3 Association between SLE and SNPs among patients with high-titer ANA (≥ 1:640) at index date

Full size table

Prediction of SLE in selected patient populations by ML algorithm

Figure 4 delineates the efficacy of the ML model in stratifying selected patients into those with and without SLE based on ANA titers, ANA patterns, SNPs, PRS, and relevant clinical features. As evident in Fig. 4 (A) and (D), patients with ANA titers of 1:80 and 1:640 manifest predicted probabilities of SLE at 0.009 and 0.977, respectively, when factoring in the cumulative influence of ANA patterns, SNPs, PRS, and clinical features that are either suggestive or contraindicative of SLE diagnosis. Conversely, Fig. 4 (B) showcases the capacity of the ML model to accurately negate the diagnosis of SLE in patients presenting with high-titer ANA (1:640), taking into account corresponding ANA patterns, SNPs, PRSw, and clinical features. Similarly, Fig. 4 (C) depicts a scenario where the ML model correctly identified a patient with a low-titer ANA (1:80) as having SLE, based on ANA patterns, SNPs, PRS, PRSw, and pertinent clinical features.

Discussion

This study is the first to construct ML models for the identification of SLE patients from a cohort with positive ANA tests, incorporating genome-wide SNPs, PRS, and clinical features. Autoantibody profiles, ANA patterns, complement levels, and genetic variations were ascertained as principal contributing factors for SLE. Concurrently, in the subgroup with high-titer ANA, a characteristic indicative of SLE diagnosis, SNPs positively and negatively associated with SLE were discerned. These results shed light on the potential for integrating ML models into the diagnostic workflow for systemic autoimmune disease, fostering a more precise, comprehensive, and data-informed approach to patient diagnosis and care.

ML models have been extensively explored for diagnosing SLE, defining clinical phenotypes, determining outcomes, and informing therapeutic decisions [23]. Previous studies utilizing ML to facilitate SLE diagnosis have employed diverse input data, including EHRs, genetic biomarkers, proteomics, lipidomes, or a combination of these data types [23, 24, 27,28,29,30,31,32]. This study is the first to incorporate genome-wide SNPs, PRS, and EHRs in an ML analysis. Moreover, ML algorithms for diagnostic purposes in previous studies included RF, LASSO, SVM, LR, XGB, and Partial Least Square [23, 24, 27,28,29,30,31,32]. This study is the first to attempt to compare the diagnostic accuracy among six ML models. In line with the classification criteria proposed by EULAR/ACR, this study is novel in integrating the ANA test as the primary criterion for cohort enrollment. This study also included the ANA titer and ANA pattern in the input data for the ML models. These innovative aspects of this research pave the way for potential clinical applications, particularly for rheumatologists encountering patients presenting with autoimmune features and a positive ANA test. We postulate that ANA patterns, PRS, SNPs, and autoantibody profiles could provide additional diagnostic insights for SLE. Future investigations are necessary to validate these findings and further explore the potential of this integrated ML approach in diagnosing and managing SLE.

This study identified six SNPs positively associated with SLE and five SNPs negatively associated with SLE in participants with high-titer ANAs. The VIT gene, which is involved in iron transportation, metabolism, and antioxidant protein catalase activity, was among these [33]. Another gene, SLC7A2, encodes a cationic amino acid transporter and has been reported to be associated with inflammatory responses in asthma [34]. The AUTS2 gene, implicated in the neurodevelopmental process and acute lymphoblastic leukemia, was also noted [35, 36]. Lastly, the PACRG gene, linked with Parkinson’s disease and increased susceptibility to leprosy, was identified [37, 38]. Notably, none of these genes have been previously reported in relation to SLE. This study, therefore, presents a potential approach to discovering novel genetic variants associated with autoimmune diseases. Future research is needed to clarify the mechanistic associations between these genes and autoimmunity, further enhancing the understanding of the genetic underpinnings of such diseases.

To delve into potential risk factors and achieve optimal performance, particularly within the context of an imbalanced dataset, the employment of an advanced ensemble ML framework was a judicious choice. Additionally, XAI was applied to assess SLE disease risk and select pivotal SNP features for predictive modeling. Prior research has demonstrated that incorporating SNP and PRS features can substantially enhance disease prediction accuracy [12]. The notion of PRS, encompassing the cumulative effects of numerous candidate SNP features, offers invaluable insights for detecting complex diseases and identifying high-risk patients [21]. However, traditional computation methods of PRS [21, 39, 40] neglect variations of significance of p-value and directional attributes of SNP features [41, 42]. By integrating potential SNPs, the PRS makes the importance of cumulative effect sizes specific. Based on the magnitude of PRS, such as high or low quartiles, we can establish an association with disease progression [42, 43]. Consequently, two novel features resulted from the adjusted PRS are formulated; by the features, the accuracy of predicts for case group and control group can be greatly enhanced. Figure 3 illustrates the outcomes, revealing that control and case patients can be discerned by the PRSw + and PRSw- features, respectively. Notably, this distinction holds even in the context of enrolling ANA-positive patients, which might lead to a somewhat homogenous population within the study.

Despite being the first study to integrate the ANA test and SLE classification workflow as enrollment criteria, this research has several limitations. First, the study design is retrospective, which inevitably leads to incomplete data in the EHRs. Additionally, structured assessments for autoimmune clinical features were not prospectively collected, which might have affected the richness of the input data. Secondly, the input data for the ML models did not encompass cytokine, transcriptomic, or proteomic datasets, thereby possibly limiting the breadth of these mechanistic interpretations. Lastly, the study cohort was exclusively composed of Taiwanese-ethnic Chinese participants. As such, the results may not be universally applicable, restricting their extrapolation to populations of other ancestries.

In conclusion, this study establishes a novel application of ML models for the diagnosis of SLE using genomic and clinical data. The integration of ANA tests, genomic data of SNPs and PRS, and clinical features offer a promising approach for identifying patients with SLE, thereby potentially improving diagnostic precision. Moreover, this research introduces a possible method for discovering novel genetic variants associated with autoimmune diseases. This study offers an encouraging step toward integrating ML into the diagnostic workflow for systemic autoimmune diseases.

Availability of data and materials

Data is available upon reasonable request to the corresponding author with a statistical analysis plan.

References

Jakes RW, Bae SC, Louthrenoo W, Mok CC, Navarra SV, Kwon N. Systematic review of the epidemiology of systemic lupus erythematosus in the Asia-Pacific region: prevalence, incidence, clinical features, and mortality. Arthritis Care Res (Hoboken). 2012;64:159–68.
Article PubMed Google Scholar
Aringer M, Costenbader K, Daikh D, Brinks R, Mosca M, Ramsey-Goldman R, et al. 2019 European League Against Rheumatism/American College of Rheumatology classification criteria for systemic lupus erythematosus. Arthritis rheumatol. 2019;71:1400–12.
Article PubMed PubMed Central Google Scholar
Andraos R, Ahmad A, Eriksson P, Dahlström Ö, Wirestam L, Dahle C, et al. Autoantibodies associated with systemic sclerosis in three autoimmune diseases imprinted by type I interferon gene dysregulation: a comparison across SLE, primary Sjögren’s syndrome and systemic sclerosis. Lupus Science & Medicine. 2022;9:e000732.
Article Google Scholar
Stafford I, Kellermann M, Mossotto E, Beattie RM, MacArthur BD, Ennis S. A systematic review of the applications of artificial intelligence and machine learning in autoimmune diseases. NPJ digital medicine. 2020;3:30.
Article CAS PubMed PubMed Central Google Scholar
Chicco D, Oneto L. Data analytics and clinical feature ranking of medical records of patients with sepsis. BioData Min. 2021;14:1–22.
Article Google Scholar
Kim J, Ahn I. Infectious disease outbreak prediction using media articles with machine learning models. Sci Rep. 2021;11:4413.
Article CAS PubMed PubMed Central Google Scholar
Wu J, Qin C, Fang G, Shen L, Li M, Lu B, et al. Machine learning approach to screen new diagnostic features of adamantinomatous craniopharyngioma and explore personalised treatment strategies. Translational Pediatrics. 2023;12:947–66.
Article PubMed PubMed Central Google Scholar
Shipa M, Santos LR, Nguyen DX, Embleton-Thirsk A, Parvaz M, Heptinstall LL, et al. Identification of biomarkers to stratify response to B-cell-targeted therapies in systemic lupus erythematosus: an exploratory analysis of a randomised controlled trial. The Lancet Rheumatology. 2023;5:e24–35.
Article CAS PubMed Google Scholar
Choi MY, Chen I, Clarke AE, Fritzler MJ, Buhler KA, Urowitz M, et al. Machine learning identifies clusters of longitudinal autoantibody profiles predictive of systemic lupus erythematosus disease outcomes. Ann Rheum Dis. 2023;82:927–36.
Article CAS PubMed Google Scholar
Adamichou C, Genitsaridi I, Nikolopoulos D, Nikoloudaki M, Repa A, Bortoluzzi A, et al. Lupus or not? SLE Risk Probability Index (SLERPI): a simple, clinician-friendly machine learning-based model to assist the diagnosis of systemic lupus erythematosus. Ann Rheum Dis. 2021;80:758–66.
Article CAS PubMed Google Scholar
Fazzari MJ, Guerra MM, Salmon J, Kim MY. Adverse pregnancy outcomes in women with systemic lupus erythematosus: can we improve predictions with machine learning? Lupus Science & Medicine. 2022;9:e000769.
Article Google Scholar
Chung CW, Hsiao TH, Huang CJ, Chen YJ, Chen HH, Lin CH, et al. Machine learning approaches for the genomic prediction of rheumatoid arthritis and systemic lupus erythematosus. BioData Min. 2021;14:1–13.
Article Google Scholar
Petri M, Orbai AM, Alarcón GS, Gordon C, Merrill JT, Fortin PR, et al. Derivation and validation of the Systemic Lupus International Collaborating Clinics classification criteria for systemic lupus erythematosus. Arthritis Rheum. 2012;64:2677–86.
Article PubMed PubMed Central Google Scholar
Wei CY, Yang JH, Yeh EC, Tsai MF, Kao HJ, Lo CZ, et al. Genetic profiles of 103,106 individuals in the Taiwan Biobank provide insights into the health and history of Han Chinese. NPJ Genom Med. 2021;6:10.
Article CAS PubMed PubMed Central Google Scholar
Wu YD, Sheu RK, Chung CW, Wu YC, Ou CC, Hsiao CW, et al. Application of supervised machine learning to recognize competent level and mixed antinuclear antibody patterns based on ICAP international consensus. Diagnostics. 2021;11:642.
Article CAS PubMed PubMed Central Google Scholar
Chan EK, Damoiseaux J, Carballo OG, Conrad K, de Melo CW, Francescantonio PLC, et al. Report of the first international consensus on standardized nomenclature of antinuclear antibody HEp-2 cell patterns 2014–2015. Front Immunol. 2015;6:412.
Article PubMed PubMed Central Google Scholar
Chen LI, Guh JY, Wu KD, Chen YM, Kuo MC, Hwang SJ, et al. Modification of diet in renal disease (MDRD) study and CKD epidemiology collaboration (CKD-EPI) equations for Taiwanese adults. PLoS ONE. 2014;9:e99645.
Article PubMed PubMed Central Google Scholar
Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O. A survey on missing data in machine learning. Journal of Big Data. 2021;8:1–37.
Article Google Scholar
Raju VG, Lakshmi KP, Jain VM, Kalidindi A, Padma V. Study the influence of normalization/transformation process on the accuracy of supervised classification. 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT). IEEE; 2020. p. 729–35.
Zhou X, Chen Y, Ip FC, Lai NC, Li YY, Jiang Y, et al. Genetic and polygenic risk score analysis for Alzheimer’s disease in the Chinese population. Alzheimer’s Dement. 2020;12:e12074.
Google Scholar
Xu J, Ma J, Zeng Y, Si H, Wu Y, Zhang S, et al. Predicting the status of human complex diseases with random forest and polygenic risk scores. 2022. https://doi.org/10.21203/rs.3.rs-1631486/v1.
Montaez CAC, Fergus P, Montaez AC, Hussain A, Al-Jumeily D, Chalmers C. Deep learning classification of polygenic obesity using genome wide association study SNPs. 2018 International Joint Conference on Neural Networks (IJCNN). IEEE; 2018. p. 1–8.
Ceccarelli F, Natalucci F, Picciariello L, Ciancarella C, Dolcini G, Gattamelata A, et al. Application of Machine Learning Models in Systemic Lupus Erythematosus. Int J Mol Sci. 2023;24:4514.
Article CAS PubMed PubMed Central Google Scholar
Martorell-Marugán J, Chierici M, Jurman G, Alarcón-Riquelme ME, Carmona-Sáez P. Differential diagnosis of systemic lupus erythematosus and Sjögren’s syndrome using machine learning and multi-omics data. Comput Biol Med. 2023;152:106373.
Article PubMed Google Scholar
Ai-jun L, Peng Z. Research on Unbalanced Data Processing Algorithm Base Tomeklinks-Smote. Proceedings of the 2020 3rd International Conference on Artificial Intelligence and Pattern Recognition. 2020. p. 13–7.
Google Scholar
Meena J, Hasija Y. Application of explainable artificial intelligence in the identification of Squamous Cell Carcinoma biomarkers. Comput Biol Med. 2022;146:105505.
Article CAS PubMed Google Scholar
Li Y, Ma C, Liao S, Qi S, Meng S, Cai W, et al. Combined proteomics and single cell RNA-sequencing analysis to identify biomarkers of disease diagnosis and disease exacerbation for systemic lupus erythematosus. Front Immunol. 2022;13:969509.
Article CAS PubMed PubMed Central Google Scholar
Zhong Y, Zhang W, Hong X, Zeng Z, Chen Y, Liao S, et al. Screening biomarkers for systemic lupus erythematosus based on machine learning and exploring their expression correlations with the ratios of various immune cells. Front Immunol. 2022;13:873787.
Article CAS PubMed PubMed Central Google Scholar
Jiang Z, Shao M, Dai X, Pan Z, Liu D. Identification of diagnostic biomarkers in systemic lupus erythematosus based on bioinformatics analysis and machine learning. Front Genet. 2022;13:865559.
Article CAS PubMed PubMed Central Google Scholar
Ma W, Lau YL, Yang W, Wang YF. Random forests algorithm boosts genetic risk prediction of systemic lupus erythematosus. Front Genet. 2022;13:902793.
Article PubMed PubMed Central Google Scholar
Barnado A, Eudy AM, Blaske A, Wheless L, Kirchoff K, Oates JC, et al. Developing and validating methods to assemble systemic lupus erythematosus births in the electronic health record. Arthritis Care Res (Hoboken). 2022;74:849–57.
Article PubMed PubMed Central Google Scholar
Matthiesen R, Lauber C, Sampaio JL, Domingues N, Alves L, Gerl MJ, et al. Shotgun mass spectrometry-based lipid profiling identifies and distinguishes between chronic inflammatory diseases. EBioMedicine. 2021;70:103504.
Aghabi D, Sloan M, Dou Z, Guerra AJ, Harding CR. The vacuolar iron transporter mediates iron detoxification in Toxoplasma gondii. Biorxiv. 2021;2021.09. 08.458725.
Cloots RH, Sankaranarayanan S, Poynter ME, Terwindt E, van Dijk P, Lamers WH, et al. Arginase 1 deletion in myeloid cells affects the inflammatory response in allergic asthma, but not lung mechanics, in female mice. BMC Pulm Med. 2017;17:1–15.
Article Google Scholar
Hori K, Shimaoka K, Hoshino M. AUTS2 gene: keys to understanding the pathogenesis of neurodevelopmental disorders. Cells. 2021;11:11.
Article PubMed PubMed Central Google Scholar
Fazio G, Bresolin S, Silvestri D, Quadri M, Saitta C, Vendramini E, et al. PAX5 fusion genes are frequent in poor risk childhood acute lymphoblastic leukaemia and can be targeted with BIBF1120. EBioMedicine. 2022;83:04224.
Taylor JM, Wu RM, Farrer MJ, Delatycki MB, Lockhart PJ. Analysis of PArkin Co-Regulated Gene in a Taiwanese-Ethnic Chinese cohort with early-onset Parkinson’s disease. Parkinsonism Relat Disord. 2009;15:417–21.
Article PubMed Google Scholar
Alter A, Fava VM, Huong NT, Singh M, Orlova M, Van Thuc N, et al. Linkage disequilibrium pattern and age-at-diagnosis are critical for replicating genetic associations across ethnic groups in leprosy. Hum Genet. 2013;132:107–16.
Article PubMed Google Scholar
Eng Y, Yao X, Liu K, Risacher S, Saykin A, Long Q, et al. Polygenic mediation analysis of Alzheimer’s disease implicated intermediate amyloid imaging phenotypes. AMIA Annual Symposium Proceedings. 2020;422-431.
Behravan H, Hartikainen JM, Tengström M, Pylkäs K, Winqvist R, Kosma VM, et al. Machine learning identifies interacting genetic variants contributing to breast cancer risk: A case study in Finnish cases and controls. Sci Rep. 2018;8:1–13.
Article CAS Google Scholar
Ho DSW, Schierding W, Wake M, Saffery R, O’Sullivan J. Machine learning SNP based prediction for precision medicine. Front Genet. 2019;10:267.
Article CAS PubMed PubMed Central Google Scholar
Reid S, Alexsson A, Frodlund M, Morris D, Sandling JK, Bolin K, et al. High genetic risk score is associated with early disease onset, damage accrual and decreased survival in systemic lupus erythematosus. Ann Rheum Dis. 2020;79:363–9.
Article CAS PubMed Google Scholar
Dehestani M, Liu H, Gasser T. Polygenic Risk scores contribute to personalized medicine of Parkinson’s disease. J Pers Med. 2021;11:1030.
Article PubMed PubMed Central Google Scholar

Download references

Acknowledgements

We thank all the participants and investigators from Taiwan Precision Medicine Initiative.

Institutional review board statement

The study was conducted in accordance with the Declaration of Helsinki, and the study protocol was approved by the Institutional Review Board of the TCVGH (IRB no. SF19153A).

Source code

https://github.com/Chung30916/ML-SLE-ANA/commit/0c21a741cbbaad6c73c30be7ac6a3a514cd79efe

Funding

This study was funded by Academia Sinica 40–05-GMM and AS-GC-110-MD02, National Science and Technology Council, Taiwan [NSTC-112–2410-H002-049, NSTC-111–2634-F-A49-014, NSTC-111–2218-E-039–001, and NSTC-111-2314-B-005-007-MY3], and Taichung Veterans General Hospital, Taiwan [TCVGH-1127301C, TCVGH-1127302D, TCVGH-YM1120110, and TCVGH-1127304B].

Author information

Authors and Affiliations

Department of Information Management, National Taiwan University, Taipei, Taiwan
Chih-Wei Chung & Seng-Cho Chou
Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan
Tzu-Hung Hsiao & Yi-Ming Chen
Department of Public Health, Fu Jen Catholic University, New Taipei City, Taiwan
Tzu-Hung Hsiao
Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
Tzu-Hung Hsiao
Department of Cellular and Physiological Sciences, The University of British Columbia, Vancouver, BC, Canada
Grace Joyce Zhang
Department of Electrical Engineering, Tunghai University, Taichung, Taiwan
Yu-Fang Chung
Division of Allergy, Immunology and Rheumatology, Department of Internal Medicine, Taichung Veterans General Hospital, 1650, Section 4, Taiwan Boulevard, Xitun Dist., Taichung City, 407, Taiwan
Yi-Ming Chen
Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, Taiwan
Yi-Ming Chen
School of Medicine, College of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan
Yi-Ming Chen
Rong Hsing Research Center for Translational Medicine & Ph.D. Program in Translational Medicine, National Chung Hsing University, Taichung, Taiwan
Yi-Ming Chen
Precision Medicine Research Center, College of Medicine, National Chung Hsing University, Taichung, Taiwan
Yi-Ming Chen

Authors

Chih-Wei Chung
View author publications
You can also search for this author in PubMed Google Scholar
Seng-Cho Chou
View author publications
You can also search for this author in PubMed Google Scholar
Tzu-Hung Hsiao
View author publications
You can also search for this author in PubMed Google Scholar
Grace Joyce Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Yu-Fang Chung
View author publications
You can also search for this author in PubMed Google Scholar
Yi-Ming Chen
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

CWC and YMC wrote the manuscript. CWC and YMC analyzed the data. CWC, GJZ, and YMC performed the clinical analysis. CWC, SCC, and YMC designed the study. All authors contributed to, read, and approved the final version of the manuscript.

Corresponding author

Correspondence to Yi-Ming Chen.

Ethics declarations

Consent for publication

Not applicable.

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Additional file 2.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Chung, CW., Chou, SC., Hsiao, TH. et al. Machine learning approaches to identify systemic lupus erythematosus in anti-nuclear antibody-positive patients using genomic data and electronic health records. BioData Mining 17, 1 (2024). https://doi.org/10.1186/s13040-023-00352-y

Download citation

Received: 14 August 2023
Accepted: 19 December 2023
Published: 05 January 2024
DOI: https://doi.org/10.1186/s13040-023-00352-y

Machine learning approaches to identify systemic lupus erythematosus in anti-nuclear antibody-positive patients using genomic data and electronic health records

Abstract

Background

Methods

Results

Conclusions

Introduction

Materials and methods

Study population & study design

Genotyping

ANA test titers and patterns

Data extraction and preprocessing

Feature selection

Supervised Machine learning approaches

Statistical analysis

Results

Selection of candidate SNPs associated with SLE and non-SLE controls

Baseline characteristics of the study population

Comparison of model performance on the unseen testing set

Interpretation of the risk factors associated with SLE in the ML models

Association between SLE and SNPs among patients with high titer ANA

Prediction of SLE in selected patient populations by ML algorithm

Discussion

Availability of data and materials

References

Acknowledgements

Institutional review board statement

Source code

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Consent for publication

Competing Interests

Additional information

Publisher’s Note

Supplementary Information

Additional file 1.

Additional file 2.

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BioData Mining

Contact us