Detecting diseases in medical prescriptions using data mining methods

Nazari Nezhad, Sana; Zahedi, Mohammad H.; Farahani, Elham

doi:10.1186/s13040-022-00314-w

Research
Open access
Published: 24 November 2022

BioData Mining volume 15, Article number: 29 (2022)

Abstract

Every year, the health of millions of people around the world is compromised by misdiagnosis, which can sometimes even lead to death. Misdiagnosis also entails huge financial costs for patients, insurance companies, and governments. Furthermore, many physicians’ professional lives are adversely affected by unintended errors in prescribing medication or diagnosing a disease. Our aim in this paper is to use data mining methods to extract knowledge from a dataset of medical prescriptions that can help improve the diagnostic process. In this study, the disease and its category were predicted using four single classification algorithms: decision tree, random forest, naive Bayes, and K-nearest neighbors. Then, to improve on the performance of these algorithms, we used an ensemble learning methodology to build our proposed model. In the final step, a number of experiments were performed to compare the performance of the different data mining techniques. The final model proposed in this study achieves an accuracy of 62.86% and a kappa score of 0.620 for disease prediction, and an accuracy of 74.39% and a kappa score of 0.720 for prediction of the disease category, which is better than the performance reported by other studies in this field.

In general, the results of this study can be used to help maintain the health of patients, and prevent the wastage of the financial resources of patients, insurance companies, and governments. In addition, it can aid physicians and help their careers by providing timely information on diagnostic errors. Finally, these results can be used as a basis for future research in this field.

Introduction

Studies show that 12 million people worldwide are affected by medical misdiagnosis each year, which means that on average one in 20 patients is misdiagnosed, with 10 to 20% of those in critical condition. An estimated 40,000 to 80,000 people die each year as a result of these misdiagnoses, and women and minorities are typically 20 to 30% more likely to be affected. In general, 44% of cancers are associated with misdiagnosis; prostate, breast, and thyroid cancers have the highest rates of misdiagnosis. After a breast X-ray, 51% of people who sought another doctor’s opinion received a different diagnosis [1].

Studies also show that one-third of the medical errors that lead to death or disability stem from a misdiagnosis or late diagnosis. Misdiagnosis has several consequences, the most important of which are unnecessary treatment, increased costs for the patient and the government, physical and emotional stress, and even death [1].

As mentioned, misdiagnosis leads to high costs. For example, researchers found that diagnostic errors were the leading reason for paid malpractice claims (28.6%) and accounted for the highest proportion of total payments (35.2%). They estimated that the 2011 inflation-adjusted mean and median per-claim payouts for diagnostic errors were $386,849 and $213,250, respectively. Moreover, over a 10-year period, the compensation paid for diagnostic errors amounted to $1.8 billion [1].

Improving the diagnostic process is not only possible but also a moral, professional and public health necessity. Therefore, predicting the disease is very important for reducing costs and time overheads and helping the doctor in making decisions. These are the reasons why prescription data can play a vital role in any community to help promote community health [1].

At the same time, the volume of data is increasing day by day, so the need to make sense of rich datasets has grown in all fields, including technology, business, and especially medicine. The vast amount of data generated in the medical industry about patients, hospital resources, disease diagnosis, electronic health records, medical equipment, and the like is a huge resource that needs to be processed and analyzed in order to save money and to assist physicians in making their decisions [2, 3]. To this end, data mining in the healthcare industry provides a set of tools and methods that can be applied to data to discover hidden patterns in it. Data mining techniques can generally be divided into descriptive and predictive categories. Descriptive methods include clustering and association rules, and predictive methods include classification and forecasting [3, 4].

Our goal in this study is to use data mining methods to find knowledge in a dataset of medical prescriptions provided by the www.Drugs.com site. By analyzing the prescription drugs for each disease, our proposed method aims to predict the category of each disease and the type of disease that the patient suffers from. Different classification methods have been used to predict diseases based on prescription drugs. Experiments show that the results of the predictions are acceptable. The remainder of this paper is organized as follows: Section 2 deals with the background. The proposed method is explained in Section 3. Section 4 presents the results and the discussions. Section 5 concludes the paper. Finally, Section 6 presents the declarations.

Background

Problem statement

Misdiagnosis is costly every year for patients, physicians, insurance companies, and governments. A significant percentage of people around the world incur exorbitant costs because they are prescribed an inappropriate drug, which can, in turn, result from a misdiagnosis of their disease. These costs include financial expenses and adverse effects on their health, which in many cases lead to new diseases or even the death of the patient. The medical community is not immune to losses resulting from misdiagnoses either. A doctor may mistakenly prescribe medication or misdiagnose a disease. This can lead to the disability or even death of a patient and can negatively affect the progress of the doctor’s career. Following a misdiagnosis, insurance companies also incur financial losses by paying the relevant penalty. The fourth entity affected by misdiagnosis is the government, which usually spends huge sums of money annually on importing medicines or allocating capital to drug companies for manufacturing drugs. Especially in recent years, many governments have faced considerable problems due to a shortage of a particular drug at some point in time. Such a shortage can lead to a substantial increase in the price of the drug, which in turn can result in many patients not being adequately treated or even dying. Conversely, through unnecessary import or excess production of some drugs, substantial financial resources may be wasted, because the excess drugs have a fixed expiry date and cannot be used thereafter [1, 5, 6, 7, 8].

Therefore, solutions that help detect drug errors in a timely manner can not only save many lives but also significantly reduce the cost to patients. They can also be of great help to a large percentage of physicians, who will be able to correct their errors in time. In addition, they can reduce the cost to insurance companies of compensating for misdiagnosis errors, and they can aid governments’ budgets in the long run. Moreover, reliable statistics collected over a specific period (for example, 10 years) make it possible to determine the amount and type of medication prescribed by doctors for different patients [1].

Hence, predicting the disease is not only important for reducing costs and time overheads and helping the doctors in making decisions, but can also help the governments in numerous fields [1].

Literature review

In recent years, many studies on the prediction of various diseases, their treatments, and drug discovery have been performed around the world. Different data mining techniques have been used for disease detection and different results have been obtained. The following is a description of these studies for several diseases such as heart disease, diabetes, cancer, etc.

Heart disease has become one of the most common diseases in humans, so predicting and diagnosing cardiovascular diseases at an early stage is necessary in order to reduce mortality from them. In recent years, many studies have been conducted in this field, including:

Kondababu et al. (2021) predicted heart disease using machine learning algorithms. They discussed many existing methods, among which the proposed HRFLM technique, which combines the characteristics of random forest (RF) and a linear method (LM), proved highly accurate, reaching an accuracy of 88.7% [9]. Jeyaranjani et al. (2021) developed a decision support system based on a supervised learning model for deciding the status of coronary heart disease angiography. Their results show that the ANN model predicts disease stages with 97% accuracy, and the decision support system helps in early detection [10]. Jothi et al. (2021) proposed a model for predicting heart disease using the decision tree algorithm. In their study, the decision tree algorithm predicted the patient’s risk of heart disease with an accuracy of 81% [11]. Pavithra and Jayalakshmi (2021) proposed a new feature selection technique, HRFLC (random forest + AdaBoost + Pearson coefficient), which helps to predict diseases efficiently and improves prediction accuracy [12]. Ramesh et al. (2021) proposed a feature selection algorithm, known as Information Gain-based Feature Selection (IGFS), that is intended to enhance the performance of ML approaches. In their study, the SVM and RF algorithms showed the highest performance, with an accuracy of 88% [13]. Maini et al. (2021) proposed a machine learning-based heart disease prediction system for the Indian population. Their system works well for the early detection of cardiovascular disease and can be accessed via the Internet. The best-performing algorithm, RF, achieved an accuracy, sensitivity, and specificity of 93.8%, 92.8%, and 94.6%, respectively [14]. Kumar and Sahoo (2015) proposed a new algorithm that combines naive Bayes and genetic algorithms to improve the classification of heart disease. In this algorithm, the classifier learns to categorize heart disease datasets into sick or healthy categories. Experimental results obtained from six datasets show that the proposed approach is an effective classification method. Their predictive model assists physicians in efficiently diagnosing heart disease with fewer features [15].
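To make the idea behind information-gain-based feature selection more concrete, the following is a minimal sketch of how information gain can be computed for discrete features and used to rank them. It is not the IGFS algorithm of Ramesh et al., whose details are not given here; the feature names and the toy data are hypothetical and serve only as an illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: 1 = disease present, 0 = disease absent.
labels = [1, 1, 0, 0, 1, 0]
features = {
    "chest_pain": [1, 1, 0, 0, 1, 0],  # perfectly aligned with the label
    "smoker": [1, 0, 1, 0, 1, 0],      # only weakly related to the label
}

gains = {name: information_gain(values, labels) for name, values in features.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked)
```

A feature selection step would then keep only the top-ranked features before training a classifier; how many to keep is a tuning decision that this sketch does not address.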

Diabetes is another major medical problem that causes many deaths in the world every year, which is why many studies have been done to predict it, including:

Jain et al. (2021) predicted diabetes using artificial intelligence algorithms on the Pima Indians Diabetes dataset. In their study, the neural network algorithm achieved the best performance, with 87.88% accuracy, which can be useful for physicians in treating this disease in its early stages [16]. Kumari et al. (2021) proposed a soft voting classifier model that combines three algorithms (random forest, logistic regression, and naive Bayes) to predict diabetic patients. They applied their proposed model to the Pima Indians Diabetes Database and the Breast Cancer Database. Their model has an accuracy of 79.08% on the diabetes dataset and 97.02% on the breast cancer dataset [17]. Khaleel and Al-Bakry (2021) proposed a model that can predict whether a person has diabetes. The results show that the proposed logistic regression, with 94% accuracy, was more effective in predicting diabetes than other algorithms [18].
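Soft voting, as used in the model of Kumari et al., can be illustrated with a short sketch. In soft voting, each classifier outputs class probabilities, the probabilities are averaged across classifiers, and the class with the highest average is chosen. The probabilities and class names below are hypothetical, and the sketch assumes equal weights for all classifiers.

```python
def soft_vote(probabilities_per_model):
    """Average per-class probabilities across models; return (label, averages)."""
    classes = probabilities_per_model[0].keys()
    n_models = len(probabilities_per_model)
    averaged = {
        c: sum(p[c] for p in probabilities_per_model) / n_models for c in classes
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical probability outputs of three classifiers for one patient.
model_outputs = [
    {"diabetic": 0.60, "healthy": 0.40},  # e.g. random forest
    {"diabetic": 0.45, "healthy": 0.55},  # e.g. logistic regression
    {"diabetic": 0.70, "healthy": 0.30},  # e.g. a Bayes classifier
]

label, averaged = soft_vote(model_outputs)
print(label, averaged)
```

Note that the second classifier alone would have predicted "healthy"; averaging lets the more confident classifiers outweigh it.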

Even though different data mining classification algorithms exist for predicting heart disease, there is not enough data to predict heart disease in diabetic patients. Arumugam et al. (2021) tuned the decision tree model for optimal performance in predicting the chance of heart disease in diabetic patients, because it consistently outperformed the support vector machine and naive Bayes models [19].

In today’s world, cancer has become one of the leading causes of death and breast cancer is one of the main causes of death among women worldwide. Therefore, a great deal of research has been conducted in this field, including:

Because early detection of and intervention for lymphedema are essential for improving the quality of life of breast cancer survivors, Wei et al. (2021) conducted their study with the aim of developing a symptom warning model for early detection of breast cancer-related lymphedema. Their proposed logistic regression model showed the best performance, with AUC = 0.889 (0.840–0.938), sensitivity = 0.771, specificity = 0.883, accuracy = 0.825, and Brier score = 0.141, and its calibration was acceptable [20]. Dhanya et al. (2020) used existing ensemble techniques along with a combination of supervised machine learning algorithms to develop a new model for predicting breast cancer. Because not all features are necessary to predict breast cancer, and feature selection helps to build an efficient model in such scenarios, they applied feature selection techniques. According to their results, the proposed stacking ensemble method combined with F-test feature selection is an effective and reliable method for predicting breast cancer [21]. Onan (2015) developed a method for creating a cancer diagnosis system that combines fuzzy-rough nearest neighbor classification, consistency-based subset evaluation, and a fuzzy-rough instance selection technique. This method uses feature selection to improve comprehensibility, shorten training time, and improve the generalization of the model. The evaluation results show that the proposed method has 99.71% accuracy and can be used as a reliable tool for automatic diagnosis of breast cancer [22].
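The evaluation measures quoted above (sensitivity, specificity, accuracy, and Brier score) can all be computed from true labels and predicted probabilities. The sketch below shows the standard definitions for a binary problem. The data are hypothetical and unrelated to the study of Wei et al., and the 0.5 decision threshold is an assumption.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, accuracy, and Brier score for binary labels.

    Assumes y_true contains at least one positive (1) and one negative (0).
    """
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    n = len(y_true)
    return {
        "sensitivity": tp / (tp + fn),  # share of true positives detected
        "specificity": tn / (tn + fp),  # share of true negatives detected
        "accuracy": (tp + tn) / n,
        # Brier score: mean squared error of the predicted probabilities.
        "brier": sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / n,
    }


# Hypothetical labels (1 = condition present) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.3, 0.7]

metrics = binary_metrics(y_true, y_prob)
print(metrics)
```

Unlike the other three measures, the Brier score does not depend on the threshold, and lower values indicate better-calibrated probabilities.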

In modern times, obesity has become a major threat to health worldwide. Obesity can lead to the development of complex diseases such as stroke, heart disease and liver cancer. Ferdowsy et al. (2021) predicted the risk of obesity using machine learning algorithms. The results show that their proposed logistic regression algorithm has a good performance with 97.09% accuracy [23].

Chronic kidney disease (CKD) is a condition characterized by the gradual loss of kidney function over time. It is usually asymptomatic in its early stages, and early detection is important to reduce future risks. Pinto et al. (2020) used the CRISP-DM method to construct a system that predicts chronic kidney disease conditions. The obtained results show that their proposed J48 algorithm achieved the most suitable result, namely 97.66% accuracy, 96.13% sensitivity, 98.78% specificity and 98.31% precision [24].

Despite long-term efforts to control and prevent medical errors and increase patient safety, medical errors are still one of the leading causes of death in the world, the costs of which attract the attention of policymakers, health care planners and researchers.

Ahsani-Estahbanati et al. (2021) estimated the incidence rate of medical errors both in Iran and worldwide, identified factors that affect incidence rates, estimated the economic burden of medical errors, and outlined international and national interventions that can be made to reduce medical errors. Finally, to draw policymakers’ attention to this critical issue, their study provides a policy brief on strategies for dealing with medical errors and reducing the associated costs [25].

Today, early diagnosis is a necessity. Malladi et al. (2021) predicted disease through machine learning based on symptoms. According to the results, the CNN algorithm was 84.5% more reliable than the KNN algorithm for predicting a general disease [26].

Dehkordi et al. (2019) predicted the type of physician (public or private) to which each patient was referred and the type of disease from which the patient was suffering. The dataset used in their study contains 600 records covering 70 different types of diseases and 386 different types of drugs. They used a stacking method to improve the prediction model. The results showed an accuracy of 73.17% for predicting the type of physician and 57% for predicting the type of disease [27].

Given that data on the prevalence of communicable and non-communicable diseases, one of the most important categories of epidemiological data, are used to interpret the health status of communities, Teimouri et al. (2016) calculated the prevalence of outpatient diseases through the characterization of outpatient prescriptions. Among the classification techniques used in their study, the support vector machine showed the best performance, with 95.32% accuracy. In the next stage, combining methods were used to improve the results of the individual data mining algorithms. Among these combining methods, the weighted voting algorithm, with an accuracy of 97.16%, had the best performance [28].
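As an illustration of weighted voting in general, the sketch below lets each classifier cast a vote for its predicted class, with the vote scaled by a per-classifier weight. The exact weighting scheme used by Teimouri et al. is not described here, so the weights (imagined as validation accuracies) and the disease labels are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight among the predictions."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Hypothetical predictions of three classifiers for one prescription.
predictions = ["influenza", "common cold", "common cold"]
# Hypothetical weights, e.g. each classifier's accuracy on validation data.
weights = [0.95, 0.40, 0.45]

winner = weighted_vote(predictions, weights)
majority = weighted_vote(predictions, [1.0, 1.0, 1.0])
print(winner, majority)
```

With these weights, the single strong classifier (0.95) outweighs the two weaker ones (0.85 combined), whereas an unweighted majority vote would pick the other class.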

Trasierras et al. (2022) presented an approach based on emerging pattern mining to analyze cancer through genomic data. Their proposed model includes four different procedures that are specifically designed to deal with RNA-Seq data on cancer. Unlike existing approaches, which are mainly focused on predictive purposes, their proposal aims to improve the understanding of cancer descriptively, not requiring either any prior knowledge or hypothesis to be validated [29].

Frias et al. (2021) improved the prediction of hepatitis C virus outcome using a data mining approach. Their data mining approach identified genetic patterns that escaped detection using conventional statistics. More specifically, the partial decision trees and ensemble models increased the classification accuracy of hepatitis C virus outcome compared with conventional methods [30].

Table 1 compares the above studies.

Table 1 Analysis of data mining methods for the above studies
