Influenza, dengue and common cold detection using LSTM with fully connected neural network and keywords selection
BioData Mining volume 15, Article number: 5 (2022)
Abstract
Symptom-based machine learning models for disease detection can reduce doctors' workload when patient numbers are high. Many recent studies apply machine learning or deep learning to disease detection or clinical department classification, using the text of patients' symptoms and their vital signs. In this study, we used a Long Short-term Memory (LSTM) network combined with a fully connected neural network for classification. The LSTM received the patient's symptom text as input, and the fully connected neural network received the other patient data: body temperature, age, gender, and the month in which the patient received care. We also improved the data preprocessing algorithm by using keyword selection to reduce the complexity of the input data and help prevent overfitting. The results showed that the LSTM with a fully connected neural network performed better than the LSTM alone, and that the keyword selection method also increased model performance.
Introduction
Symptom-based machine learning models help patients self-detect diseases via electronic devices such as smart phones or robots in hospitals with automated question and answer systems [7]. Recently, several studies improved the text classification model for clinical department classification [27] and disease detection [12]. These studies used text from symptoms and other features of patients for disease detection [17].
Dengue fever (a mosquito-borne viral disease) [18] and influenza are dangerous infectious diseases that many people contract. Their symptoms resemble those of the common cold, but they can be fatal. An estimated 3 to 5 million people become seriously ill from influenza each year [21].
Research on machine learning or deep learning for dengue and influenza falls into two groups. The first improves prediction models for forecasting the number of patients [25] or forecasting an outbreak [8] in particular areas or countries such as China [26], India [16], and Thailand [22]. The second improves machine learning or deep learning models for detecting dengue fever and influenza from patients' vital signs [6] and symptoms [1].
The Long Short-term Memory (LSTM) model is a recurrent neural network model. It is commonly used in text classification [13], time series classification [11], and time series forecasting [25].
In this research, we use the LSTM model to classify the text of patients' symptoms. The LSTM model is concatenated with a fully connected neural network so that patient vital signs and other features, including gender, body temperature, and age, can be used as input data to increase the performance of the classification model. Moreover, we improve our data preprocessing method by removing words that are not important for classification, which simplifies the input data.
Theoretical foundations
In this section, we describe all of the methods we used for modeling in this research.
Mutual information metric
The mutual information metric (MI) is a value that indicates how useful each keyword is for classification. We use MI to measure the correlation between each keyword and each class. It is denoted by MI(w, c), where w is a word and c is a class, and is calculated by Eq. (1).
Here fA is the number of documents in class c that contain word w, fB is the number of documents not in class c that contain word w, fC is the number of documents not in class c that do not contain word w, and N is the number of all documents. MI(w, c) takes values in the range [−log(N), log(N)], as shown in (2) and (3).
The MI of each word can be measured by finding the MI between the word and the class with the highest MI value. It is shown in Eq. (4) where d is the number of classes.
The MI is largest when fA = 1, fB = 0, and fC = 0. Words with a frequency of 1 therefore appear important for classification.
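Eq. (1) is not reproduced in this text, so the following is only a minimal sketch under an assumption: it uses the widely used document-frequency form MI(w, c) = log(fA · N / ((fA + fC)(fA + fB))), which agrees with the stated maximum of log(N) at fA = 1, fB = 0, fC = 0. The function names `mutual_information` and `word_mi` are hypothetical.

```python
import math

def mutual_information(f_a, f_b, f_c, n):
    """MI(w, c) under the ASSUMED form log(f_a * n / ((f_a + f_c) * (f_a + f_b)))."""
    if f_a == 0:
        # The word never occurs in class c, so the logarithm diverges.
        return float("-inf")
    return math.log(f_a * n / ((f_a + f_c) * (f_a + f_b)))

def word_mi(counts_per_class, n):
    """MI of a word: the largest MI(w, c) over all classes, as described for Eq. (4).

    counts_per_class holds one (f_a, f_b, f_c) tuple per class.
    """
    return max(mutual_information(a, b, c, n) for a, b, c in counts_per_class)

# A word seen once in the whole corpus reaches the upper bound log(N).
rare_word_mi = mutual_information(1, 0, 0, 100)
```

Under this assumed form, a word that occurs only once scores as high as any word can, which is why frequency has to be checked separately from MI.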
Word embedding
Word embedding is a method for representing each word as a vector of real numbers. Word2vec [15] is a word embedding method in which words with similar meanings have neighboring vectors. We can set the dimension of the word vectors when we train the word2vec model. If we use a pre-trained word2vec model, we can use principal component analysis (PCA) to reduce the word vectors to the dimension that we want.
Interpolation
Interpolation is a method for estimating missing data points using a polynomial or other function fitted to the known points [2]. An example of calculating a missing point of the equation y = sin(x) is shown in Fig. 1.
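As a small illustration of the idea, the sketch below estimates a missing point of y = sin(x) from its two neighbors. It assumes linear interpolation, the simplest case; the sample positions 1.0, 1.5, and 2.0 are chosen for illustration and are not taken from Fig. 1.

```python
import math

def linear_interpolate(x0, y0, x1, y1, x):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Known samples of y = sin(x) at x = 1.0 and x = 2.0; the point at x = 1.5 is "missing".
estimate = linear_interpolate(1.0, math.sin(1.0), 2.0, math.sin(2.0), 1.5)

# Compare with the true value to see the interpolation error.
error = abs(estimate - math.sin(1.5))
```

A higher-order polynomial through more neighbors would follow the curve more closely, at the cost of more computation.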
LSTM
The Long Short-term Memory neural network (LSTM) [9] is a recurrent neural network (RNN) architecture. The input data for each record of an LSTM model is a sequence of vectors. The structure of an LSTM is shown in Fig. 2, where Xt is the vector of input data at time step t.
The LSTM model is used for classification or prediction of sequential input data. The LSTM has since been improved and applied in several ways to time series prediction and text classification, such as LSTM fully convolutional networks for time series classification [11], bidirectional LSTM for sentiment analysis [13], and medical text classification [7].
Imbalanced data problem
The imbalanced data problem arises in classification when the number of records in each class is vastly different [19]. In binary classification, the class with more records is called the majority class and the other class is called the minority class.
There are two popular methods for solving the imbalanced data problem:
1) Using undersampling or oversampling of the training data so that each class has the same number of records.
2) Using a loss function for the machine learning or deep learning model that increases the weight of the minority class.
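The first option can be sketched as random oversampling, shown here only to illustrate the idea; this study follows the second option instead. The helper name `random_oversample` and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def random_oversample(records, labels, seed=0):
    """Duplicate randomly chosen minority records until every class is as large as the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for record, label in zip(records, labels):
        by_class[label].append(record)

    target = max(len(items) for items in by_class.values())
    out_records, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies with replacement to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for record in items + extra:
            out_records.append(record)
            out_labels.append(label)
    return out_records, out_labels

balanced_records, balanced_labels = random_oversample(["a", "b", "c", "d"], [0, 0, 0, 1])
```

Oversampling should be applied to the training data only, so that duplicated records do not leak into validation or testing data.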
In this research, we use the cost-entropy loss function [24] in Eq. (6) as the loss function of the LSTM model to address the imbalanced data problem. It is an improvement on the loss in Eq. (5), where tk = [tk(1), tk(2), …, tk(d)] is the target output vector of the kth record of the dataset, tk(i) ∈ {0, 1} for i = 1, 2, …, d, yk = [yk(1), yk(2), …, yk(d)] is the model output vector for the kth record, and yk(i) ∈ (0, 1) for i = 1, 2, …, d. Moreover, nk is the number of training records in the class of the kth record, and γ ∈ [0, 1] is a constant.
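Eqs. (5) and (6) are not reproduced in this text, so the sketch below rests on two assumptions: that the base loss is the categorical cross-entropy between tk and yk, and that the weighted variant scales each record's loss by 1 / nk^γ. This weighting is a guess that fits the symbols defined above (γ = 0 leaves the loss unweighted, γ = 1 weights by inverse class frequency); it may differ from the exact form used in [24].

```python
import math

def cross_entropy(t_k, y_k):
    """Categorical cross-entropy for one record (ASSUMED form of the base loss)."""
    return -sum(t * math.log(y) for t, y in zip(t_k, y_k))

def weighted_cross_entropy(t_k, y_k, n_k, gamma):
    """Base loss scaled by 1 / n_k**gamma (ASSUMED weighting, not confirmed by the text).

    n_k   : number of training records in the class of this record
    gamma : constant in [0, 1]; 0 disables the weighting
    """
    return cross_entropy(t_k, y_k) / (n_k ** gamma)

target = [0, 1, 0]
output = [0.2, 0.5, 0.3]
loss_majority = weighted_cross_entropy(target, output, n_k=1000, gamma=0.5)
loss_minority = weighted_cross_entropy(target, output, n_k=10, gamma=0.5)
```

With the same prediction error, the record from the smaller class contributes a larger loss, which pushes training toward the minority class.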
Material and methods
Data description
The data used in this research are medical records from Saraphi Hospital, Chiang Mai Province, Thailand, collected between 2015 and 2020 [3,4,5]. We use only the records of patients diagnosed with one of three diseases: the common cold, flu, or dengue. All the attributes used in this research are listed in Table 1.
The distribution (average and standard deviation) of some features and the number of records for each class are shown in Table 2.
A statistical hypothesis test (t-test) showed that:
1) Average age: the mean age of common cold patients was greater than the mean age of dengue and flu patients (p-value < 0.05), but the mean ages of dengue and flu patients did not differ significantly (p-value > 0.05).
2) Average body temperature: the mean body temperature of common cold patients was lower than that of dengue patients (p-value < 0.01), and the mean body temperature of dengue patients was lower than that of flu patients (p-value < 0.01).
Data preprocessing
In this research, the features used for classification are CHIEFCOMP, GENDER, MONTH_SERV, BTEMP, and AGE. Examples of the data are shown in Table 3. For the numerical features (BTEMP and AGE), we use min-max normalization to scale the values to the range [0,1]. For MONTH_SERV, we use a one-hot encoder to convert each value to a vector of integers. The CHIEFCOMP column contains a sentence in the Thai language. We use the Python library “pythainlp” [20] for word tokenization. For example, the sentence “เป็นหวัดมีน้ำมูกไอ” (English: “Having a cold with a runny nose and cough”) is tokenized into the list of words [“เป็น”, “หวัด”, “มี”, “น้ำมูก”, “ไอ”]. Then the Python library “Gensim” [14] is used to create a word2vec model that converts the text of each record into a matrix of real numbers.
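The two numerical preprocessing steps can be sketched in plain Python as follows. The temperature values are illustrative rather than taken from the dataset, and the sketch assumes that MONTH_SERV stores the month as a number from 1 to 12; the helper names are hypothetical.

```python
def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot_month(month):
    """Encode a month number (ASSUMED to be 1-12) as a 12-element 0/1 vector."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    return [1 if i == month - 1 else 0 for i in range(12)]

btemp = [36.5, 38.0, 39.5]           # illustrative body temperatures in degrees Celsius
scaled = min_max_normalize(btemp)    # [0.0, 0.5, 1.0]
march = one_hot_month(3)             # 1 at index 2, 0 elsewhere
```

The minimum and maximum should be taken from the training data and reused for validation and testing data, so that all three are scaled consistently.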
Keywords selection
During text preprocessing for LSTM training, we removed words that were not important for classification to simplify the input data, including:
1. Low MI: words with a low mutual information metric (bottom 5%).
2. Low frequency: words with low frequency (frequency < 2). Such a word has a high MI, that is, it appears highly useful for classification, but it may be a typographical error.
These words are defined as stop words, and all stop words are removed from the data. Next, we set the positions of the removed words to missing values. It is shown in Fig. 3.
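The selection and masking steps described above can be sketched as follows. The per-word MI and frequency tables are assumed to be precomputed dictionaries, the bottom 5% is taken by truncating to a whole number of words (an assumption about tie handling), and removed positions are marked with `None`. All names are hypothetical.

```python
def select_stop_words(mi_by_word, freq_by_word, mi_fraction=0.05, min_freq=2):
    """Return words in the lowest mi_fraction by MI, plus words with frequency < min_freq."""
    ranked = sorted(mi_by_word, key=mi_by_word.get)   # ascending MI
    n_low = int(len(ranked) * mi_fraction)            # ASSUMED: truncate to whole words
    low_mi = set(ranked[:n_low])
    low_freq = {w for w, f in freq_by_word.items() if f < min_freq}
    return low_mi | low_freq

def mask_stop_words(tokens, stop_words):
    """Keep sentence length; replace each stop word with None (a missing value)."""
    return [None if t in stop_words else t for t in tokens]

# Toy vocabulary of 20 words: MI grows with the index, and "w19" occurs only once.
mi_by_word = {f"w{i}": float(i) for i in range(20)}
freq_by_word = {f"w{i}": 5 for i in range(20)}
freq_by_word["w19"] = 1

stop_words = select_stop_words(mi_by_word, freq_by_word)
masked = mask_stop_words(["w0", "w5", "w19"], stop_words)
```

In this toy example "w0" is removed for low MI and "w19" for low frequency, even though "w19" has the highest MI.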
We use three methods to solve the missing values problem:
1. Cut the stop words: remove the vectors of all stop words from the sentence.
2. Fill with mean: fill each missing vector with the mean of the word2vec vectors of all words in the sentence, at the corresponding position.
3. Interpolation: fill each missing vector by interpolation, using the corresponding position in the vectors.
We show the example of filling missing values for 2 dimensional word2vec vectors in Fig. 4.
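The three strategies can be sketched on a sentence stored as a list of word vectors, with `None` at each removed position. Several details are assumptions: the mean is taken over the remaining words only, interpolation is linear between the nearest known neighbors, and a gap at the start or end of the sentence copies the nearest known vector.

```python
def cut(vectors):
    """Strategy 1: drop the missing positions, shortening the sequence."""
    return [v for v in vectors if v is not None]

def fill_with_mean(vectors):
    """Strategy 2: replace each missing vector with the per-dimension mean of the known ones."""
    known = cut(vectors)
    if not known:
        raise ValueError("sentence has no known vectors")
    dim = len(known[0])
    mean = [sum(v[j] for v in known) / len(known) for j in range(dim)]
    return [mean if v is None else v for v in vectors]

def fill_with_interpolation(vectors):
    """Strategy 3: linear interpolation between the nearest known neighbors (ASSUMED linear)."""
    known_idx = [i for i, v in enumerate(vectors) if v is not None]
    if not known_idx:
        raise ValueError("sentence has no known vectors")
    filled = []
    for i, v in enumerate(vectors):
        if v is not None:
            filled.append(v)
            continue
        left = max((k for k in known_idx if k < i), default=None)
        right = min((k for k in known_idx if k > i), default=None)
        if left is None:
            filled.append(vectors[right])      # gap at the start: copy nearest
        elif right is None:
            filled.append(vectors[left])       # gap at the end: copy nearest
        else:
            w = (i - left) / (right - left)
            filled.append([a + w * (b - a) for a, b in zip(vectors[left], vectors[right])])
    return filled

sentence = [[0.0, 2.0], None, [2.0, 4.0]]   # 2-dimensional toy vectors
```

Cutting changes the sequence length, while the two filling strategies keep every word at its original position.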
LSTM with fully connected neural network model
To train the models, we divide the data into three datasets: training data, validation data, and testing data. First, we use all of the words in the CHIEFCOMP column of the dataset to train the word2vec model. Then we divide the dataset into two parts: 80% training and validation data, and 20% testing data.
In the next step, we compute the MI of all words in the training and validation dataset, then remove the words that have low MI (bottom 5%) and the words with a frequency of less than 2. Next, we solve the missing values problem and use the training and validation dataset to train the LSTM with the fully connected neural network model, dividing this dataset into 80% training and 20% validation data. The conceptual framework of our research is shown in Fig. 5. The softmax function in Eq. (7) is used as the activation function of the last layer of the classification model to compute the probability of each record belonging to each class, where y = [y1, y2, …, yd] is a vector of real numbers.
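An 80/20 split of this kind can be sketched as below. Whether the records were shuffled or stratified by class is not stated, so the random shuffle, the seeds, and the helper name `split` are assumptions.

```python
import random

def split(records, first_fraction=0.8, seed=0):
    """Shuffle a copy of the records and cut it into two parts."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))                 # stand-in for 100 patient records
train_val, test = split(records)           # 80 and 20 records
train, val = split(train_val, seed=1)      # 64 and 16 records
```

Applying the same helper twice yields overall proportions of 64% training, 16% validation, and 20% testing data.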
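The softmax function maps the last layer's raw outputs to class probabilities that sum to 1. The sketch below is a plain-Python version; subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(y):
    """Convert a vector of real numbers into probabilities that sum to 1."""
    m = max(y)                                  # shift for numerical stability
    exps = [math.exp(v - m) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

# Three raw outputs, one per class (values are illustrative).
probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = probabilities.index(max(probabilities))
```

The predicted class is the one with the highest probability, which is also the one with the largest raw output.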
Results and discussion
Performance measurement
Since the dataset in this research is imbalanced, accuracy is not a suitable measure of model performance. For this reason, we use the G-mean (geometric mean of recall) [28] to measure the performance of the models. G-mean is defined in Eq. (9), where d is the number of classes and recall(class ci) is the recall of class ci, defined in Eq. (8).
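The following sketch computes per-class recall and the G-mean as the d-th root of the product of the recalls, and contrasts it with accuracy on a small imbalanced example. The labels are illustrative and the function names are hypothetical.

```python
def recall_per_class(y_true, y_pred, classes):
    """Recall of each class: correctly predicted records divided by all records of that class."""
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total if total else 0.0)
    return recalls

def g_mean(y_true, y_pred, classes):
    """Geometric mean of the per-class recalls."""
    product = 1.0
    for r in recall_per_class(y_true, y_pred, classes):
        product *= r
    return product ** (1.0 / len(classes))

y_true = [0, 0, 0, 0, 1, 1]           # four majority records, two minority records
always_majority = [0, 0, 0, 0, 0, 0]  # a model that ignores the minority class

accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
score = g_mean(y_true, always_majority, classes=[0, 1])
```

A model that always predicts the majority class still reaches an accuracy of about 0.67 here, while its G-mean is 0 because the minority recall is 0.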
Performance of model
The performance of all models is shown in Tables 4 and 5. Label-indicator morpheme growth (MG) [10] is a method that adds weight to the keywords with the highest MI (top 5%). The SMS spam dataset is a basic dataset for text classification [23]. The model used in this research was a single-layer LSTM and a single-hidden-layer neural network (5 hidden nodes), trained with the Adam optimizer in the Python library “keras”.
For the LSTM model, we compare an LSTM with no hidden layer and an LSTM with a single hidden layer (the size of the vector in the hidden layer is 5). In addition, for the single-hidden-layer fully connected neural network model, we tested 5, 10, 15, and 20 hidden nodes. Moreover, for the word2vec model, we tested vector sizes of 20, 25, and 30.
We considered our dataset in three ways. Two are binary classification problems: 1) common cold versus dengue, and 2) common cold versus flu. The third is a multiclass problem (common cold, dengue, and influenza). The SMS spam collection dataset is a standard dataset that we used to test the performance of our method. It consists of two classes: ham and spam messages.
In addition to the LSTM and the LSTM with a fully connected neural network, we also used the LSTM model with numerical features, as shown in Fig. 6, to compare the models' performance.
The results showed that the LSTM with a fully connected neural network performed better than the plain LSTM. Moreover, removing stop words increased the G-mean value on the testing data for all datasets. For the medical records dataset, the LSTM with a fully connected neural network gives the best G-mean value when words with low MI (bottom 5%) and words with low frequency (frequency < 2) are both treated as stop words. Setting the stop words to be the words with low frequency (frequency < 2) reduces the training time (shown in Table 6) and increases the performance of the LSTM model. Moreover, the LSTM with the feed-forward fully connected neural network model takes less time to train than the LSTM model because it converges faster.
Conclusion
This research used the LSTM model with a fully connected neural network for dengue fever and influenza detection. The text of symptoms and other features, including age, body temperature, gender, and month of service, were used as input data. The results showed that the LSTM with the fully connected neural network model performed better than the plain LSTM model. In addition, removing unimportant keywords from the dataset also increased model performance.
Availability of data and materials
Both programming code (python) and data are available upon request (ekkarat.boonchieng@cmu.ac.th).
References
Amin S, Uddin MI, Hassan S, Khan A, Nasser N, Alharbi A, et al. Recurrent neural networks with TF-IDF embedding technique for detection and classification in tweets of dengue disease. IEEE Access. 2020;8:131522–33. https://doi.org/10.1109/ACCESS.2020.3009058.
Atkinson K. INTERPOLATION. 2003. http://homepage.math.uiowa.edu/~atkinson/ftp/ENA_Materials/Overheads/sec_4-1.pdf.
Boonchieng E, Boonchieng W, Senaratana W, Singkaew J. Development of mHealth for public health information collection, with GIS, using private cloud: A case study of Saraphi district, Chiang Mai, Thailand. In: 2014 International Computer Science and Engineering Conference (ICSEC); 2014. p. 350–3. https://doi.org/10.1109/ICSEC.2014.6978221.
Boonchieng W, Boonchieng E, Tuanrat WC, Khuntichot C, Duangchaemkarn K. Integrative system of virtual electronic health record with online community-based health determinant data for home care service: MHealth development and usability test. IEEE Healthc Innov Point Care Technol (HI-POCT). 2017;2017:5–8. https://doi.org/10.1109/HIC.2017.8227571.
Boonchieng W, Chaiwan J, Shrestha B, Shrestha M, Dede AJO, Boonchieng E. mHealth technology translation in a limited resources community—process, challenges, and lessons learned from a limited resources Community of Chiang Mai Province, Thailand. IEEE J Transl Eng Health Med. 2021;9:1–8. https://doi.org/10.1109/JTEHM.2021.3055069.
Briyatis SHU, Premaratne SC, De Silva DGH. A novel method for dengue management based on vital signs and blood profile. Int J Eng Adv Technol. 2019;8(6 special issue 3):154–9. https://doi.org/10.35940/ijeat.F1025.0986S319.
Chen CW, Tseng SP, Kuan TW, Wang JF. Outpatient text classification using attention-based bidirectional LSTM for robot-assisted servicing in hospital. Information (Switzerland). 2020;11(2):106. https://doi.org/10.3390/info11020106.
Fu B, Yang Y, Ma Y, Hao J, Chen S, Liu S, et al. Attention-based recurrent Multi-Channel neural network for influenza epidemic prediction. In: Proceedings - 2018 IEEE international conference on bioinformatics and biomedicine, BIBM 2018; 2018. p. 1245–8. https://doi.org/10.1109/BIBM.2018.8621467.
Gers FA, Schmidhuber J, Cummins F. Learning to forget: continual prediction with LSTM. Neural Comput. 2000;12(10):2451–71. https://doi.org/10.1162/089976600300015015.
Hu Y, Wen G, Ma J, Li D, Wang C, Li H, et al. Label-indicator morpheme growth on LSTM for Chinese healthcare question department classification. J Biomed Inform. 2018;82:154–68. https://doi.org/10.1016/j.jbi.2018.04.011.
Karim F, Majumdar S, Darabi H, Chen S. LSTM fully convolutional networks for time series classification. IEEE Access. 2017;6:1662–9. https://doi.org/10.1109/ACCESS.2017.2779939.
Lee SH, Levin D, Finley PD, Heilig CM. Chief complaint classification with recurrent neural networks. J Biomed Inform. 2019;93:103158. https://doi.org/10.1016/j.jbi.2019.103158.
Long F, Zhou K, Ou W. Sentiment analysis of text based on bidirectional LSTM with multi-head attention. IEEE Access. 2019;7:141960–9. https://doi.org/10.1109/ACCESS.2019.2942614.
Gensim: Topic modeling for humans. 2019. https://radimrehurek.com/gensim/.
Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings; 2013. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85083951332&partnerID=40&md5=20428820e8b09cdfb5078ea812a71f2d.
Murhekar M, Joshua V, Kanagasabai K, Shete V, Ravi M, Ramachandran R, et al. Epidemiology of dengue fever in India, based on laboratory surveillance data, 2014–2017. Int J Infect Dis. 2019;84:S10–4. https://doi.org/10.1016/j.ijid.2019.01.004.
Nadda W, Boonchieng W, Boonchieng E. Dengue fever detection using Long short-term memory neural network. In: 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, ECTI-CON 2020; 2020. p. 755–8. https://doi.org/10.1109/ECTI-CON49241.2020.9158315.
Nadda W, Boonchieng W, Boonchieng E. Weighted extreme learning machine for dengue detection with class-imbalance classification. In: 2019 IEEE Healthcare Innovations and Point of Care Technologies, (HI-POCT); 2019. p. 151–4. https://doi.org/10.1109/HI-POCT45284.2019.8962825.
Petmezas G, Haris K, Stefanopoulos L, Kilintzis V, Tzavelis A, Rogers JA, et al. Automated Atrial Fibrillation Detection using a Hybrid CNN-LSTM Network on Imbalanced ECG Datasets. In: Biomedical Signal Processing and Control; 2021. p. 63. https://doi.org/10.1016/j.bspc.2020.102194.
PyThaiNLP. 2020. https://github.com/PyThaiNLP/pythainlp
Rangarajan P, Mody SK, Marathe M. Forecasting dengue and influenza incidences using a sparse representation of Google trends, electronic health records, and time series data. PLoS Comput Biol. 2019;15(11):e1007518. https://doi.org/10.1371/journal.pcbi.1007518.
Rotejanaprasert C, Ekapirat N, Areechokchai D, Maude RJ. Bayesian spatiotemporal modeling with sliding windows to correct reporting delays for real-time dengue surveillance in Thailand. Int J Health Geogr. 2020;19(1):4. https://doi.org/10.1186/s12942-020-00199-0.
SMS Spam Collection Dataset. 2016. https://www.kaggle.com/uciml/sms-spam-collection-dataset.
Tran D, Mac H, Tong V, Tran HA, Nguyen LG. A LSTM based framework for handling multiclass imbalance in DGA botnet detection. Neurocomputing. 2018;275:2401–13. https://doi.org/10.1016/j.neucom.2017.11.018.
Venna SR, Tavanaei A, Gottumukkala RN, Raghavan VV, Maida AS, Nichols S. A novel data-driven model for real-time influenza forecasting. IEEE Access. 2019;7:7691–701. https://doi.org/10.1109/ACCESS.2018.2888585.
Xiao JP, He JF, Deng AP, Lin HL, Song T, Peng ZQ, et al. Characterizing a large outbreak of dengue fever in Guangdong Province, China. Infect Dis Poverty. 2016;5(1):44. https://doi.org/10.1186/s40249-016-0131-z.
Zhao S, Cai Z, Chen H, Wang Y, Liu F, Liu A. Adversarial training based lattice LSTM for Chinese clinical named entity recognition. J Biomed Inf. 2019;99:103290. https://doi.org/10.1016/j.jbi.2019.103290.
Zong W, Huang GB, Chen Y. Weighted extreme learning machine for imbalance learning. Neurocomputing. 2013;101:229–42. https://doi.org/10.1016/j.neucom.2012.08.010.
Acknowledgments
This work was supported in part by the Center of Excellence in Community Health Informatics and Fundamental Fund 2022, Chiang Mai University; in part by NSRF via the Program Management Unit for Human Resources & Institutional Development, Research and Innovation [Grant Number B05F640183].
Funding
Funding for this work was provided by Center of Excellence in Community Health Informatics and Fundamental Fund 2022, Chiang Mai University; in part by NSRF via the Program Management Unit for Human Resources & Institutional Development, Research and Innovation [Grant Number B05F640183].
Author information
Contributions
WN designed and performed the literature review and led the writing of the paper. WB contributed to the design of the literature review, wrote parts of the paper and gained ethical approval. EB contributed to the design of the literature review, wrote parts of the paper and reviewed the manuscript. All authors reviewed and edited the manuscript and approved the final version of the manuscript.
Ethics declarations
Ethics approval and consent to participate
Ethical approval was obtained from the Faculty of Public Health, Chiang Mai University (Approval number ET036/2564).
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Nadda, W., Boonchieng, W. & Boonchieng, E. Influenza, dengue and common cold detection using LSTM with fully connected neural network and keywords selection. BioData Mining 15, 5 (2022). https://doi.org/10.1186/s13040-022-00288-9
DOI: https://doi.org/10.1186/s13040-022-00288-9