
Influenza, dengue and common cold detection using LSTM with fully connected neural network and keywords selection

Abstract

Symptom-based machine learning models for disease detection can reduce doctors' workload when patient numbers are high. Many recent studies apply machine learning or deep learning to disease detection or clinical department classification, using the text of patients’ symptoms together with vital signs. In this study, we used a Long Short-Term Memory (LSTM) network combined with a fully connected neural network for classification. The LSTM received the text of the patient’s symptoms as input, while the fully connected network received the other patient data: body temperature, age, gender, and the month in which the patient received care. We also improved the data preprocessing algorithm with keyword selection, which reduces the complexity of the input data to help prevent overfitting. The results showed that the LSTM with a fully connected neural network performed better than the LSTM alone, and that keyword selection further increased model performance.


Introduction

Symptom-based machine learning models help patients self-detect diseases via electronic devices such as smartphones, or via hospital robots with automated question-and-answer systems [7]. Several recent studies have improved text classification models for clinical department classification [27] and disease detection [12]. These studies used symptom text and other patient features for disease detection [17].

Dengue fever (a mosquito-borne viral disease) [18] and influenza are dangerous infectious diseases that many people contract. Their symptoms resemble those of the common cold, but both can be fatal. An estimated 3 to 5 million people become seriously ill from influenza each year [21].

Research on machine learning or deep learning for dengue and influenza falls into two groups. The first improves prediction models for forecasting the number of patients [25] or an outbreak [8] in particular areas or countries, such as China [26], India [16], and Thailand [22]. The second improves machine learning or deep learning models for detecting dengue fever and influenza from patients' vital signs [6] and symptoms [1].

The Long Short-term Memory (LSTM) model is a recurrent neural network model. It is commonly used in text classification [13], time series classification [11], and time series forecasting [25].

In this research, we use the LSTM model to classify the text of patients' symptoms. The LSTM is concatenated with a fully connected neural network so that vital signs and other patient features, including gender, body temperature, and age, can also be used as input to increase classification performance. Moreover, we improve the data preprocessing by removing words that are not important for classification, which simplifies the input data.

Theoretical foundations

In this section, we describe all of the methods we used for modeling in this research.

Mutual information metric

The mutual information metric (MI) measures how well a keyword discriminates between classes. We use MI to measure the association between each keyword and each class. It is denoted by MI(w, c), where w is a word and c is a class, and is calculated by Eq. (1).

$$ \mathrm{MI}\left(w,c\right)=\log \frac{f_A\bullet N}{\left({f}_A+{f}_C\right)\left({f}_A+{f}_B\right)} $$
(1)

Here fA is the number of documents in class c that contain word w, fB is the number of documents not in class c that contain word w, fC is the number of documents in class c that do not contain word w, and N is the total number of documents. Thus fA + fC is the size of class c and fA + fB is the number of documents that contain w. For fA ≥ 1, MI(w, c) lies in the range [ − log(N), log(N)], as shown in (2) and (3).

$$ \mathrm{MI}\left(w,c\right)=\log \frac{f_A\bullet N}{\left({f}_A+{f}_C\right)\left({f}_A+{f}_B\right)}\le \log \frac{N}{\left({f}_A+{f}_B\right)}\le \log (N) $$
(2)
$$ \mathrm{MI}\left(w,c\right)=\log \frac{f_A\bullet N}{\left({f}_A+{f}_C\right)\left({f}_A+{f}_B\right)}\ge \log \frac{f_A}{\left({f}_A+{f}_C\right)}\ge \log \frac{1}{N}=-\log (N) $$
(3)

The MI of a word is its MI with the class for which that value is highest, as shown in Eq. (4), where d is the number of classes.

$$ \mathrm{MI}(w)=\underset{i=1:d}{\max}\mathrm{MI}\left(w,{c}_i\right) $$
(4)

The MI reaches its maximum, log(N), when fA = 1, fB = 0, and fC = 0. Words that occur in only one document can therefore receive a very high MI and appear important for classification.
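As a rough illustration of Eqs. (1) and (4), the following sketch computes MI(w, c) and MI(w) from tokenized documents. It assumes that fA + fC is the number of documents in class c and that fA + fB is the number of documents containing w; the function name, the toy documents, and the labels are hypothetical and are not taken from the study's code or data.

```python
import math
from collections import Counter

def mutual_information(docs, labels):
    """Return ({word: {class: MI(w, c)}}, {word: MI(w)}) for tokenized documents."""
    n_docs = len(docs)
    class_size = Counter(labels)              # f_A + f_C for each class
    doc_freq = Counter()                      # f_A + f_B for each word
    joint = Counter()                         # f_A for each (word, class) pair
    for tokens, label in zip(docs, labels):
        for word in set(tokens):              # count documents, not occurrences
            doc_freq[word] += 1
            joint[(word, label)] += 1

    mi_by_class = {}
    for (word, label), f_a in joint.items():
        value = math.log(f_a * n_docs / (class_size[label] * doc_freq[word]))
        mi_by_class.setdefault(word, {})[label] = value
    # Eq. (4): keep the best class; pairs with f_A = 0 are skipped (log undefined).
    mi = {word: max(scores.values()) for word, scores in mi_by_class.items()}
    return mi_by_class, mi

# Toy example (hypothetical data)
docs = [["fever", "cough"], ["fever", "rash"], ["cough", "runny"], ["fever", "headache"]]
labels = ["flu", "dengue", "cold", "dengue"]
mi_by_class, mi = mutual_information(docs, labels)
```

In this toy example, "runny" appears in a single document that is the only member of its class, so it attains the maximum value log(N), which is the situation described above for words that occur only once.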

Word embedding

Word embedding represents each word as a vector of real numbers. Word2vec [15] is a word embedding method in which words with similar meanings are mapped to nearby vectors. The dimension of the word vectors is set when the word2vec model is trained. With a pre-trained word2vec model, principal component analysis (PCA) can be used to reduce the word vectors to the desired dimension.

Interpolation

Interpolation estimates missing data points by fitting a polynomial or another function to the known points [2]. An example of estimating missing points of y = sin(x) is shown in Fig. 1.

Fig. 1 Data interpolation with linear and cubic functions
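The idea can be sketched in a few lines of Python. The example below treats the point at x = 2 of y = sin(x) as missing and estimates it with a straight line and with a cubic polynomial through four known points. The Lagrange form is used here only as a simple stand-in for a cubic interpolant; the sample positions and function names are assumptions made for illustration, not the setup behind Fig. 1.

```python
import math

def linear_interpolate(x0, y0, x1, y1, x):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def lagrange_interpolate(points, x):
    """Value at x of the polynomial passing through all (xi, yi) in points."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Known samples of y = sin(x); the point at x = 2 is treated as missing.
known = [(x, math.sin(x)) for x in (0.0, 1.0, 3.0, 4.0)]
linear = linear_interpolate(*known[1], *known[2], 2.0)   # uses the two neighbours
cubic = lagrange_interpolate(known, 2.0)                 # uses all four points
print(f"true={math.sin(2.0):.4f} linear={linear:.4f} cubic={cubic:.4f}")
```

With these sample points the cubic estimate lies closer to the true value than the linear one, because the line cannot follow the curvature of the sine between x = 1 and x = 3.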

LSTM

The Long Short-Term Memory neural network (LSTM) [9] is a recurrent neural network (RNN) architecture. Each input record of an LSTM model is a sequence of vectors. The structure of the LSTM is shown in Fig. 2, where Xt is the input vector at time step t.

Fig. 2 LSTM model structure

The LSTM model is used for classification or prediction on sequential input data. It has since been improved and applied in several ways to time series prediction and text classification, for example LSTM fully convolutional networks for time series classification [11], bidirectional LSTM for sentiment analysis [13], and medical text classification [7].
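For reference, the update of an LSTM cell with a forget gate [9] is commonly written as follows. The weight matrices W and U, the biases b, the hidden state h_t, and the cell state c_t are notation assumed here for illustration and are not defined in the original text; σ is the logistic sigmoid and ⊙ is element-wise multiplication.

$$ \begin{aligned} f_t&=\sigma \left({W}_f{X}_t+{U}_f{h}_{t-1}+{b}_f\right)\\ i_t&=\sigma \left({W}_i{X}_t+{U}_i{h}_{t-1}+{b}_i\right)\\ o_t&=\sigma \left({W}_o{X}_t+{U}_o{h}_{t-1}+{b}_o\right)\\ {\tilde{c}}_t&=\tanh \left({W}_c{X}_t+{U}_c{h}_{t-1}+{b}_c\right)\\ c_t&={f}_t\odot {c}_{t-1}+{i}_t\odot {\tilde{c}}_t\\ h_t&={o}_t\odot \tanh \left({c}_t\right) \end{aligned} $$

When the LSTM is used for classification, the hidden state after the last time step is typically passed to the following layer.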

Imbalanced data problem

The imbalanced data problem arises in classification when the number of records differs greatly between classes [19]. In binary classification, the class with more records is called the majority class and the other the minority class.

There are two popular methods for solving the imbalanced data problem:

  1. Using undersampling or oversampling so that the training data contain the same number of records in each class.

  2. Using a loss function for the machine learning or deep learning model that increases the weight of the minority class.

In this research, we use a cost-sensitive cross-entropy loss function [24], given in Eq. (6), as the loss function of the LSTM model to address the imbalanced data problem. It extends the standard cross-entropy loss in Eq. (5), where n is the number of records in the dataset, d is the number of classes, tk = [tk(1), tk(2), …, tk(d)] is the target output vector of the kth record with tk(i) ∈ {0, 1} for i = 1, 2, …, d, and yk = [yk(1), yk(2), …, yk(d)] is the model output vector for the kth record with yk(i) ∈ (0, 1) for i = 1, 2, …, d. In Eq. (6), nk is the number of training records in the class of the kth record and γ ∈ [0, 1] is a constant.

$$ E=-\sum \limits_{k=1}^n\sum \limits_{i=1}^d{t}_k(i)\log {y}_k(i) $$
(5)
$$ E=-\sum \limits_{k=1}^n\sum \limits_{i=1}^d{t}_k(i)\log {y}_k(i){\left(\frac{1}{n_k}\right)}^{\gamma } $$
(6)
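A minimal sketch of the weighted loss, assuming one-hot target vectors and a list of per-class training counts, may help make the effect of γ concrete. The function name and the toy values are hypothetical; setting γ = 0 recovers the unweighted loss of Eq. (5).

```python
import math

def cost_sensitive_cross_entropy(targets, outputs, class_counts, gamma):
    """Cross-entropy with each record weighted by (1 / n_k) ** gamma, as in Eq. (6)."""
    loss = 0.0
    for t_k, y_k in zip(targets, outputs):
        true_class = t_k.index(1)                    # position of the 1 in the one-hot target
        weight = (1.0 / class_counts[true_class]) ** gamma
        loss -= weight * sum(t * math.log(y) for t, y in zip(t_k, y_k))
    return loss

# Toy batch (hypothetical): three majority-class records and one minority-class record.
targets = [[1, 0], [1, 0], [1, 0], [0, 1]]
outputs = [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1], [0.4, 0.6]]
class_counts = [3, 1]                                # n_k for each class in the training data

plain = cost_sensitive_cross_entropy(targets, outputs, class_counts, gamma=0.0)
weighted = cost_sensitive_cross_entropy(targets, outputs, class_counts, gamma=1.0)
```

With γ = 1 each majority-class record contributes one third of its unweighted loss, while the single minority-class record keeps its full weight, so errors on the minority class dominate the total.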

Material and methods

Data description

The data used in this research come from medical records of Saraphi Hospital, Chiang Mai Province, Thailand, between 2015 and 2020 [3,4,5]. We use only the records of patients diagnosed with one of three diseases: the common cold, flu, or dengue. All attributes used in this research are listed in Table 1.

Table 1 The attributes used in this research

The distribution (average and standard deviation) of some features and the number of records for each class are shown in Table 2.

Table 2 The average, standard deviation, and number of patients for some features

From the statistical hypothesis test (t-test), it was found that:

  1. Average age: the mean age of common cold patients was greater than the mean age of dengue and flu patients (p-value < 0.05), but the mean ages of dengue and flu patients did not differ significantly (p-value > 0.05).

  2. Average body temperature: the mean body temperature of common cold patients was lower than that of dengue patients (p-value < 0.01), and the mean body temperature of dengue patients was lower than that of flu patients (p-value < 0.01).

Data preprocessing

In this research, the features used for classification are CHIEFCOMP, GENDER, MONTH_SERV, BTEMP, and AGE. Examples of the data are shown in Table 3. For the numerical features (BTEMP and AGE), we use min-max normalization to scale the values into the range [0,1]. For MONTH_SERV, we use one-hot encoding to convert each value to a binary vector. The CHIEFCOMP column contains a sentence in the Thai language. We use the Python library “pythainlp” [20] for word tokenization. For example, the sentence “เป็นหวัดมีน้ำมูกไอ” (English: “Having a cold with a runny nose and cough”) is tokenized into the list of words [“เป็น”, “หวัด”, “มี”, “น้ำมูก”, “ไอ”]. The Python library “Gensim” [14] is then used to train a word2vec model that converts the text of each record into a matrix of real numbers.
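The two numerical preprocessing steps can be sketched in plain Python as follows. This is only an illustration: the function names and the sample temperatures are hypothetical, and MONTH_SERV is assumed to hold month numbers from 1 to 12.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly into [0, 1]."""
    low, high = min(values), max(values)
    if high == low:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def one_hot_month(month):
    """Encode a month number (1-12) as a 12-element 0/1 vector."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return [1 if m == month else 0 for m in range(1, 13)]

btemp = [36.5, 38.2, 39.0, 37.1]         # hypothetical body temperatures in Celsius
btemp_scaled = min_max_normalize(btemp)
june = one_hot_month(6)
```

In practice the minimum and maximum would normally be taken from the training data only and then reused for the validation and testing data, so that no information from the held-out records leaks into preprocessing.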

Table 3 Examples of data in our dataset

Keywords selection

During text preprocessing for LSTM training, we removed words that were not important for classification in order to simplify the input data. These words are:

  1. Low MI: words with a low mutual information metric (bottom 5%).

  2. Low frequency: words with a frequency below 2. Such words have a high MI, and hence appear highly discriminative, but they may be typographical errors.

These words are defined as stop words, and all stop words are removed from the data. The positions of the removed words are then treated as missing values, as shown in Fig. 3.

Fig. 3 Vectors of words in a sentence after the removal of 2 stop words
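The selection of stop words can be sketched as below. How the "bottom 5%" is taken is not spelled out in the text; this sketch assumes it means the 5% of vocabulary words with the lowest MI after ranking. The function name, parameter names, and toy scores are hypothetical.

```python
def select_stop_words(mi, frequency, mi_fraction=0.05, min_frequency=2):
    """Return words to remove: rare words plus the lowest-MI share of the vocabulary."""
    rare = {w for w, f in frequency.items() if f < min_frequency}
    ranked = sorted(mi, key=lambda w: mi[w])          # ascending MI
    n_low = int(len(ranked) * mi_fraction)            # size of the bottom share (assumed rule)
    low_mi = set(ranked[:n_low])
    return rare | low_mi

# Hypothetical scores for a 20-word vocabulary: word i has MI = i / 10.
mi = {f"w{i}": i / 10 for i in range(20)}
frequency = {f"w{i}": 5 for i in range(20)}
frequency["w19"] = 1                                   # a high-MI word seen only once
stop_words = select_stop_words(mi, frequency)
```

Here the word with the lowest MI and the word seen only once are both selected, even though the latter has the highest MI in the vocabulary, which mirrors the reasoning given above for removing rare words.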

We use three methods to solve the missing values problem:

  1. Cut the stop words: delete the vectors of all stop words from the sentence.

  2. Fill with mean: fill each missing vector with the mean of the word2vec vectors of the words in the sentence, computed separately for each vector component.

  3. Interpolation: fill each missing vector by interpolation, computed separately for each vector component.

We show an example of filling missing values for 2-dimensional word2vec vectors in Fig. 4.

Fig. 4 Solving the missing values problem
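The three options can be sketched on a toy sentence of 2-dimensional vectors, with None marking the position of a removed stop word. This is an interpretation under stated assumptions rather than the study's implementation: the mean is taken over the remaining words of the sentence, interpolation is linear over word position, gaps at the start or end copy the nearest known vector, and at least one word is assumed to remain in each sentence.

```python
def cut_stop_words(vectors):
    """Method 1: drop the removed positions, shortening the sequence."""
    return [v for v in vectors if v is not None]

def fill_with_mean(vectors):
    """Method 2: replace each gap with the per-component mean of the kept vectors."""
    kept = cut_stop_words(vectors)
    mean = [sum(column) / len(kept) for column in zip(*kept)]
    return [v if v is not None else list(mean) for v in vectors]

def fill_by_interpolation(vectors):
    """Method 3: linear interpolation over word position, per component."""
    known = [i for i, v in enumerate(vectors) if v is not None]
    filled = []
    for i, v in enumerate(vectors):
        if v is not None:
            filled.append(v)
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None or right is None:     # gap at an edge: copy nearest known vector
            filled.append(list(vectors[right if left is None else left]))
            continue
        w = (i - left) / (right - left)
        filled.append([a + w * (b - a) for a, b in zip(vectors[left], vectors[right])])
    return filled

# Five-word sentence with two removed stop words (hypothetical vectors).
sentence = [[0.0, 2.0], None, [4.0, 0.0], None, [2.0, 4.0]]
shortened = cut_stop_words(sentence)
mean_filled = fill_with_mean(sentence)
interpolated = fill_by_interpolation(sentence)
```

The first method changes the sequence length, whereas the other two keep every word at its original position, which may matter for a model that reads the sentence in order.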

LSTM with fully connected neural network model

To train the models, we divide the data into three datasets: training, validation, and testing data. First, we use all of the words in the CHIEFCOMP column of the dataset to train the word2vec model. We then divide the dataset into two parts: 80% for training and validation, and 20% for testing.

Next, we compute the MI of all words in the training and validation dataset, then remove the words with low MI (bottom 5%) and the words with a frequency below 2. We then solve the missing values problem and use the training and validation dataset to train the LSTM with the fully connected neural network model, dividing it into 80% training and 20% validation data. The conceptual framework of our research is shown in Fig. 5. The softmax function in Eq. (7) is used as the activation function of the last layer of the classification model to compute the probability of each class for each record, where y = [y1, y2, …, yd ] is a vector of real numbers.

$$ \mathrm{softmax}\left({y}_i\right)=\frac{\exp \left({y}_i\right)}{\sum \limits_{j=1}^d\exp \left({y}_j\right)} $$
(7)
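As a small illustration, the standard softmax, exp(y_i) divided by the sum of exp(y_j) over all classes, can be written as below. Subtracting the maximum before exponentiating is a common numerical safeguard added here; it does not change the result and is not mentioned in the original text.

```python
import math

def softmax(y):
    """Softmax of a list of real numbers, shifted by the maximum for stability."""
    shift = max(y)
    exps = [math.exp(v - shift) for v in y]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])     # hypothetical outputs of the last layer
```

The outputs are positive and sum to 1, so they can be read as class probabilities, and the largest input always receives the largest probability.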
Fig. 5 Research conceptual framework
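The two-stage split described in this section can be sketched as follows. The text does not say whether the split is random or stratified; this sketch assumes a simple random shuffle, and the seed, the helper name, and the stand-in records are hypothetical.

```python
import random

def split(records, first_fraction, rng):
    """Shuffle a copy of records and split it into two lists."""
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

rng = random.Random(0)                       # fixed seed, chosen arbitrarily
records = list(range(1000))                  # stand-in for patient records
train_val, test = split(records, 0.8, rng)   # 80% training + validation, 20% testing
train, val = split(train_val, 0.8, rng)      # 80% training, 20% validation
```

Under this scheme 64% of all records are used for training, 16% for validation, and 20% for testing. Keyword selection would then be computed on the training and validation portion only, as described above.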

Results and discussion

Performance measurement

Since the dataset in this research is imbalanced, accuracy is not a suitable measure of model performance. For this reason, we use the G-mean (geometric mean of recall) [28] to measure the performance of the models. G-mean is defined in Eq. (9), where d is the number of classes and recall(class ci) is the recall of class ci, defined in Eq. (8).

$$ \mathrm{recall}\left(\mathrm{class}\ {c}_i\right)=\frac{\mathrm{number}\ \mathrm{of}\ \mathrm{records}\ \mathrm{in}\ \mathrm{class}\ {c}_i\ \mathrm{that}\ \mathrm{are}\ \mathrm{correctly}\ \mathrm{classified}}{\mathrm{number}\ \mathrm{of}\ \mathrm{all}\ \mathrm{records}\ \mathrm{in}\ \mathrm{class}\ {c}_i} $$
(8)
$$ \mathrm{G}\text{-}\mathrm{mean}=\sqrt[d]{\prod \limits_{i=1}^d\mathrm{recall}\left(\mathrm{class}\ {c}_i\right)} $$
(9)
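A short sketch of Eqs. (8) and (9) shows why G-mean is preferred over accuracy on imbalanced data. The function name and the toy labels are hypothetical.

```python
def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall, following Eqs. (8) and (9)."""
    classes = sorted(set(y_true))
    product = 1.0
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        product *= correct / total               # recall of class c
    return product ** (1.0 / len(classes))

# A classifier that always predicts the majority class (hypothetical data).
y_true = ["cold"] * 8 + ["flu"] * 2
y_pred = ["cold"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = g_mean(y_true, y_pred)
```

In this example the accuracy is 0.8 although no minority-class record is detected, while the G-mean is 0 because the recall of the minority class is 0.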

Performance of model

The performance of all models is shown in Tables 4 and 5. Label-indicator morpheme growth (MG) [10] is a method that adds weight to the keywords with the highest MI (top 5%). The SMS Spam Collection dataset [23] is a standard dataset for text classification. The model used in this research was a single-layer LSTM with a single-hidden-layer neural network (5 hidden nodes), trained with the Adam optimizer in the Python library “keras”.

Table 4 Performance of models (AUC, area under the ROC curve)

Table 5 Performance of models (G-mean)

For the LSTM model, we compare an LSTM with no hidden layer and an LSTM with a single hidden layer (hidden vector size 5). For the single-hidden-layer fully connected neural network, we tested 5, 10, 15, and 20 hidden nodes. For the word2vec model, we tested vector sizes of 20, 25, and 30.

We considered our dataset in three ways. Two are binary classification problems: 1) the common cold and dengue classes, and 2) the common cold and flu classes. The third is a multi-class problem (the common cold, dengue, and influenza classes). The SMS Spam Collection dataset, a standard dataset that we use to test the performance of our method, consists of two classes: ham and spam messages.

In addition to the LSTM and the LSTM with a fully connected neural network, we also used an LSTM model with numerical features, as shown in Fig. 6, to compare model performance.

Fig. 6 LSTM model structure

The results showed that the LSTM with a fully connected neural network performed better than the plain LSTM. Moreover, removing stop words increased the G-mean on the testing data for all datasets. For the medical records dataset, the LSTM with a fully connected neural network gave the best G-mean when words with low MI (bottom 5%) and words with low frequency (frequency < 2) were both treated as stop words. Treating only the low-frequency words (frequency < 2) as stop words reduced the training time (shown in Table 6) and increased the performance of the LSTM model. Moreover, the LSTM with the fully connected neural network needed less training time than the LSTM model because it converged faster.

Table 6 Time for data preprocessing, model training, and testing of each model (seconds). Run on the data science server at Chiang Mai University, Thailand (Linux VPS, 16 GB RAM, Intel Core i9 CPU, 2080 Ti GPU with 11 GB)

Conclusion

This research used an LSTM model with a fully connected neural network for dengue fever and influenza detection. The text of symptoms and other features, including age, body temperature, gender, and month of service, were used as input data. The results showed that the LSTM with the fully connected neural network performed better than the plain LSTM model. In addition, removing unimportant keywords from the dataset also increased performance.

Availability of data and materials

Both programming code (python) and data are available upon request (ekkarat.boonchieng@cmu.ac.th).

References

  1. Amin S, Uddin MI, Hassan S, Khan A, Nasser N, Alharbi A, et al. Recurrent neural networks with TF-IDF embedding technique for detection and classification in tweets of dengue disease. IEEE Access. 2020;8:131522–33. https://doi.org/10.1109/ACCESS.2020.3009058e.


  2. Atkinson K. INTERPOLATION. 2003. http://homepage.math.uiowa.edu/~atkinson/ftp/ENA_Materials/Overheads/sec_4-1.pdf.

  3. Boonchieng E, Boonchieng W, Senaratana W, Singkaew J. Development of mHealth for public health information collection, with GIS, using private cloud: A case study of Saraphi district, Chiang Mai, Thailand. In: 2014 International Computer Science and Engineering Conference (ICSEC); 2014. p. 350–3. https://doi.org/10.1109/ICSEC.2014.6978221.


  4. Boonchieng W, Boonchieng E, Tuanrat WC, Khuntichot C, Duangchaemkarn K. Integrative system of virtual electronic health record with online community-based health determinant data for home care service: MHealth development and usability test. IEEE Healthc Innov Point Care Technol (HI-POCT). 2017;2017:5–8. https://doi.org/10.1109/HIC.2017.8227571.


  5. Boonchieng W, Chaiwan J, Shrestha B, Shrestha M, Dede AJO, Boonchieng E. mHealth technology translation in a limited resources community—process, challenges, and lessons learned from a limited resources Community of Chiang Mai Province, Thailand. IEEE J Transl Eng Health Med. 2021;9:1–8. https://doi.org/10.1109/JTEHM.2021.3055069.


  6. Briyatis SHU, Premaratne SC, De Silva DGH. A novel method for dengue management based on vital signs and blood profile. Int J Eng Adv Technol. 2019;8(6 special issue 3):154–9. https://doi.org/10.35940/ijeat.F1025.0986S319.


  7. Chen CW, Tseng SP, Kuan TW, Wang JF. Outpatient text classification using attention-based bidirectional LSTM for robot-assisted servicing in hospital. Information (Switzerland). 2020;11(2):106. https://doi.org/10.3390/info11020106.


  8. Fu B, Yang Y, Ma Y, Hao J, Chen S, Liu S, et al. Attention-based recurrent Multi-Channel neural network for influenza epidemic prediction. In: Proceedings - 2018 IEEE international conference on bioinformatics and biomedicine, BIBM 2018; 2018. p. 1245–8. https://doi.org/10.1109/BIBM.2018.8621467.


  9. Gers FA, Schmidhuber J, Cummins F. Learning to forget: continual prediction with LSTM. Neural Comput. 2000;12(10):2451–71. https://doi.org/10.1162/089976600300015015.


  10. Hu Y, Wen G, Ma J, Li D, Wang C, Li H, et al. Label-indicator morpheme growth on LSTM for Chinese healthcare question department classification. J Biomed Inform. 2018;82:154–68. https://doi.org/10.1016/j.jbi.2018.04.011.


  11. Karim F, Majumdar S, Darabi H, Chen S. LSTM fully convolutional networks for time series classification. IEEE Access. 2017;6:1662–9. https://doi.org/10.1109/ACCESS.2017.2779939.


  12. Lee SH, Levin D, Finley PD, Heilig CM. Chief complaint classification with recurrent neural networks. J Biomed Inform. 2019;93:103158. https://doi.org/10.1016/j.jbi.2019.103158.


  13. Long F, Zhou K, Ou W. Sentiment analysis of text based on bidirectional LSTM with multi-head attention. IEEE Access. 2019;7:141960–9. https://doi.org/10.1109/ACCESS.2019.2942614.


  14. Gensim: Topic modeling for humans. 2019. https://radimrehurek.com/gensim/.

  15. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings; 2013. https://www.scopus.com/inward/record.uri?eid=2-s2.0-85083951332&partnerID=40&md5=20428820e8b09cdfb5078ea812a71f2d.


  16. Murhekar M, Joshua V, Kanagasabai K, Shete V, Ravi M, Ramachandran R, et al. Epidemiology of dengue fever in India, based on laboratory surveillance data, 2014–2017. Int J Infect Dis. 2019;84:S10–4. https://doi.org/10.1016/j.ijid.2019.01.004.


  17. Nadda W, Boonchieng W, Boonchieng E. Dengue fever detection using Long short-term memory neural network. In: 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, ECTI-CON 2020; 2020. p. 755–8. https://doi.org/10.1109/ECTI-CON49241.2020.9158315.


  18. Nadda W, Boonchieng W, Boonchieng E. Weighted extreme learning machine for dengue detection with class-imbalance classification. In: 2019 IEEE Healthcare Innovations and Point of Care Technologies, (HI-POCT); 2019. p. 151–4. https://doi.org/10.1109/HI-POCT45284.2019.8962825.


  19. Petmezas G, Haris K, Stefanopoulos L, Kilintzis V, Tzavelis A, Rogers JA, et al. Automated Atrial Fibrillation Detection using a Hybrid CNN-LSTM Network on Imbalanced ECG Datasets. In: Biomedical Signal Processing and Control; 2021. p. 63. https://doi.org/10.1016/j.bspc.2020.102194.


  20. PyThaiNLP. 2020. https://github.com/PyThaiNLP/pythainlp

  21. Rangarajan P, Mody SK, Marathe M. Forecasting dengue and influenza incidences using a sparse representation of Google trends, electronic health records, and time series data. PLoS Comput Biol. 2019;15(11):e1007518. https://doi.org/10.1371/journal.pcbi.1007518.


  22. Rotejanaprasert C, Ekapirat N, Areechokchai D, Maude RJ. Bayesian spatiotemporal modeling with sliding windows to correct reporting delays for real-time dengue surveillance in Thailand. Int J Health Geogr. 2020;19(1):4. https://doi.org/10.1186/s12942-020-00199-0.


  23. SMS Spam Collection Dataset. 2016. https://www.kaggle.com/uciml/sms-spam-collection-dataset.

  24. Tran D, Mac H, Tong V, Tran HA, Nguyen LG. A LSTM based framework for handling multiclass imbalance in DGA botnet detection. Neurocomputing. 2018;275:2401–13. https://doi.org/10.1016/j.neucom.2017.11.018.


  25. Venna SR, Tavanaei A, Gottumukkala RN, Raghavan VV, Maida AS, Nichols S. A novel data-driven model for real-time influenza forecasting. IEEE Access. 2019;7:7691–701. https://doi.org/10.1109/ACCESS.2018.2888585.


  26. Xiao JP, He JF, Deng AP, Lin HL, Song T, Peng ZQ, et al. Characterizing a large outbreak of dengue fever in Guangdong Province, China. Infect Dis Poverty. 2016;5(1):44. https://doi.org/10.1186/s40249-016-0131-z.


  27. Zhao S, Cai Z, Chen H, Wang Y, Liu F, Liu A. Adversarial training based lattice LSTM for Chinese clinical named entity recognition. J Biomed Inf. 2019;99:103290. https://doi.org/10.1016/j.jbi.2019.103290.


  28. Zong W, Huang GB, Chen Y. Weighted extreme learning machine for imbalance learning. Neurocomputing. 2013;101:229–42. https://doi.org/10.1016/j.neucom.2012.08.010.



Acknowledgments

This work was supported in part by the Center of Excellence in Community Health Informatics and Fundamental Fund 2022, Chiang Mai University; in part by NSRF via the Program Management Unit for Human Resources & Institutional Development, Research and Innovation [Grant Number B05F640183].

Funding

Funding for this work was provided by Center of Excellence in Community Health Informatics and Fundamental Fund 2022, Chiang Mai University; in part by NSRF via the Program Management Unit for Human Resources & Institutional Development, Research and Innovation [Grant Number B05F640183].

Author information

Contributions

WN designed and performed the literature review and led the writing of the paper. WB contributed to the design of the literature review, wrote parts of the paper and gained ethical approval. EB contributed to the design of the literature review, wrote parts of the paper and reviewed the manuscript. All authors reviewed and edited the manuscript and approved the final version of the manuscript.

Corresponding author

Correspondence to Ekkarat Boonchieng.

Ethics declarations

Ethics approval and consent to participate

Ethical approval was obtained from Faculty of Public Health, Chiang Mai University.

(Approval number ET036/2564)

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Nadda, W., Boonchieng, W. & Boonchieng, E. Influenza, dengue and common cold detection using LSTM with fully connected neural network and keywords selection. BioData Mining 15, 5 (2022). https://doi.org/10.1186/s13040-022-00288-9


  • DOI: https://doi.org/10.1186/s13040-022-00288-9
