Diagnosis of thyroid nodules for ultrasonographic characteristics indicative of malignancy using random forest

Abstract

Background

Various combinations of ultrasonographic (US) characteristics are increasingly used to classify thyroid nodules. However, these combinations lack a theoretical basis, depend heavily on radiologists’ experience, and do not always classify thyroid nodules correctly. The main purpose of this manuscript is therefore to select the US characteristics significantly associated with malignancy and to develop an efficient scoring system that helps ultrasound clinicians correctly identify thyroid malignancy.

Methods

A logistic regression (LR) model is used to identify potential thyroid malignancy, and the least absolute shrinkage and selection operator (LASSO) method is adopted to simultaneously select the US characteristics significantly associated with malignancy and estimate the parameters of the LR model. Based on the selected US characteristics, we calculate the probability of malignancy for each thyroid nodule via random forest (RF) and extreme learning machine (ELM), and develop a scoring system to classify thyroid nodules. For comparison, we also consider eight state-of-the-art methods such as support vector machine (SVM), neural network (NET), etc. The area under the receiver operating characteristic curve (AUC) is employed to measure the accuracy of the various classifiers.

Results

Nodule size, AP/T≥1, solid component, micro-calcifications, hackly border, hypoechogenicity, presence of halo, unclear border, irregular margin, and central vascularity are selected as the significant predictors of thyroid malignancy via the LASSO LR (LLR). Using the developed scoring system, thyroid nodules are classified into four categories: benign, low suspicion, intermediate suspicion, and high suspicion. On the testing dataset, the malignancy rates in these categories for the RF (ELM) method are 0.0% (4.3%), 14.3% (50.0%), 58.1% (59.1%) and 96.1% (97.7%), respectively.

Conclusion

LLR together with RF performs better than the other methods in identifying malignancy in terms of risk scores, especially for abnormal nodules. The developed scoring system predicts the risk of malignancy well and can guide clinicians in making management decisions that reduce the number of unnecessary biopsies of benign nodules.

Background

With the development of new ultrasound technology and the popularity of high-resolution scanners, it is no longer challenging to detect thyroid nodules. However, for most sonographers, the critical challenge is to distinguish malignant thyroid nodules from benign ones. To this end, some US characteristics, such as the presence of an unclear border, micro-calcifications, irregular shape, solid component, and inner echo [1–3], are widely adopted to assess nodules at risk for malignancy. Some studies have shown that no single one of these US characteristics can correctly distinguish malignant nodules from benign ones [4]. Many malignant nodules have more than two representative characteristics. Therefore, it is desirable to develop an efficient approach that improves the diagnostic accuracy for thyroid malignancy by incorporating multiple characteristics. On the other hand, the US examination can provide many potential characteristics, but some of them are inactive, i.e., uninformative for the diagnosis of thyroid cancer. Thus, distinguishing inactive characteristics from active ones may largely improve the accuracy of the diagnosis of thyroid malignancy.

In previous studies [1, 5–11], different versions of thyroid imaging reporting and data systems (TI-RADS) were proposed for thyroid nodule diagnosis and management by considering different combinations of US characteristics. Although these systems improve the efficiency of thyroid nodule diagnosis compared with traditional subjective diagnosis, they do not provide a quantitative approach to assessing the risk of malignancy. To this end, in 2017, the American College of Radiology (ACR) published the ACR TI-RADS [11] for estimating the risk of malignancy, in which TI-RADS scores are calculated from 5 categories of US characteristics. The ACR TI-RADS has since been widely applied to thyroid nodule diagnosis. However, the cumulative score calculated from the 5 categories of US characteristics still relies heavily on the radiologist’s description of the characteristics, and the efficiency of the scoring system varies with the radiologist’s experience. Moreover, the existing approaches mentioned above sometimes perform poorly and lack a theoretical basis. To overcome these defects, a recent study [12] introduced machine learning algorithms such as random forest (RF), kernel support vector machine (SVM), neural network (NET), etc., to classify thyroid nodules as benign or malignant based on the US characteristics used. But that study did not consider which US characteristics were active and which were inactive in detecting malignancy. Moreover, it considered only two types of thyroid nodules, which gives patients little information about the status of their thyroid nodules. To our knowledge, there is little work on scoring systems developed with RF for the differential diagnosis of thyroid nodules.
Hence, the main purpose of this paper is to develop an objective and quantitative scoring system to assist ultrasound clinicians in identifying thyroid cancer by (i) adopting a LASSO method to efficiently select the critical US characteristics significantly associated with malignancy as potential predictors of malignancy; (ii) using machine learning algorithms to calculate the class probability of each nodule, which is used to classify the nodule; (iii) proposing a scoring system that can be used to predict the risk of malignancy and guide clinicians in making management decisions that reduce the number of unnecessary biopsies of benign nodules.

Materials and methods

Thyroid nodules

Consider a dataset with 1558 thyroid nodules from 1480 patients collected from Jan. 2011 to Apr. 2016 at The First Affiliated Hospital of Kunming Medical University in China. In this dataset, 110 thyroid nodules from 110 patients (94 females and 16 males) can be regarded as outliers, as detected by traditional LR analysis and the score test [13]. Among these outliers, 68.7% (68/99) of the benign nodules have at least three known US malignancy characteristics, and 72.7% (8/11) of the malignant nodules have at least four benign characteristics. It is difficult to differentiate between malignant and benign nodules for these outliers based only on the US characteristics selected by existing methods [13]. Therefore, these outliers are regarded as abnormal nodules, and the remaining 1448 nodules from 1370 patients (286 male and 1084 female) are deemed disease nodules. Among the 1370 patients, the oldest is 80 years old and the youngest is only 10 years old. Surgery was performed on all the nodules. Table 1 presents the numbers of benign and malignant nodules for the female and male groups, and the mean patient ages for disease and abnormal nodules. Examination of Table 1 shows that (i) patients with malignant nodules are younger than those with benign nodules in the disease nodule group, and (ii) patients with benign nodules are younger than those with malignant nodules in the abnormal nodule group. Although age is not a US characteristic in the traditional diagnosis of thyroid nodules, these observations indicate that age may be an important factor associated with malignancy.

Table 1 Age and gender distribution of cases in disease nodules and abnormal nodules

US characteristics

We used GE LOGIQ E9 and HITACHI scanners for ultrasonic scanning and performed thyroid area scanning with a linear array probe. The type of probe was ML6-15. To ensure the comparability of thyroid images across all patients, we kept the frequency at 10 MHz. The real-time US was performed by five physicians. Drawing on various studies, we consider the following US characteristics: margin (regular or irregular), border (unclear or clear), hackly border (present or absent), halo (present or absent), vascularity (peripheral, mixed, central), blood flow degree (low, medium, high), posterior echo attenuation (present or absent), lateral shadow (present or absent), echogenicity (hypoechoic or hyperechoic), calcification (micro-calcifications, macro-calcifications or no calcifications), and shape (AP/T ≥1 or AP/T <1), defined by the shape ratio (i.e., the ratio of the anteroposterior diameter of the nodule to the transverse diameter). Component (solid or mixed) was defined in terms of the ratio of the cystic portion to the solid portion (e.g., see Fig. 1). Nodule size and patient age are also considered.

Fig. 1

US scans show characteristics of thyroid nodules: a benign nodule; b malignant nodule with hackly border; c malignant nodule with micro-calcifications; d malignant nodule with (AP/T ≥1) and irregular margin

Analysis of hypoechogenicity

Several studies [14–16] have pointed out that hypoechogenicity is a highly suspicious characteristic of malignancy. Moreover, echogenicity can be measured by the echogenicity ratio (ER), which is defined as the ratio of the echogenicity of the nodule to that of the anterior cervical muscles. Echogenicity is usually classified into two categories: hypoechogenicity and hyperechogenicity. If the ER is less than or equal to some cutoff, the nodule is taken as hypoechoic; otherwise, it is taken as hyperechoic.

To determine the best cutoff, we calculate the area under the receiver operating characteristic curve (AUC) at all the observed cutoffs ranging from 0.0 to 5.0. The cutoff corresponding to the maximum of AUC values is regarded as the optimal cutoff. The AUC is widely utilized as a measure of the performance of classifiers in machine learning, and is a better measure than Matthews correlation coefficient for assessing the prediction accuracy of a classifier in the imbalanced dataset.
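The cutoff search described above can be sketched as follows. This is a minimal illustration in plain Python (the study itself used R), and the function names and toy data are hypothetical. It relies on the fact that, for a single binary rule such as "ER ≤ cutoff means hypoechoic", the ROC curve has one interior point, so the AUC reduces to (SEN + SPE)/2.

```python
def binary_auc(sensitivity, specificity):
    """AUC of a single-threshold (binary) predictor.

    The ROC curve is the polyline (0,0) -> (1-SPE, SEN) -> (1,1),
    whose area equals (SEN + SPE) / 2.
    """
    return (sensitivity + specificity) / 2.0


def best_er_cutoff(er_values, labels, cutoffs):
    """Return (cutoff, auc) maximizing the AUC of 'ER <= cutoff => hypoechoic'.

    labels: 1 = malignant, 0 = benign; hypoechoic is the positive call.
    Assumes both classes are present in labels.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for c in cutoffs:
        tp = sum(1 for er, y in zip(er_values, labels) if er <= c and y == 1)
        tn = sum(1 for er, y in zip(er_values, labels) if er > c and y == 0)
        auc = binary_auc(tp / n_pos, tn / n_neg)
        if best is None or auc > best[1]:
            best = (c, auc)
    return best


# Hypothetical toy data (not from the study): ER values and surgical labels.
er = [0.6, 0.9, 1.0, 1.1, 1.2, 1.4, 1.6, 2.0, 2.5]
y = [1, 1, 0, 1, 1, 0, 0, 1, 0]
cutoff, auc = best_er_cutoff(er, y, cutoffs=[0.5, 1.0, 1.3, 1.5, 2.0])
```

As a consistency check, the SEN of 76.7% and SPE of 52.8% reported in the Results for the cutoff 1.3 give binary_auc(0.767, 0.528) = 0.6475, in line with the reported AUC of 0.65.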

Selection of US characteristics

Let yi be a binary response variable, i.e., yi=0 if the i-th nodule is benign, and yi=1 if the i-th nodule is malignant, and Xi=(xi1,xi2,…,xim) be a vector of US characteristics associated with the i-th nodule. The ordinary logistic regression (LR) for response yi has the form

$$ \text{Pr}\left(y_{i}=1|X_{i}\right)=\frac{\exp\left(\beta_{0}+\sum_{j=1}^{m}\beta_{j} x_{ij}\right)}{1+\exp\left(\beta_{0}+\sum_{j=1}^{m}\beta_{j} x_{ij}\right)}, $$
(1)

where β0 is an intercept, and β1,…,βm are regression coefficients. A nodule with a probability close to 1 is regarded as malignant, while a nodule with a probability close to 0 is taken as benign. It is widely recognized that the LR model above may suffer from overfitting when it includes inactive covariates. To address this issue, best subset selection based on criteria such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) can be used to select active covariates. However, it is a multi-step method and is therefore quite time-consuming when the number of covariates is moderate (e.g., [16]) or large. To solve this problem, the well-known LASSO method [17] is employed to simultaneously estimate the regression coefficients and select the active US characteristics in the LR model: it is a regularization procedure that shrinks regression coefficients toward zero and thereby simplifies the model through variable selection.

Estimators of parameters β0,β1,…,βm can be obtained by maximizing the following penalized log-likelihood function:

$$ \sum_{i=1}^{n}\left\lbrace y_{i}\left(\beta_{0}+\sum_{j=1}^{m}\beta_{j} x_{ij} \right) -\log\left[1+\exp\left(\beta_{0}+\sum_{j=1}^{m}\beta_{j} x_{ij}\right) \right] \right\rbrace -\lambda\sum_{j=1}^{m}\left|\beta_{j}\right|, $$
(2)

where λ≥0 is a tuning parameter to be estimated. When λ is sufficiently large, some of the parameter estimates are forced to be exactly zero [18–20].
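To make the penalized objective concrete, the following plain-Python sketch evaluates the logistic log-likelihood minus the L1 penalty for given coefficients. The function name and inputs are hypothetical, and the intercept is left unpenalized, as in the formula above. It only evaluates the objective and is not a fitting routine; fitting software such as glmnet typically rescales the likelihood by 1/n, so its λ is on a different scale.

```python
import math


def penalized_loglik(beta0, beta, X, y, lam):
    """Logistic log-likelihood minus lam * sum(|beta_j|).

    beta0: intercept (not penalized); beta: list of m coefficients;
    X: list of n rows with m covariates each; y: list of 0/1 labels.
    """
    total = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(b * x for b, x in zip(beta, xi))  # linear predictor
        total += yi * eta - math.log1p(math.exp(eta))
    return total - lam * sum(abs(b) for b in beta)


# Hypothetical toy input: with all coefficients at zero, each nodule
# contributes -log(2) and the penalty vanishes.
value_at_zero = penalized_loglik(0.0, [0.0, 0.0], [[1, 2], [3, 4]], [1, 0], lam=0.5)
```

Increasing lam lowers the objective for any nonzero coefficient vector, which is why the maximizer sets weak coefficients exactly to zero once lam is large enough.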

As shown in Fig. 2, we randomly divided the whole dataset into a training dataset (60%), a validating dataset (20%), and a testing dataset (20%) using stratified sampling. The LASSO shrinkage parameter λ (lambda.1se) is selected by 10-fold cross-validation on the training dataset using the glmnet R package. We then estimate the parameters βj’s, select as active the US characteristics whose estimated parameters are not equal to zero, and take the model that maximizes the AUC as the best model, where the result of surgery is regarded as the gold standard. The US characteristics with nonzero estimated parameters in the best model are retained as active predictors.
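A stratified 60/20/20 partition of the kind described above can be sketched in plain Python as follows. The function name, the rounding rule, and the seed handling are assumptions made for illustration, not the authors' R code.

```python
import random


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split indices into (train, valid, test), preserving class proportions.

    Each class is shuffled and cut separately; the test part receives
    whatever remains after rounding the train and valid sizes.
    """
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_valid = round(fractions[1] * len(idx))
        train.extend(idx[:n_train])
        valid.extend(idx[n_train:n_train + n_valid])
        test.extend(idx[n_train + n_valid:])
    return train, valid, test


# Hypothetical labels: 80 malignant (1) and 20 benign (0) nodules.
toy_labels = [1] * 80 + [0] * 20
train_idx, valid_idx, test_idx = stratified_split(toy_labels)
```

Splitting within each class keeps the malignant-to-benign ratio roughly equal across the three parts, which matters when one class dominates.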

Fig. 2

Flow chart of our proposed method

Note that the US characteristics selected above depend on the random split, since the dataset is randomly divided into training, validating and testing data. To address this issue, we repeat the above procedure 100 times and then retain the US characteristics that are selected most frequently among the 100 repetitions.
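The repetition step can be illustrated by counting how often each characteristic is selected across runs. In this plain-Python sketch the function names are hypothetical, and the threshold min_frequency is an assumption, since the text does not state which frequency qualifies as "largest".

```python
from collections import Counter


def selection_frequency(selected_per_run):
    """Fraction of runs in which each characteristic was selected."""
    counts = Counter()
    for selected in selected_per_run:
        counts.update(set(selected))  # count each name once per run
    n_runs = len(selected_per_run)
    return {name: counts[name] / n_runs for name in counts}


def stable_features(selected_per_run, min_frequency=0.5):
    """Characteristics selected in at least min_frequency of the runs.

    min_frequency is a hypothetical threshold chosen for illustration.
    """
    freq = selection_frequency(selected_per_run)
    return sorted(name for name, f in freq.items() if f >= min_frequency)


# Hypothetical output of four LASSO runs (not study results).
runs = [["margin", "halo"], ["margin"], ["margin", "size"], ["margin", "halo"]]
freq = selection_frequency(runs)
kept = stable_features(runs, min_frequency=0.5)
```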

Scoring for nodules

Data were analyzed using R version 3.6.1 (2019-07-05). The following R packages [20] were used for the classification methods: randomForest(·) (randomForest, RF: random forest), glm(·) (glmnet, LR: logistic regression), ksvm(·) (e1071, SVM: support vector machine), nnet(·) (nnet, NET: neural network), elm_train(·) (elmNNRcpp, ELM: extreme learning machine), kknn(·) (kknn, KNN: k-nearest neighbors), naiveBayes(·) (e1071, NB: naive Bayes), boosting(·) (adabag, ADAB: adaptive boosting), LiblineaR(·) (LiblineaR, LOG: L2-regularized logistic regression), and lda(·) (MASS, LDA: linear discriminant analysis).

RF is an ensemble classifier that consists of many decision trees, each of which is itself a classifier. To classify an input sample, N trees produce N classification results. The RF collects all the votes and takes the class with the most votes as the final output. At each tree split, a random sample of features is selected, and the tree is only allowed to split on those selected features. Here, the “randomForest” function builds classification and regression trees (CART), using the Gini impurity criterion as the splitting measure to construct each decision tree.
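The two ingredients named above, Gini impurity for choosing splits and majority voting across trees, can be written in a few lines. This plain-Python sketch uses hypothetical function names and does not reproduce the randomForest implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini impurity of a candidate split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def majority_vote(tree_votes):
    """Class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]
```

A node holding two benign and two malignant nodules has impurity 0.5, while a split that separates them perfectly has weighted impurity 0, so CART would prefer that split.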

We train the classifiers on the training dataset mentioned above. When data are fed through the RF, it outputs a class probability (i.e., the level of risk) PRF, which is the percentage of trees that voted for malignancy. Thus, a thyroid nodule is identified as malignant with probability PRF and benign with probability 1−PRF. For comparison, we also compute the results using an LR model, the extreme learning machine (ELM) [21, 22], and the state-of-the-art methods (e.g., SVM, NET, KNN, NB, ADAB, LOG and LDA) discussed by Zhang et al. [12]. Again, to eliminate randomness, we repeat the above partition 100 times, leading to 100 classifier sets. The risk score SRF of malignancy for each thyroid nodule is defined as the class probability averaged over the 100 repetitions. The risk scores corresponding to the LR, SVM, NET, ELM, KNN, NB, ADAB, LOG and LDA methods are denoted as SLR, SSVM, SNET, SELM, SKNN, SNB, SADAB, SLOG and SLDA, respectively.
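As a small illustration of these two definitions, the class probability is the share of trees voting malignant and the risk score is the mean of that probability over repetitions. The function names and numbers in this plain-Python sketch are hypothetical.

```python
def class_probability(tree_votes):
    """Fraction of trees voting malignant (vote 1) for one nodule."""
    return sum(tree_votes) / len(tree_votes)


def risk_score(probabilities_per_run):
    """Average class probability of one nodule over repeated partitions."""
    return sum(probabilities_per_run) / len(probabilities_per_run)


# Hypothetical votes of a five-tree forest for one nodule (1 = malignant).
votes = [1, 1, 0, 1, 0]
p_rf = class_probability(votes)

# Hypothetical class probabilities of the same nodule in three repetitions.
s_rf = risk_score([0.6, 0.8, 0.7])
```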

Here, the caret package (e.g., https://cran.r-project.org/web/packages/caret/vignettes/caret.html) is employed to tune the hyperparameters of all classifiers via 10-fold cross-validation on the training dataset. For RF, we use ntree=500 and mtry=2, where a grid search is conducted to tune the hyperparameter mtry. A grid search is likewise used to optimize the corresponding parameters of SVM (sigma=0.071, C=0.25, kernel=radial basis kernel), NET (size=1, decay=0.1), ELM (elm_train: nhid=50, actfun=sig), KNN (kernel=“rectangular”, k=9), NB (laplace=2), ADAB (boosting: boos=TRUE, mfinal=100, coeflearn=Breiman), LOG (LiblineaR: type=0, bias=TRUE, verbose=FALSE) and LDA (method=mle).

Results

Hypoechogenicity

Hypoechogenicity does not mean that the echogenicity ratio (ER) is as low as possible. Previous studies show that hypoechogenicity is associated with an increased malignancy risk [3, 14]. It is easily seen from Fig. 3 that the rate of malignancy for hyperechoic nodules is much higher than that of hypoechoic nodules regardless of malignancy and benign nodules when the cutoff is less than 0.9. In contrast, when the cutoff is larger than 2.4, the rate of malignancy for hypoechoic nodules increases slowly regardless of malignancy and benign nodules. The optimal cutoff of the ER should be chosen so that hypoechogenicity discriminates well between malignant and benign nodules. Figure 4 depicts the performance at each cutoff of the ER. Examination of Fig. 4 implies that the optimal cutoff should be taken as 1.3 because the AUC attains its maximum of 0.65 at cutoff=1.3. At this cutoff, hypoechogenicity has a sensitivity (SEN) of 76.7%, a specificity (SPE) of 52.8%, a positive predictive value (PPV) of 88.7%, and a negative predictive value (NPV) of 31.9% when detecting malignant nodules. Moreover, 76.7% of the malignant cases and 47.2% of the benign cases are hypoechoic (e.g., see Fig. 3), and Fisher’s exact test indicates a significant difference between malignant and benign nodules with respect to hypoechogenicity (P<0.001). These results show that using the optimal cutoff of 1.3 to distinguish malignant from benign nodules can yield good diagnostic performance.

Fig. 3

Cutoff of echogenicity ratio vs. rate of malignancy. The vertical line corresponds to the cutoff =1.3

Fig. 4

a Cutoff and performance of hypoechogenicity in the diagnosis of malignant nodules. b Scatter diagram of ER distribution for benign and malignant thyroid nodules. The vertical line in a and the horizontal line in b correspond to the cutoff = 1.3

US characteristic selection

From the LLR analysis, nodule size, AP/T≥1, solid component, micro-calcifications, hackly border, hypoechogenicity, presence of halo, unclear border, irregular margin and central vascularity are selected as active predictors associated with malignancy. Table 2 reports the diagnostic performance of each US characteristic in predicting malignancy. Examination of Table 2 shows that the selected characteristics have relatively high NPV (23.5%–76.5%), SEN (57%–96.4%), and AUC (0.547–0.776) compared with those not selected, which have NPV (15.6%–20.3%), SEN (4.3%–34.8%) and AUC (0.456–0.575). Among the selected characteristics, irregular margin (NPV: 49.2%, SEN: 85.0%) achieves the highest AUC (0.776). Central vascularity identified by spectral Doppler US is also selected as a malignant characteristic, even though some studies suggest that increased central vascularity is not reliable for evaluating the malignancy of thyroid nodules, while other authors accept increased central vascularity as a supporting characteristic for the diagnosis of malignancy [23]. Nodule size and age are detected as characteristics of malignancy. Moreover, the Mann–Whitney test shows a statistically significant difference in size and age between benign and malignant nodules (P<0.001). More importantly, the selected characteristics are more important than the remaining characteristics for thyroid nodule diagnosis on the training dataset (e.g., see Fig. 5). The selected characteristics are marked in bold. The reference category for each US characteristic is the one with the lowest malignancy rate, because our main purpose is to select active predictors associated with malignancy.

Fig. 5

Importance of US characteristics by RF

Table 2 NPV, PPV, SEN, SPE and AUC values for each US characteristic in disease nodules

Performance of the predictive model

We use the class probabilities to predict the risk of malignancy for each nodule on the basis of the classifiers obtained from the training dataset. A nodule is predicted to be malignant if its class probability is higher than the given cutoff (the optimal cutoff is the point on the ROC curve closest to (0,1)). To measure the performance of the classifiers, we use six metrics: AUC, SEN, F1 score, SPE, PPV, and NPV, each averaged over the 100 repetitions. As shown in Table 3, the LR (i.e., the LR model with a stepwise selection procedure) shows the highest AUC (i.e., 0.965) on both the validating and testing datasets; the RF produces the highest SEN (i.e., 89.2%) on the validating dataset and the second highest SEN (i.e., 88.3%) on the testing dataset; the ADAB has the highest F1 (i.e., 0.73) on the validating dataset; the NET leads to the highest F1 (i.e., 0.746) on the testing dataset; the LOG produces the highest SPE (i.e., 96.0%) on both the validating and testing datasets; the SVM shows the highest PPV (i.e., 98.1%) on the validating dataset; and the ADAB yields the highest PPV (i.e., 98.5%) on the testing dataset and the highest NPV (i.e., 62.9%) on the validating dataset. These observations show that none of the ten classifiers performs best on all metrics when only a cutoff of the class probability is used to differentiate between benign and malignant cases.
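The cutoff rule and the threshold-based metrics can be sketched as below. This plain-Python illustration uses hypothetical function names and toy scores, takes candidate cutoffs from the observed scores (an assumption), and follows the text in calling a nodule malignant when its score exceeds the cutoff.

```python
import math


def _ratio(num, den):
    """Safe division: NaN when the denominator is zero."""
    return num / den if den else float("nan")


def confusion_metrics(scores, labels, cutoff):
    """SEN, SPE, PPV, NPV and F1 when 'score > cutoff' means malignant (1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s <= cutoff and y == 0)
    return {
        "SEN": _ratio(tp, tp + fn),
        "SPE": _ratio(tn, tn + fp),
        "PPV": _ratio(tp, tp + fp),
        "NPV": _ratio(tn, tn + fn),
        "F1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def closest_to_corner_cutoff(scores, labels):
    """Cutoff whose ROC point (1-SPE, SEN) is nearest to the corner (0, 1)."""
    best_cutoff, best_dist = None, float("inf")
    for c in sorted(set(scores)):
        m = confusion_metrics(scores, labels, c)
        dist = math.hypot(1.0 - m["SPE"], 1.0 - m["SEN"])
        if dist < best_dist:
            best_cutoff, best_dist = c, dist
    return best_cutoff


# Hypothetical class probabilities and surgical labels (not study data).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_truth = [0, 0, 1, 0, 1, 1, 1, 1]
hc = closest_to_corner_cutoff(toy_scores, toy_truth)
metrics = confusion_metrics(toy_scores, toy_truth, hc)
```

Because SEN and SPE move in opposite directions as the cutoff changes, the corner-distance rule picks a compromise between them rather than optimizing either one alone.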

Table 3 Comparison of classification performance of machine learning methods on validating and testing datasets

The scoring system of thyroid nodules

Using only a cutoff of the class probability to differentiate thyroid cancer may increase misdiagnosis or missed diagnosis because of the considerable overlap of the US characteristics of benign and malignant nodules [6, 24, 25]. Categorizing nodules and stratifying their risks of malignancy according to the risk score (class probability) may be one of the most efficient approaches to solving this problem. A greater risk score suggests a higher malignant risk. Figure 6 displays the scores of each nodule for the ten classifiers on the training and validating datasets. It is observed that most of the scores associated with malignant nodules are greater than those associated with benign nodules. Figure 6 also shows that the risk scores of benign and malignant nodules overlap much more for LR, SVM, NET, and ELM at the bottom of the band, indicating that these classifiers assign low scores to some malignant nodules; thus a truly malignant nodule may be incorrectly classified as benign.

Fig. 6

The risk score of malignancy for each thyroid nodule, as calculated by each of the classification methods. The green crosses represent malignant nodules and the blue dots benign nodules

For benign nodules, 75% of the risk scores are less than 0.497 using the RF. In contrast, 75% of the risk scores are less than 0.617 using the LR, 0.644 using the SVM, 0.561 using the NET, 0.588 using the ELM, 0.636 using the KNN, 0.437 using the NB, 0.313 using the ADAB, 0.635 using the LOG and 0.712 using the LDA (e.g., see Fig. 7). Meanwhile, for malignant nodules, the lowest SRF, SLR, SSVM, SNET, SELM, SKNN, SNB, SADAB, SLOG and SLDA are 0.152, 0.065, 0.025, 0.128, 0.002, 0.000, 0.000, 0.221, 0.105 and 0.006, respectively.

Fig. 7

Box plots of the risk score of malignancy for benign and malignant nodules

We establish a classification system for thyroid nodules (e.g., see Table 4) in terms of risk scores. For the training and validating datasets, 1158 thyroid nodules (960 malignant, 198 benign) are classified into four categories via the risk scores: the benign category, in which the nodule has a score less than the 95% confidence lower limit lc of its mean, calculated using the bootstrap percentile method with 1000 bootstrap replications; the low suspicion category, in which the nodule has a score ranging from lc to 0.5; the intermediate suspicion category, which includes the nodules with scores ranging from 0.5 to the cutoff hc, the point on the ROC curve closest to (0,1); and the high suspicion of malignancy category, in which the nodule has a score greater than hc.
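The four-category rule can be sketched as follows. This plain-Python illustration rests on several assumptions: the function names are hypothetical, the sample of scores whose mean is bootstrapped is not specified in the text and is simply passed in, and the handling of scores that fall exactly on a boundary is a choice made here.

```python
import random


def bootstrap_lower_limit(values, n_boot=1000, level=0.95, seed=0):
    """Lower limit of a percentile-bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0  # 0.025 for a 95% interval
    return means[int(alpha * n_boot)]


def categorize(score, lc, hc):
    """Map a risk score to one of the four categories (assumes lc < 0.5 < hc)."""
    if score < lc:
        return "benign"
    if score <= 0.5:  # boundary handling is an assumption
        return "low suspicion"
    if score <= hc:
        return "intermediate suspicion"
    return "high suspicion"


# Hypothetical risk scores (not study data) used to derive a lower limit.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
lc = bootstrap_lower_limit(toy_scores)
labels = [categorize(s, lc=0.2, hc=0.7) for s in (0.1, 0.3, 0.6, 0.9)]
```

Using fixed thresholds lc, 0.5 and hc turns a continuous score into an ordered scale, so each category can be paired with a management recommendation.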

Table 4 Risk scoring system of thyroid nodules on the training and validating dataset

As shown in Table 4, with the RF the malignancy rate is low in the benign category and high in the high suspicion category. The ADAB method attains the lowest malignancy rate in the benign category and the highest malignancy rate in the high suspicion category, but its computation is time-consuming. We also recommend guidelines for the management of thyroid nodules according to their categories. The risk scoring system based on RF is superior to those of the other methods in diagnosing thyroid cancer, with malignancy rates of 3.5%, 21.2%, 57.3%, and 98.7% in the benign, low suspicion, intermediate suspicion, and high suspicion of malignancy categories, respectively.

Final validation

To guard against over-fitting of the classifiers and to test the reliability of the risk scoring system using the RF, we conduct the final validation on the testing dataset. The results are given in Table 5. Among the ten classifiers considered, the RF yields malignancy rates of 0%, 14.3%, 58.1% and 96.1% in the benign, low suspicion, intermediate suspicion, and high suspicion of malignancy categories, respectively, compared with 4.2%, 37.5%, 63.0%, and 96.9% for the LR, 14.8%, 37.5%, 58.3%, and 96.8% for the SVM, 3.7%, 35.3%, 69.6%, and 97.3% for the NET, 4.3%, 50.0%, 59.1%, and 97.7% for the ELM, 8%, 11.1%, 73.3%, and 96.7% for the KNN, 22%, 71.4%, 67.9%, and 96.7% for the NB, 0%, 26.7%, 82.4%, and 93.6% for the ADAB, 5.3%, 23.5%, 64.3%, and 98.1% for the LOG, and 6.9%, 33.3%, 64.9%, and 97.2% for the LDA, which shows that the RF method outperforms the other nine methods.

Table 5 Risk scoring system of thyroid nodules on the testing dataset

Abnormal nodules

In the abnormal nodules, there is considerable overlap between the characteristics of malignant and benign nodules. Fifty-five (55.6%) of the 99 benign nodules have AP/T≥1; 69 (69.7%) have a hackly border; 61 (61.6%) contain micro-calcifications; and 100% are solid. In contrast, among the 11 malignant nodules, only one (9.1%) has each of AP/T≥1, hackly border, solid component, and micro-calcifications, which have a significant association with malignancy. The values of SEN, SPE, PPV, and AUC for each US characteristic are lower for the abnormal nodules than for the disease nodules.

When the abnormal nodules are added to the disease nodules, the performance metrics of all the US characteristics decrease except for NPV (e.g., see Table 6), and the performance metrics of four classifiers (e.g., see Table 7) also decrease. For example, Table 7 shows that the LR and LDA methods show the highest AUC (i.e., 0.820), the NB method produces the highest SEN (i.e., 85.1%) and NPV (i.e., 55.0%), and the LR method has the highest F1 (i.e., 0.611), while the NET method produces the highest SPE (i.e., 74.3%) and PPV (i.e., 91.4%). At the same time, the RF method yields better results than the other methods in terms of risk scores and the risk scoring system (e.g., see Tables 8 and 9).

Table 6 NPV, PPV, SEN, SPE and AUC values for each US characteristic in overall nodules
Table 7 Comparison of classification performance of machine learning methods on validating dataset of overall nodules
Table 8 Risk scoring system of thyroid nodules on training and validating dataset in overall nodules
Table 9 Risk scoring system of thyroid nodules on testing dataset in overall nodules

Discussion

From the LLR, the US characteristics tumor size, AP/T, solid component, micro-calcifications, hackly border, hypoechogenic area, present halo, unclear border, irregular shape, and central vascularity showed a significant association with malignancy. In fact, previous studies have shown that AP/T, solid component, micro-calcifications, and irregular shape are consistently associated with a higher risk of malignancy [8]; an absent halo and the vascular pattern can be suggestive of malignancy [4]; and tumor size and hackly border are risk factors for malignant nodules [10, 26]. The results obtained with the LLR method were compared with the management guidelines [10] and many previous studies [3, 8, 26], in which a solid hypoechoic nodule or the solid hypoechoic component of a partially cystic nodule has one or more of the following characteristics: hackly border, micro-calcifications, AP/T>1, high suspicion US pattern. The comparison indicates the effectiveness of the LLR method for selecting active features. Incorporating these characteristics as predictors yields relatively higher SEN, AUC, SPE, NPV, and PPV values than using only one of the characteristics. Consequently, a combination of highly correlated characteristics can indeed improve the prediction of malignancy risk compared with a single characteristic, which is consistent with the finding that no single US feature on its own can reliably differentiate malignant nodules from benign ones [12].

Our proposed hybrid method (i.e., incorporating LLR and RF) can not only select important US features via LASSO but also obtain a risk score via the LR model with the selected predictors, which provides basic information for classification and leads to a more effective and objective diagnosis than the conventional classifiers discussed in Zhang et al. [12]. Although Zhang et al. [12] compared the performance of conventional classifiers with that of the RF method and recommended the use of the RF method, they did not provide a quantitative approach to assessing the risk of a malignant nodule, nor did they calculate a risk score for thyroid nodules, so the level of risk for the classifier remained unknown. At the same time, the statistical results show that our proposed hybrid classifier outperforms other classifiers such as LR, SVM, NET, ELM, NB, ADAB, LOG and LDA in terms of their corresponding malignancy rates, which implies that combining the LR model with the selected predictors and the RF method can improve the prediction of malignancy. In particular, our proposed method performs better than the widely used RF method in terms of the risk score system because we use the optimal cutoff point instead of the default cutoff point when implementing the RF algorithm. Although the extreme learning machine (ELM) has been explored for discriminating malignant and benign thyroid nodules based on the sonographic features in ultrasound images [22], it was not compared with other methods. In addition, in previous studies (e.g., see [7, 12, 22, 24]), thyroid nodules were classified as benign or malignant based on the US characteristics together with the default cutoff of the class probability (i.e., 0.5), which may increase misdiagnosis or missed diagnosis because benign and malignant nodules share many US characteristics.
In this study, we categorized nodules into four categories: benign, low suspicion, intermediate suspicion, and high suspicion, according to the risk score of the thyroid nodule and the 2015 American Thyroid Association management guidelines [10].

US characteristics that are suggestive of thyroid malignancy should be an indication for fine needle aspiration (FNA) biopsy and even further treatment such as surgery. However, different levels of clinical experience and different descriptions of US findings may lead to varying diagnostic accuracy. Thus, there is a significant demand for objective criteria for selecting nodules for FNA biopsy or surgery in order to minimize costs. In our study, we scored each thyroid nodule and designed a scoring system to classify thyroid nodules in terms of their class probabilities calculated by RF. Our scoring system could (i) standardize the categorical reporting system and make the ultrasonic report objective; (ii) quantify the description of the US findings and provide helpful guidelines to clinicians in classifying the nodules, stratifying the risk of thyroid tumors, and selecting patients for surgery or appropriate follow-up; (iii) significantly reduce misdiagnosis by summarizing the experience of a large number of clinicians.

The malignancy-risk score computed by the RF algorithm, rather than a simple count of suspicious characteristics, assigned a higher risk to malignant nodules and a lower risk to benign nodules. Nodules were then classified into several diagnostic categories, each associated with a different cancer risk, ranging from benign to high suspicion. Therefore, clinicians and patients can obtain a quantitative estimate of the probability that a thyroid tumor is malignant from our scoring system.
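To make the mapping from a risk score to a diagnostic category concrete, the following is a minimal sketch. The four category names come from the text; the cut points in `HYPOTHETICAL_CUTS` are placeholders chosen for illustration and are not the thresholds used in the study.

```python
import bisect
from typing import Sequence

CATEGORIES = ("benign", "low suspicion", "intermediate suspicion", "high suspicion")

# Hypothetical cut points, NOT taken from the paper.
HYPOTHETICAL_CUTS = (0.25, 0.50, 0.75)


def risk_category(prob: float, cuts: Sequence[float] = HYPOTHETICAL_CUTS) -> str:
    """Map a malignancy probability in [0, 1] to one of four categories.

    Intervals are closed on the left: a probability equal to a cut point
    falls into the higher-risk category.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if len(cuts) != 3 or list(cuts) != sorted(cuts):
        raise ValueError("expected three increasing cut points")
    return CATEGORIES[bisect.bisect_right(list(cuts), prob)]


for p in (0.10, 0.25, 0.60, 0.96):
    print(p, risk_category(p))
```

Unlike a single 0.5 cutoff, the graded output keeps borderline nodules visible as "low" or "intermediate" suspicion instead of forcing them into benign or malignant.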

“Hypoechogenicity” is a qualitative term and cannot give absolutely objective information on the degree of echogenicity [27]. Considering the differences in imaging among diagnostic scanners and the subjectivity of radiologists' diagnoses, we used the relatively objective echo ratio to unify and quantify the traditional echo intensity. Patients with thyroid nodules often have diffuse thyroid lesions, in which the echo level of the gland changes greatly. Accordingly, we divided the light and dark (gray-scale) values of the nodule in the image by those of the anterior cervical muscles and used the result as the echo ratio parameter. The quantified nodule echo values also allowed us to search for a diagnostic cutpoint to replace the traditional qualitative grades of hypoechoic, isoechoic, and hyperechoic. Hypoechogenicity was operationally defined in our study as an echogenicity ratio of less than or equal to 1.3, a threshold that is open to debate.
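The echogenicity ratio described above can be sketched as follows. The text does not specify how the gray-scale values of each region are summarized, so this example assumes the mean gray value of a region of interest is used; the function names are hypothetical, and only the 1.3 threshold comes from the text.

```python
from statistics import mean
from typing import Sequence

HYPOECHOIC_THRESHOLD = 1.3  # operational definition stated in the text


def echogenicity_ratio(nodule_pixels: Sequence[float],
                       muscle_pixels: Sequence[float]) -> float:
    """Ratio of nodule gray level to anterior cervical muscle gray level.

    Assumption: each region is summarized by its mean gray value.
    """
    if not nodule_pixels or not muscle_pixels:
        raise ValueError("both regions of interest must contain pixels")
    muscle_mean = mean(muscle_pixels)
    if muscle_mean == 0:
        raise ValueError("muscle reference region has zero mean gray level")
    return mean(nodule_pixels) / muscle_mean


def is_hypoechoic(ratio: float, threshold: float = HYPOECHOIC_THRESHOLD) -> bool:
    """Apply the study's operational rule: ratio <= threshold."""
    return ratio <= threshold


# Made-up gray values for illustration only.
ratio = echogenicity_ratio([58, 62, 60], [49, 51, 50])
print(round(ratio, 3), is_hypoechoic(ratio))
```

Using the muscle rather than the thyroid parenchyma as the reference avoids the problem, noted above, that the gland's own echo level changes in diffuse thyroid disease.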

In our study, all nodules had been surgically confirmed as benign or malignant, which helped us to evaluate the performance of the classifiers. However, this also introduced sampling bias: nodules with a relatively higher risk of malignancy were usually recommended for surgery even when they were truly benign, so our sample contained benign nodules with multiple malignant characteristics. For example, the sample contained no cystic nodules, although a cystic appearance is one of the benign characteristics of thyroid nodules [10]. Therefore, it is difficult for radiologists or computerized systems to diagnose such benign nodules correctly, and the rate of misdiagnosis is usually high. Nevertheless, RF still performed better than the other methods, even for these abnormal nodules. Abnormal nodules need to be diagnosed very carefully since they may also be cancerous; consequently, they can be categorized as borderline and recommended for FNA biopsy. In addition, Table 3 shows that the high prevalence of malignancy may affect the accuracy of prediction for benign nodules, leading to the low NPVs of the classifiers: RF 61.2%, LR 60.0%, SVM 58.2%, NET 60.8%, ELM 52.4%, KNN 51.7%, NB 56.9%, ADAB 62.9%, LOG 56.5%, and LDA 55.6%.
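The dependence of NPV on prevalence follows from Bayes' rule: for fixed sensitivity and specificity, NPV = SPE(1 − p) / [SPE(1 − p) + (1 − SEN)p], where p is the prevalence of malignancy. The short sketch below evaluates this standard formula; the sensitivity, specificity, and prevalence values are made up for illustration and are not the study's results.

```python
def npv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Negative predictive value from sensitivity, specificity, prevalence."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    true_neg = specificity * (1.0 - prevalence)    # benign, called benign
    false_neg = (1.0 - sensitivity) * prevalence   # malignant, called benign
    if true_neg + false_neg == 0.0:
        raise ValueError("no negative calls; NPV is undefined")
    return true_neg / (true_neg + false_neg)


# Same hypothetical classifier, evaluated at a low and a high prevalence.
for p in (0.1, 0.7):
    print(p, round(npv(0.9, 0.8, p), 3))
```

With the classifier held fixed, the NPV drops as prevalence rises, which is consistent with the explanation above for the low NPVs in a surgically selected, malignancy-rich sample.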

The limitations of this study are that it did not consider real-time elastography data [12], interactions among the considered US characteristics, or outlier detection.

Conclusions

We identified the US characteristics indicative of malignancy in thyroid nodules and designed a practical classification scheme based on these characteristics to quantify the risk of malignancy. It could standardize the categorical reporting system, make ultrasonic reports more objective, and simplify radiologists' description of US characteristics. The scoring system can be used to predict the risk of malignancy and to guide management decisions so as to reduce the number of unnecessary biopsies of benign nodules. Because the proposed LLR together with RF performs better than other methods in identifying malignancy in terms of risk scores, especially for abnormal nodules, we recommend using LLR together with RF in applications.

Availability of data and materials

All data generated or analyzed during this study are included in this published article. Please contact the author for the code of the software and the documentation.

Abbreviations

TI-RADS: Thyroid Imaging Reporting and Data System
RF: Random forest
ER: Echogenicity ratio
TP: True positives
TN: True negatives
FP: False positives
FN: False negatives
LLR: Logistic LASSO regression
ROC: Receiver operating characteristic curve
SEN: Sensitivity
SPE: Specificity
PPV: Positive predictive value
NPV: Negative predictive value
AUC: Area under curve
SVM: Support vector machine
NET: Neural network
KNN: K-nearest neighbor
NB: Naive Bayesian
ADAB: Adaptive boosting
LOG: L2-logistic regression
LDA: Linear discriminant analysis
ELM: Extreme learning machine
FNA biopsy: Fine-needle aspiration biopsy

References

1. Kwak JY, Han KH, Yoon JH, Moon HJ, Son EJ, Park SH, Jung HK, Choi JS, Kim BM, Kim E-K. Thyroid imaging reporting and data system for US features of nodules: a step in establishing better stratification of cancer risk. Radiology. 2011; 260(3):892–9.
2. Wang Y, Lei K-R, He Y-P, Li X-L, Ren W-W, Zhao C-K, Bo X-W, Wang D, Sun C-Y, Xu H-X. Malignancy risk stratification of thyroid nodules: comparisons of four ultrasound Thyroid Imaging Reporting and Data Systems in surgically resected nodules. Sci Rep. 2017; 7(1):1–10.
3. Adamczewski Z, Lewiński A. Proposed algorithm for management of patients with thyroid nodules/focal lesions, based on ultrasound (US) and fine-needle aspiration biopsy (FNAB); our own experience. Thyroid Res. 2013; 6(1):6.
4. Morris LF, Ragavendra N, Yeh MW. Evidence-based assessment of the role of ultrasonography in the management of benign thyroid nodules. World J Surg. 2008; 32(7):1253–63.
5. Horvath E, Majlis S, Rossi R, Franco C, Niedmann JP, Castro A, Dominguez M. An ultrasonogram reporting system for thyroid nodules stratifying cancer risk for clinical management. J Clin Endocrinol Metab. 2009; 94(5):1748–51.
6. Park J-Y, Lee HJ, Jang HW, Kim HK, Yi JH, Lee W, Kim SH. A proposal for a thyroid imaging reporting and data system for ultrasound features of thyroid carcinoma. Thyroid. 2009; 19(11):1257–64.
7. Kwak JY, Jung I, Baek JH, Baek SM, Choi N, Choi YJ, Jung SL, Kim E-K, Kim J-A, Kim J-h, Kim KS, Lee JH, Moon HJ, Moon W-J, Park JS, Ryu JH, Shin JH, Son EJ, Sung JY, Na DG. Erratum: Image reporting and characterization system for ultrasound features of thyroid nodules: multicentric Korean retrospective study. Korean J Radiol. 2013; 14(2):389.
8. Kwak JY, Han KH, Yoon JH, Moon HJ, Son EJ, Park SH, Jung HK, Choi JS, Kim BM, Kim E-K. Thyroid imaging reporting and data system for US features of nodules: a step in establishing better stratification of cancer risk. Radiology. 2011; 260(3):892–9.
9. Russ G, Royer B, Bigorgne C, Rouxel A, Bienvenu-Perrard M, Leenhardt L. Prospective evaluation of thyroid imaging reporting and data system on 4550 nodules with and without elastography. Eur J Endocrinol. 2013; 168(5):649–55.
10. Haugen BR, Alexander EK, Bible KC, Doherty GM, Mandel SJ, Nikiforov YE, Pacini F, Randolph GW, Sawka AM, Schlumberger M, Schuff KG, Sherman SI, Sosa JA, Steward DL, Tuttle RM, Wartofsky L. 2015 American Thyroid Association management guidelines for adult patients with thyroid nodules and differentiated thyroid cancer: the American Thyroid Association guidelines task force on thyroid nodules and differentiated thyroid cancer. Thyroid. 2016; 26(1):1–133.
11. Tessler FN, Middleton WD, Grant EG, Hoang JK, Berland LL, Teefey SA, Cronan JJ, Beland MD, Desser TS, Frates MC, Hammers LW, Hamper UM, Langer JE, Reading CC, Scoutt LM, Stavros AT. ACR thyroid imaging, reporting and data system (TI-RADS): white paper of the ACR TI-RADS committee. J Am Coll Radiol. 2017; 14(5):587–95.
12. Zhang B, Tian J, Pei S, Chen Y, He X, Dong Y, Zhang L, Mo X, Huang W, Cong S, Zhang S. Machine learning–assisted system for thyroid nodule diagnosis. Thyroid. 2019; 29(6):858–67.
13. Xu R, Yi D, Xia J. The principal research to assess the outliers of the logistic regression model. Acta Academiae Medicinae Militaris Tertiae. 1994; 16(5):326–8.
14. Wu M-H, Chen C-N, Chen K-Y, Ho M-C, Tai H-C, Wang Y-H, Chen A, Chang K-J. Quantitative analysis of echogenicity for patients with thyroid nodules. Sci Rep. 2016; 6:35632.
15. Tutuncu Y, Berker D, Isik S, Akbaba G, Ozuguz U, Kucukler FK, Göcmen E, Yalcın Y, Aydin Y, Guler S. The frequency of malignancy and the relationship between malignancy and ultrasonographic features of thyroid nodules with indeterminate cytology. Endocrine. 2014; 45(1):37–45.
16. Kim JY, Kim SY, Yang KR. Ultrasonographic criteria for fine needle aspiration of nonpalpable thyroid nodules 1-2 cm in diameter. Eur J Radiol. 2013; 82(2):321–6.
17. Pereira JM, Basto M, da Silva AF. The logistic lasso and ridge regression in predicting corporate failure. In: Iacob AI, editor. 3rd Global Conference on Business, Economics, Management and Tourism: 2016. p. 634–41.
18. Kim SM, Kim Y, Jeong K, Jeong H, Kim J. Logistic LASSO regression for the diagnosis of breast cancer using clinical demographic data and the BI-RADS lexicon for ultrasonography. Ultrasonography. 2018; 37(1):36–42.
19. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning with applications in R, 1st; 2013, pp. 221–7.
20. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: data mining, inference, and prediction, 2nd edn; 2009;99, pp. 61–79.
21. Huang G, Huang G-B, Song S, You K. Trends in extreme learning machines: A review. Neural Netw. 2015; 61:32–48.
22. Xia J, Chen H, Li Q, Zhou M, Chen L, Cai Z, Fang Y, Zhou H. Ultrasound-based differentiation of malignant and benign thyroid nodules: An extreme learning machine approach. Comput Methods Programs Biomed. 2017; 147:37–49.
23. Algin O, Algin E, Gokalp G, Ocakoğlu G, Erdoğan C, Saraydaroglu O, Tuncel E. Role of duplex power Doppler ultrasound in differentiation between malignant and benign thyroid nodules. Korean J Radiol Off J Korean Radiol Soc. 2010; 11(6):594–602.
24. Watters DAK, Ahuja AT, Evans RM, Chick W, King WWK, Metreweli C, Li AKC. Role of ultrasound in the management of thyroid nodules. Am J Surg. 1992; 164(6):654–7.
25. Wienke JR, Chong WK, Fielding JR, Zou KH, Mittelstaedt CA. Sonographic features of benign thyroid nodules: interobserver reliability and overlap with malignancy. J Ultrasound Med. 2003; 22(10):1027–31.
26. Papini E, Guglielmi R, Bianchini A, Crescenzi A, Taccogna S, Nardi F, Panunzi C, Rinaldi R, Toscano V, Pacella CM. Risk of malignancy in nonpalpable thyroid nodules: predictive value of ultrasound and color-Doppler features. J Clin Endocrinol Metab. 2002; 87(5):1941–6.
27. Erol B, Kara T, Gürses C, Karakoyun R, Köroğlu M, Süren D, Bülbüller N. Gray scale histogram analysis of solid breast lesions with ultrasonography: can lesion echogenicity ratio be used to differentiate the malignancy? Clin Imaging. 2013; 37(5):871–5.

Acknowledgements

The research was carried out using supercomputers at Yunnan Key Laboratory of Statistical Modeling and Data Analysis.

Funding

Financial support comes from Key Projects of the National Natural Science Foundation of China (Grant No. 11731011), Yunnan Medical Science leader project (D-201648), Yunling technology and industry leader project (Zhu M), and Projects of the Department of Science and Technology of Yunnan Province (2016FA031).

Author information

Contributions

Dan Chen implemented the computation, partially interpreted the results, and finished the final manuscript. Jun Hu and Yang Yang partially analyzed the results and wrote the draft of the manuscript. Mei Zhu participated in the elaboration of the biological concept concerning the importance of local. Niansheng Tang conceived and designed the work. Yuran Feng collected the data. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Jun Hu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Chen, D., Hu, J., Zhu, M. et al. Diagnosis of thyroid nodules for ultrasonographic characteristics indicative of malignancy using random forest. BioData Mining 13, 14 (2020). https://doi.org/10.1186/s13040-020-00223-w

Keywords

  • Random forest
  • Risk score
  • Thyroid nodule
  • Ultrasonographic characteristic