Predicting metabolite-disease associations based on KATZ model

Background Growing evidence shows that metabolites respond to pathological changes. However, identifying disease-related metabolites remains a major challenge in biology and medicine. Traditional medical equipment is limited in accuracy and is also expensive and time-consuming. It is therefore necessary to use computational methods to predict potential associations between metabolites and diseases. Results In this study, we develop a computational method based on the KATZ algorithm to predict metabolite-disease associations (KATZMDA). First, we extract metabolite-disease pairs from the latest version of the HMDB database as the data for prediction. Then we combine disease semantic similarity with the improved disease Gaussian Interaction Profile (GIP) kernel similarity to obtain a more reliable disease similarity and enhance the predictive performance of the proposed method. This is also the first time that the KATZ algorithm has been applied in metabolomics. Conclusions Three kinds of cross validation and case studies of three common diseases indicate that KATZMDA is an effective tool for predicting potential associations between metabolites and diseases.


Background
Metabolism, a generic term for a series of ordered chemical reactions, plays a critical role in maintaining human life, including the growth and reproduction of organisms and the body's reaction to the external environment [1][2][3]. Numerous studies and experiments have indicated that the concentrations of some metabolites differ between patients and healthy people [4]. Hence, relevant metabolite-disease associations are an important basis for doctors in diagnosis and treatment [4]. Diabetes is a typical example. Blood sugar is naturally associated with diabetes because the concentration of blood sugar in a diabetes patient's body is usually higher than in a healthy body. In the past 10 years, after many experiments and clinical cases, many metabolites whose levels change significantly, such as blood sugar, have gradually become criteria for doctors to diagnose diabetes [5]. This example shows that metabolites play an indispensable role in research on human diseases, and exploring the associations between them has increasingly become a hot topic.
With the improvement of high-throughput metabolomics technologies, researchers can obtain a great deal of valuable information. Meanwhile, metabolomic databases have gradually been developed, which is critical to the development of metabolomics [6]. For instance, the HMDB database [7], which contains reliable information on human metabolites, has continued to grow and evolve through enhancement and expansion of its data from version 1.0 to 4.0 [7]. However, the metabolite-disease associations identified so far are only the tip of the iceberg, and thousands of potential metabolite-disease associations remain to be tested and proved. Conventional biological experiments can test and verify some assumptions but usually take considerable time to produce results. If the results deviate too much from the assumptions or are not significant, experimenters may have to bear the financial loss. Thus, it is important to develop computational methods that save experimental time and funds and supply usable prediction results. Methods for predicting potential associations between biological molecules have been proposed in genomics, such as gene-disease correlations [8][9][10], in transcriptomics, such as circRNA-disease associations [11,12], and in proteomics, such as the identification of essential proteins [13][14][15]. In contrast, computational methods for predicting metabolite-disease associations are very few. One is "Identifying diseases-related metabolites using random walk" [16], the first method to explore these latent associations, which promoted the development of computational methods in metabolomics. However, that method only considers disease similarity when calculating metabolite similarity. In order to make full use of the known data, we use metabolite GIP kernel similarity as the metabolite similarity and add the integrated disease similarity to calculate the predicted results.
In this study, we put forward a computational method named KATZMDA to explore novel metabolite-disease associations. The method is inspired by the KATZ algorithm, which has been used to predict associations in social networks. It mainly consists of three steps. First, the raw data are obtained from the newest version of HMDB as the basis for prediction. Second, we compute similarities for metabolites and diseases to enrich the types of data: the metabolite similarity network is computed from metabolite GIP kernel similarity, while the improved disease GIP kernel similarity sub-network and the semantic similarity sub-network are integrated into the disease similarity network. Third, we predict metabolite-disease associations based on the KATZ algorithm. Finally, we adopt leave-one-out cross validation (LOOCV), 5-fold cross validation and 10-fold cross validation to evaluate the performance of KATZMDA, which achieves AUC (area under the ROC curve) values of 0.9186, 0.8897 +/− 0.0173 and 0.9029 +/− 0.0073, respectively. For further verification, we carry out case studies of Liver disease, Cerebral infarction and Gestational diabetes. Moreover, the AUC values in the section Comparison with other methods confirm that our method outperforms the other methods. These results indicate that KATZMDA is effective and dependable in predicting potential metabolite-disease associations.

Leave-one-out cross validation (LOOCV)
LOOCV is a common tool for evaluating the performance of a computational method such as ours. In LOOCV, each known metabolite-disease association is used in turn as the test sample, the remaining known associations are regarded as the training set, and the unknown associations serve as candidate samples. The final result is obtained after all the known associations have taken their turn as the test sample. There are 4537 known metabolite-disease associations, so our experiment needs to be run 4537 times. In every loop, the test sample is considered a successful prediction if its rank is above the given threshold. By varying the threshold, we obtain a series of values of True Positive Rate (TPR, sensitivity) and False Positive Rate (FPR, 1-specificity), from which the ROC curve is drawn. The prediction performance of our model is then given by the AUC. An AUC close to 1 indicates perfect performance, an AUC close to 0.5 indicates random performance, and an AUC close to 0 indicates very poor performance. After several experiments, we find that our model achieves its best LOOCV performance, an AUC of 0.90, when parameter k is equal to 2. If k is greater than 2, the AUC drops (see Fig. 1 (a)).
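The rank-and-threshold procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `loocv_roc`, the 1-based rank convention (rank 1 is the highest score) and the trapezoidal integration are choices made here, not details given in the text.

```python
import numpy as np

def loocv_roc(test_ranks, n_candidates):
    """ROC points and AUC from the rank of each held-out association.

    test_ranks[i] is the 1-based rank of the i-th held-out known association
    among itself plus n_candidates unknown pairs (rank 1 = highest score).
    """
    ranks = np.asarray(test_ranks)
    thresholds = np.arange(0, n_candidates + 2)        # 0 .. n_candidates + 1
    # hit[t, i] is 1 when test sample i is ranked within the top t
    hit = (ranks[None, :] <= thresholds[:, None]).astype(int)
    tpr = hit.mean(axis=1)
    # Candidates inside the top t in run i: t, minus the test sample if it is inside too
    fpr = ((thresholds[:, None] - hit) / n_candidates).mean(axis=1)
    # Trapezoidal area under the (FPR, TPR) curve
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc
```

If every held-out association is ranked first, the AUC is 1; if every one is ranked last, the AUC is 0, which matches the interpretation of the AUC given above.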

K-fold cross validation
K-fold cross validation is also implemented to evaluate the performance of our method. In K-fold cross validation, all the known metabolite-disease pairs are randomly and evenly divided into k parts. One part is regarded as the test set, and the remaining k-1 parts are used for training. As in LOOCV, the unknown metabolite-disease pairs are used as candidate samples. Specifically, 5-fold and 10-fold cross validation are adopted to further evaluate the prediction performance of KATZMDA. To reduce the bias introduced by the random division of the sets, we repeat this experiment many times, and the corresponding ROC curves and AUCs are obtained as in LOOCV. Lastly, we

Comparison with other methods
In order to evaluate the performance of KATZMDA in predicting potential metabolite-disease associations, we compare KATZMDA with random walk with restart (RWR) and the PageRank method, and run the validation experiments described above for each method on the same dataset. For RWR, we use the same parameters as Hu's method [16]. For LOOCV, RWR and PageRank obtained AUCs of 0.7633 and 0.8242, respectively. For 5-fold cross validation, RWR and PageRank obtained AUCs of 0.6692 and 0.7951, respectively. For 10-fold cross validation, RWR and PageRank obtained AUCs of 0.7266 and 0.8113, respectively (see Fig. 2). Under all of these evaluation schemes, KATZMDA obtains a higher AUC value. This means that KATZMDA is more effective than the compared methods and has the potential to uncover more novel metabolite-disease associations.

Parameter analysis
In this section, we investigate the influence of several parameters on our proposed method and look for their best values. We analyze the following parameters. The weight parameter γ determines the proportion of the two types of disease similarity in the final disease similarity, so we vary it from 0.1 to 0.9 (see Table 1). Following the previous study, the parameter δ is selected below 1/‖M‖_2. We nevertheless vary its value in the same way as γ to explore its effect on our method (see Table 2). We find that the AUC is relatively stable when δ changes, and we set δ to 0.1. The parameter k, which represents the length of paths between metabolites and diseases, is usually set to 3, but after several tests we find that k = 2 gives the best estimated performance in our experiment (see Tables 1 and 2, Fig. 1 (a)). The results for different values of k are displayed in Tables 1 and 2 and Fig. 3 (a-c). For the sake of time efficiency, we use five-fold cross validation to calculate the above results. Finally, we select the best parameter group for each value of k for comparison (see Fig. 3 (d)). The best parameters are k = 2, γ = 0.1 and δ = 0.1.

Fig. 2 The ROC about k-fold cross-validation. Comparison of KATZMDA with other computational methods for (a) 5-fold cross-validation, (b) 10-fold cross-validation
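As an illustration of the bound on δ mentioned in the parameter analysis, the short sketch below computes 1/‖M‖_2 for a toy matrix. It assumes that ‖M‖_2 denotes the spectral norm (largest singular value), which the text does not state explicitly; the function name `delta_upper_bound` and the matrix values are made up for the example.

```python
import numpy as np

def delta_upper_bound(M):
    """Return 1 / ||M||_2, reading ||M||_2 as the spectral norm (assumption)."""
    return 1.0 / np.linalg.norm(M, 2)

# Toy 2 x 3 disease-by-metabolite matrix with made-up entries.
M = np.array([[1, 0, 1],
              [0, 1, 1]], dtype=float)
bound = delta_upper_bound(M)
print(f"delta should stay below {bound:.4f}")
```

For this toy matrix the largest singular value is the square root of 3, so a value such as δ = 0.1 lies well inside the bound.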

Case study
In this section, we take several diseases as case studies to gain a deeper understanding of the associations between metabolites and diseases. The three common diseases are Liver disease, Cerebral infarction and Gestational diabetes. To check the accuracy of our results, we look for evidence in published papers that supports the predicted associations. For each of these diseases, we select the neighbors of the disease and the neighbors of its known metabolites, and examine the associations between these two types of neighbors and the predicted metabolites; Fig. 4 shows Cerebral infarction as an example.
Liver disease refers to lesions that occur in the liver and is very common. It is a general term for high-risk diseases of the liver, including viral hepatitis, liver abscess, alcoholic hepatitis and fatty liver. We carry out a case study of liver disease with our method. The top 10 metabolites predicted by applying our method to the known associations have been confirmed to have some influence on liver disease patients (see Table 3). For example, Glycine (1st) has been shown not only to treat alcoholic hepatitis but also to prevent and treat hepatocellular carcinoma in alcoholic cirrhosis [17]. Moreover, Glycine [18] is an effective immuno-nutrient in the treatment of diverse chronic liver diseases [17]. L-Serine, Creatine, L-Tryptophan and Cholesterol (2nd, 3rd, 4th and 9th) were shown to have a significant influence on fatty liver, one kind of Liver disease [19][20][21][22].
Cerebral infarction is one of the most common cerebrovascular diseases. In the prediction results for Cerebral infarction-related metabolites, the top 10 predicted metabolites have been verified by published references (see Table 4). For instance, Glycine could reduce Cerebral infarction caused by ischemia/reperfusion in mice [23].
Gestational diabetes is a common disease that affects 5 to 6% of pregnant women [24]. Of the top 10 predicted Gestational diabetes-related metabolites, 9 have been confirmed (see Table 5). Increasing evidence indicates that the Substance might play a new role that leads not only to the development of gestational diabetes but also to diabetes mellitus type 2 [24]. Although there is no clear evidence confirming an association between Guanidoacetic acid and Gestational diabetes, some experimental literature shows that the detection of Guanidoacetic acid is a useful indicator of renal tubular dysfunction in the early phase of diabetes mellitus [25].

Discussions
A large body of evidence has revealed that metabolites in the human body reflect human physiological states, including complicated disease pathology.
Although biological experiments can explore potential metabolite-disease associations and provide the data we need, these methods are time-consuming and expensive. Here, we put forward a practical method named KATZMDA, which not only predicts the latent associations between metabolites and diseases accurately but also effectively cuts down time and cost. In this study, we first calculate metabolite and disease similarities by combining their relevant similarities. Second, we establish a heterogeneous network based on the metabolite-disease association network, the metabolite similarity network and the disease similarity network. KATZMDA then searches paths of different lengths on the heterogeneous network and computes a final score for each metabolite-disease pair, which estimates whether the disease is associated with the metabolite. Experimental results demonstrate the superior performance of KATZMDA compared with the other methods in this study. The method has the following advantages. First, considering the characteristics of the data, the KATZ algorithm is applied to predicting associations between metabolites and diseases, which lays a foundation for the effectiveness of our final predictions. Second, we incorporate both topological and biological properties in the disease similarity network. In addition, we set an adaptive parameter to balance the two kinds of properties in order to better explore potential relationships.
Although KATZMDA obtains better prediction results, some limitations cannot be neglected. For the original data, the associations proved between metabolites

Fig. 4 The network between the predicted metabolites and two kinds of neighbors. This graph shows which of the two kinds of neighbors contributes more to the prediction of metabolites. Rectangles represent diseases. Green represents the neighbors of the known metabolites of Cerebral infarction whose similarity ranks are in the top 20, and the associations between them. Yellow represents Cerebral infarction and its relevant metabolites. Blue represents the neighbors of Cerebral infarction whose similarity scores are above 0.6, and their relations. Purple represents the predicted metabolites of Cerebral infarction, and the black edges represent the links between the neighbors of Cerebral infarction and the predicted metabolites of Cerebral infarction

What's more, the similarity of metabolite-metabolite pairs, in theory one of the significant factors for guaranteeing the accuracy of the results, contributes little to the prediction (see Fig. 4). Therefore, in the future we need to take the biological characteristics of metabolites into consideration in addition to their topological characteristics.

Conclusions
By mining a large amount of useful data about metabolites and diseases, our method yields reliable prediction scores that can be used to generate new hypotheses about metabolite-disease associations, which may help identify new research trends and boost interdisciplinary studies. The experimental results indicate that our method is powerful. Moreover, case studies of three common diseases further demonstrate the applicability of the method. Uncovering metabolite-disease associations is of great significance for understanding disease mechanisms and advancing biology through integrated interdisciplinary research.

Methods
Human metabolite-disease associations network
The known metabolite-disease associations are extracted from the Human Metabolome Database (HMDB), which has abundant information about small molecule metabolites found in the human body [7]. In this study, we download the HMDB data and extract the associations between metabolites and diseases. Because our method uses disease semantic similarity, we select from the extracted associations the diseases that have a DOID, together with their relevant metabolites. Finally, 4537 metabolite-disease associations, involving 216 diseases and 2262 metabolites, are extracted from the initial data to establish the known metabolite-disease association network (see Fig. 5). For simplicity of expression, an adjacency matrix M of size nd × nm is constructed to describe the metabolite-disease associations, where nm and nd represent the number of metabolites and diseases, respectively. If a disease i has been confirmed to have an association with a metabolite j, then M(i,j) = 1; otherwise, M(i,j) = 0.
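The construction of the adjacency matrix M can be illustrated with a minimal sketch. The function name `build_association_matrix` and the placeholder disease and metabolite names are hypothetical; only the convention that rows are diseases, columns are metabolites and known associations are marked with 1 comes from the text.

```python
import numpy as np

def build_association_matrix(pairs, diseases, metabolites):
    """Build the nd x nm binary matrix M from (disease, metabolite) pairs."""
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(metabolites)}
    M = np.zeros((len(diseases), len(metabolites)), dtype=int)
    for disease, metabolite in pairs:
        # M(i, j) = 1 when disease i has a known association with metabolite j
        M[d_index[disease], m_index[metabolite]] = 1
    return M

# Placeholder identifiers, not real HMDB records.
diseases = ["disease_A", "disease_B"]
metabolites = ["metabolite_1", "metabolite_2", "metabolite_3"]
pairs = [("disease_A", "metabolite_1"), ("disease_B", "metabolite_3")]
M = build_association_matrix(pairs, diseases, metabolites)
```

With the dataset described above, the same function would produce a 216 × 2262 matrix containing 4537 ones.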

Disease semantic similarity
According to the MeSH database, we can obtain detailed information about diseases because every disease has its own unique DAG (Directed Acyclic Graph), which reflects the correlations between diseases [26]. For example, the DAG of disease D can be defined as DAG(D) = (D, T(D), E(D)), where T(D) consists of disease D itself and all its ancestor diseases, and E(D) consists of the direct edges from a more general term (parent node) to a more specific term (child node). Additionally, the semantic value of disease D could be calculated as follows [26,27]:

where Δ is a factor affecting the semantic contribution of the edge connecting parent node d with its child node d'. For a given disease D, nodes farther from D make a smaller semantic contribution to D. Moreover, nodes at the same level make the same semantic contribution to disease D [26]. Finally, DSS is used to represent the disease semantic similarity matrix. The semantic similarity between disease i and j could be calculated as follows:

GIP kernel similarity
GIP kernel similarity is applied to association networks of biological entities to measure similarity based on their topological structures [28]. Based on the metabolite-disease association network and the hypothesis that similar metabolites are more likely to show a similar pattern of interaction and non-interaction with diseases, the GIP kernel similarity of metabolites could be calculated as follows [29]:

where the interaction profile IP(m(i)) of metabolite m(i) is a binary vector that records whether metabolite m(i) is associated with each disease. ωm controls the kernel bandwidth and is calculated as follows:

where nm represents the number of metabolites in the metabolite-disease association network. To simplify the experiment, ωm is usually set to 1 according to previous research [28].
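The sketch below illustrates a GIP kernel computed from interaction profiles. The formulas themselves do not appear in the text, so the Gaussian form exp(−ω‖IP(i) − IP(j)‖²) and the normalisation of the bandwidth by the mean squared profile norm are assumptions taken from the usual GIP formulation; `gip_kernel` and `omega_prime` (the quantity set to 1 here) are hypothetical names.

```python
import numpy as np

def gip_kernel(profiles, omega_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    # Bandwidth normalised by the mean squared norm of the profiles (assumed form);
    # this requires at least one non-zero profile.
    omega = omega_prime / sq_norms.mean()
    # Squared Euclidean distance between every pair of profiles
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-omega * sq_dist)

# Toy disease-by-metabolite matrix with made-up entries.
M = np.array([[1, 0, 1],
              [0, 1, 1]], dtype=float)
GM = gip_kernel(M.T)  # metabolite profiles are the columns of M
```

Passing `M` instead of `M.T` gives the corresponding kernel between disease profiles, since diseases are the rows of M.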
Thereby, the metabolite GIP kernel similarity matrix (GM) is acquired, and we obtain a metabolite similarity network (MS) based on the GM matrix. In the same way as the metabolite similarity network, the disease similarity network (DM) is established from the disease GIP kernel similarity matrix (GD), which is computed as follows [29]:

Previous research [30] shows that transforming the disease GIP kernel similarity with a logistic function can improve predictive accuracy. Hence, the logistic function from that research is used [30] as follows:

where a = −15, b = log(9999) [30]. GDL represents the improved disease GIP kernel similarity.
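A minimal sketch of the logistic transformation follows. The functional form 1 / (1 + exp(a·x + b)) is an assumption, since the equation itself is not shown in the text; only the constants a = −15 and b = log(9999) are taken from it. The name `logistic_transform` and the toy matrix are made up for illustration.

```python
import numpy as np

def logistic_transform(GD, a=-15.0, b=np.log(9999)):
    """Map GIP similarities through 1 / (1 + exp(a * x + b)) (assumed form)."""
    GD = np.asarray(GD, dtype=float)
    return 1.0 / (1.0 + np.exp(a * GD + b))

# Toy disease GIP kernel matrix with made-up values.
GD = np.array([[1.0, 0.2],
               [0.2, 1.0]])
GDL = logistic_transform(GD)
```

Under this form a similarity of 0 is mapped to 1/10000 and larger similarities are mapped to larger values, so the transformation suppresses weak similarities while keeping strong ones close to 1.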

Integrate similarity for diseases
In this part, in order to tackle the sparsity of the disease semantic similarity matrix and improve accuracy, a new disease similarity matrix (SD) is constructed from the disease semantic similarity matrix (DSS) and the improved disease GIP kernel similarity matrix (GDL). The computing formulas are as follows:

KATZMDA
KATZ, a set of methods for investigating associations in social networks, has gradually spread to bioinformatics. According to the number of paths between two nodes and the length of each path, KATZ calculates a score for the pair of nodes. The higher the score, the greater the potential correlation. Many experiments have confirmed its performance, for example in identifying latent associations between microbes and diseases and between lncRNAs and environmental factors. Building on these successes, the KATZMDA method is adopted to predict metabolite-disease associations in this study (see Fig. 6). Applied to the heterogeneous network, the model produces a score matrix that reflects the possible association of each metabolite-disease pair. Generally, the number of paths between metabolite i and disease j and the different lengths of these paths [31] need to be taken into consideration when we calculate the potential association between metabolite i and disease j in the known metabolite-disease association network. M^l(i,j) represents the number of paths of length l linking metabolite i and disease j, and k represents the length of paths between metabolite i and disease j. Because paths have different lengths, we gather all paths of different lengths between metabolite i and disease j. According to previous studies [32,33], longer paths have less influence than shorter ones between two nodes. So we adopt a non-negative coefficient δ to control the influence of paths of different lengths [32]. If l1 < l2, then δ^l2 < δ^l1.

Fig. 6 Flowchart of KATZMDA
Accordingly, the latent association of each metabolite-disease pair can be expressed as the entry Z(m(i), d(j)) of matrix Z:
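A minimal sketch of a truncated KATZ score on the heterogeneous network is given below. The equation itself is not shown in the text, so this is an assumption based on the usual KATZ formulation: the network is written as a block adjacency matrix built from SD, M and MS, powers of that matrix are weighted by δ^l for l = 1, ..., k, and the disease-to-metabolite block is read off. The function name `katz_scores` and the toy inputs are hypothetical.

```python
import numpy as np

def katz_scores(M, SD, MS, delta=0.1, k=2):
    """Truncated KATZ scores between diseases (rows) and metabolites (columns).

    M  : (nd, nm) disease-by-metabolite association matrix
    SD : (nd, nd) disease similarity matrix
    MS : (nm, nm) metabolite similarity matrix
    """
    nd, nm = M.shape
    # Block adjacency matrix of the heterogeneous network (assumed construction)
    A = np.block([[SD, M],
                  [M.T, MS]]).astype(float)
    total = np.zeros_like(A)
    power = np.eye(nd + nm)
    for l in range(1, k + 1):
        power = power @ A               # A^l: weighted count of walks of length l
        total += (delta ** l) * power   # longer walks are damped by delta^l
    return total[:nd, nd:]              # disease-to-metabolite block

# Toy inputs with made-up values.
M = np.array([[1, 0, 1],
              [0, 1, 1]], dtype=float)
SD = np.array([[1.0, 0.3],
               [0.3, 1.0]])
MS = np.eye(3)
Z = katz_scores(M, SD, MS, delta=0.1, k=2)
```

For k = 2 the returned block reduces to δ·M + δ²·(SD·M + M·MS), which shows how the similarity networks let a disease receive a score for a metabolite it is not yet linked to.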