The difference in the number of nodes and interactions between the original Manual HCM model in CellDesigner XML format and its uploaded version is caused by the incompatibility of the Cytoscape and CellDesigner XML formats. The incompatibility is also evident from visual inspection of the network uploaded to Cytoscape/NDEx [41,42,43], where empty elements (reactions represented as nodes) constitute 53.95%. The inaccurate number of elements and the misconstructed visual representation raised questions regarding the reliability of the CellDesigner XML format in any Cytoscape analysis.
Visual inspection of networks revealed a weakness of the machine-curated models: the absence of compartments, which can be important for diseases like HCM, where a molecular signal is context-specific (organelle, cell, tissue, organ).
When the number of elements and interactions in models is taken as a criterion, the machine-curated models proved to be a richer source of information. Whether that abundance is noise or a broader view of the topic is yet to be determined.
The general problem of machine-curated models is the misleading labeling of the elements. Abbreviations like LV (a common abbreviation for the left ventricle in HCM articles) are turned into amino acid sequences (Leu-Val). Elements starting with Greek letters (e.g. α-adrenergic receptor) are turned into labels that consist of Greek letters only (e.g., α).
Network analysis of the generated models
Comparing the original Manual HCM model in CellDesigner XML format and the same model (same elements and interactions) transcribed to the network table, we got different values for topological parameters in network analysis for all relevant measures. Taken together with the unsatisfactory result of upload for the model in CellDesigner XML format, we suggest that, although this format is readable by some Cytoscape tools, it should not be used for network analysis.
The average number of neighbors is the highest in the INDRA-assembled PubMed+PathwayCommons HCM model and the lowest in the Truncated INDRA DB HCM model. This is as expected, because the INDRA-assembled PubMed+PathwayCommons HCM model is built using a “neighborhood” query for the list of genes associated with HCM. A “neighborhood” query returns the neighborhood around a set of source genes, which is then incorporated in the model; it adds both elements and their neighbors to a model at the same time. The statements of the Truncated INDRA DB HCM model were chosen based only on the correctness of a limited set of statements, so the discontinuity (manifested also as a lack of neighborhood connections) in the model was expected. All other models have a comparable average number of neighbors, with an element usually having two neighbors.
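As an illustration of this parameter, the following minimal sketch computes the average number of neighbors from an edge list. It assumes an undirected network in which self-loops and duplicate edges are ignored; the function name and the toy node labels are hypothetical and are not taken from the models analyzed here.

```python
def average_neighbors(edges):
    """Mean number of distinct neighbors per node (undirected, no self-loops)."""
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue  # assumption: self-loops do not count as neighbors
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    if not neighbors:
        return 0.0
    return sum(len(n) for n in neighbors.values()) / len(neighbors)


# Toy ring of four elements: every element has exactly two neighbors.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
print(average_neighbors(toy_edges))
```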
Network diameter indicates how distant the two most distant nodes are. It is a parameter of graph “compactness” (overall proximity between nodes). In order to compare the compactness of graphs of different sizes, we determined the network diameter per element. The Tabular manual HCM model was far more compact than the machine-curated models. At the same time, network diameter per element for the Manual HCM model had the lowest values, probably due to the incompatible format.
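The normalization can be sketched as follows. The sketch assumes that "diameter per element" means the diameter divided by the number of nodes, that the network is treated as undirected, and that only reachable pairs contribute to the diameter; all function names are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter(adj):
    """Longest shortest path over all reachable pairs of nodes."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)


def diameter_per_element(edges):
    """Diameter divided by the number of nodes (assumed normalization)."""
    adj = build_adjacency(edges)
    return diameter(adj) / len(adj)


# Toy chain A-B-C-D: diameter 3 over 4 elements.
print(diameter_per_element([("A", "B"), ("B", "C"), ("C", "D")]))
```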
Characteristic (average) path length represents “closeness” in a network. It is defined as the average distance between all pairs of its nodes. The characteristic path length is largest for the Tabular manual HCM model, closely followed by the INDRA DB HCM model, INDRA-assembled PubMed HCM model, INDRA-assembled PubMed+PathwayCommons HCM model, and Truncated INDRA DB HCM model. Characteristic (average) path length for the Manual HCM model has a value of 1, which is probably the result of the incompatible CellDesigner XML format.
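A minimal sketch of this definition is given below, assuming an undirected network in which only connected pairs of nodes are averaged (an assumption about how disconnected networks are handled). The function name is hypothetical. The second toy network shows one way in which a value of exactly 1 can arise: when every connected pair of nodes is directly adjacent, as in a network fragmented into isolated edges.

```python
from collections import deque


def characteristic_path_length(edges):
    """Average shortest-path length over all connected pairs of nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search from each node.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1  # reachable nodes other than the source
    # Each pair is counted twice in both sums, so the ratio is unaffected.
    return total / pairs if pairs else 0.0


chain = [("A", "B"), ("B", "C"), ("C", "D")]
fragments = [("A", "B"), ("C", "D")]  # two isolated edges
print(characteristic_path_length(chain))
print(characteristic_path_length(fragments))
```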
The clustering coefficient is a measure of local cohesiveness. The clustering coefficient of a network is the average of the individual clustering coefficients of all its nodes. It is the largest for the Tabular manual HCM model. The Manual HCM model has a clustering coefficient of 0.0.
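The averaging can be sketched as follows, assuming an undirected network and assuming that nodes with fewer than two neighbors contribute a coefficient of 0 (conventions differ between tools). The function name is hypothetical. The star-shaped toy network illustrates that a coefficient of 0.0 is obtained whenever no two neighbors of any node are connected to each other.

```python
from itertools import combinations


def average_clustering(edges):
    """Mean of the local clustering coefficients of all nodes."""
    adj = {}
    for u, v in edges:
        if u != v:  # assumption: self-loops are ignored
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # assumed convention: coefficient 0 for such nodes
        # Count edges that connect pairs of neighbors of this node.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


triangle = [("A", "B"), ("B", "C"), ("C", "A")]
star = [("hub", "A"), ("hub", "B"), ("hub", "C")]
print(average_clustering(triangle))
print(average_clustering(star))
```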
Network density is the number of existing relationships relative to the number of possible relationships. Dense networks are more important for control than for information. Dense networks tend to generate a lot of redundant information. Large networks tend to be sparse.
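For an undirected network with N nodes and E edges, this ratio is commonly written as 2E / (N(N - 1)). A minimal sketch under that assumption (no self-loops, duplicate edges counted once; the function name is hypothetical):

```python
def network_density(edges):
    """Existing edges divided by the number of possible node pairs."""
    nodes = set()
    unique_edges = set()
    for u, v in edges:
        nodes.update((u, v))
        if u != v:
            unique_edges.add(frozenset((u, v)))  # undirected, deduplicated
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(unique_edges) / (n * (n - 1))


# A chain of four elements uses 3 of the 6 possible pairs.
print(network_density([("A", "B"), ("B", "C"), ("C", "D")]))
```

Because the number of possible pairs grows quadratically with the number of nodes, a network that gains nodes without a matching quadratic gain in edges becomes less dense.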
Nodes’ centrality scores
There was no consensus between networks about the top elements in terms of centrality measures. This result is partially a consequence of diverse labeling between models, along with inconsistent labeling within models. A few elements were found in the intersections of these sets, but they reflect the use of the same labeling principle combined with agreement on the highest values of centrality measures. Conclusions regarding the consensus did not depend on the choice of centrality measure. The effect of the different numbers of elements in networks on centrality measures, and on the consequent comparison of the top 10% of nodes, is hard to predict and generalize and could be the subject of future research. Although this issue is partially and roughly resolved by using the same proportion of elements (10%), the consensus between networks about the top elements in terms of centrality measures is affected by the number of elements in networks, with an impact and magnitude that are yet to be estimated.
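The comparison of top-ranked elements can be sketched as follows. The sketch uses degree as a stand-in for the centrality measures (an assumption made for brevity), takes the top 10% of nodes in each network, and intersects the resulting sets. The function names and the labels "calcium" and "Ca2+" are hypothetical and chosen only to show how two labels for the same element prevent a consensus.

```python
import math


def top_fraction_by_degree(edges, fraction=0.10):
    """Labels of the top `fraction` of nodes ranked by degree."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    k = max(1, math.ceil(fraction * len(degree)))
    # Ties are broken alphabetically so that the ranking is deterministic.
    ranked = sorted(degree, key=lambda n: (-degree[n], n))
    return set(ranked[:k])


def consensus(top_sets):
    """Elements that are top-ranked in every network."""
    return set.intersection(*top_sets) if top_sets else set()


# Two toy networks of 10 nodes each with the same hub under different labels.
net_a = [("calcium", f"a{i}") for i in range(9)]
net_b = [("Ca2+", f"b{i}") for i in range(9)]
net_c = [("calcium", f"c{i}") for i in range(9)]

print(consensus([top_fraction_by_degree(net_a), top_fraction_by_degree(net_c)]))
print(consensus([top_fraction_by_degree(net_a), top_fraction_by_degree(net_b)]))
```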
The most important nodes
Although the genuinely important nodes were estimated as important in all the models, the INDRA-assembled PubMed+PathwayCommons HCM model had the largest number of less-expected elements estimated as the most important.
For all models, most of the nodes in the group of elements estimated as the least important are indeed less important for HCM. However, the same group contained some elements that are considered important. We suggest that this happens because of diverse labeling of closely related or identical elements. The k-shell decomposition algorithm assigns a weight based on the degree of a node (the number of connections that it has to other nodes) and its adjacent nodes. Accordingly, diverse labeling scatters these elements and thus makes them less connected.
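The effect of scattered labels can be illustrated with a minimal sketch of the classic, unweighted k-shell decomposition. This is a simpler stand-in for the weighted variant used in the analysis, and the function name and node labels are hypothetical. In the toy example, splitting the connections of element "X" between two labels lowers the shell to which it is assigned.

```python
def k_shell(edges):
    """Classic k-shell index of each node, obtained by iterative peeling."""
    adj = {}
    for u, v in edges:
        if u != v:  # assumption: self-loops are ignored
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    shell = {}
    remaining = set(adj)
    k = 0
    while remaining:
        # Move to the smallest degree still present in the network.
        k = max(k, min(degree[n] for n in remaining))
        peel = [n for n in remaining if degree[n] <= k]
        while peel:
            node = peel.pop()
            if node not in remaining:
                continue
            remaining.discard(node)
            shell[node] = k
            # Removing a node lowers the degree of its remaining neighbors.
            for nbr in adj[node]:
                if nbr in remaining:
                    degree[nbr] -= 1
                    if degree[nbr] <= k:
                        peel.append(nbr)
    return shell


core = [("A", "B"), ("A", "C"), ("B", "C")]
consistent = core + [("X", "A"), ("X", "B"), ("X", "C")]
scattered = core + [("X", "A"), ("X", "B"), ("X_alias", "C")]

print(k_shell(consistent)["X"])
print(k_shell(scattered)["X"], k_shell(scattered)["X_alias"])
```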
Venn diagrams for the most important nodes of all networks revealed that a consensus is achieved with respect to calcium, while other elements of the 95th-percentile bucket were among the most important in only a few models.
Venn diagrams for the least important nodes of all networks revealed that there is no consensus about the least important elements either, which is as expected because those elements represent noise or additional (non-essential) information.
In an interpretation context, wk-shell decompositions and measures of centrality both tell us about the importance of a node, but wk-shell decompositions and each of the centrality measures have different criteria for what is important and how it is estimated (i.e., calculated).
Reliability of interactions
The PE-measure tool demonstrated useful noise reduction in networks, especially in the INDRA DB model. We suggest that the combination of INDRA DB and PE-measure (or equivalent) tools could be beneficial for other disease models as well. The estimated best reliability threshold could also serve as a rough assessment of the level of noise in models. In this respect, the INDRA-assembled PubMed+PathwayCommons HCM model and INDRA DB model contain much more noise than the Tabular manual HCM model, INDRA-assembled PubMed HCM model, and especially the Truncated INDRA DB HCM model (which has the lowest estimated reliability threshold).
At the moment, there is no strict, straightforward, or objective way to estimate where the border lies between clutter and the molecular elements definitely involved in the disease.
Disease modelers interested in the domain knowledge consistency of models might be interested in what combinations of the applied noise-removal technique and each of these model-generation techniques could bring, since model-generation techniques do not all generate the same type of clutter.
Cooperatively working elements
Most of the determined functional modules (cooperatively working elements) are possible and relevant for HCM (Additional file 3). All the machine-curated models contained ambiguous elements (due to imprecise labeling), except the Truncated INDRA DB, from which such elements were excluded before construction. All machine-curated models contained exogenous elements, except the Truncated INDRA DB. In the INDRA-assembled PubMed+PathwayCommons HCM model, functional modules containing exogenous elements dominated. Although these functional modules do not represent HCM itself properly, this approach could be interesting in cases where interactions between diseases and external factors are studied.
Factors that affect the quality of machine-curated models
Reading systems’ performance
We propose assigning weights to statements extracted by a reading system that is favorable with regard to a particular use-case instead of giving preference to more numerous identical statements extracted. The choice of the reading system (and proposed weighting) is a trade-off between quantity and quality and could be guided by the molecular context and type of reactions important for a disease.
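One possible reading of this proposal is sketched below. Each statement is scored by the highest weight among the reading systems that extracted it, so repeated identical extractions do not raise the score. The function name, the reader names, and the weights are all hypothetical placeholders; suitable weights would have to be chosen per use-case.

```python
def score_statements(extractions, reader_weights, default_weight=1.0):
    """Score each statement by the best weight among readers that extracted it.

    extractions: iterable of (statement, reader) pairs.
    reader_weights: mapping from reader name to a use-case-specific weight.
    """
    best = {}
    for statement, reader in extractions:
        weight = reader_weights.get(reader, default_weight)
        # Repeated identical extractions do not accumulate.
        best[statement] = max(best.get(statement, 0.0), weight)
    return best


# Hypothetical weights for two placeholder reading systems.
weights = {"reader_a": 0.3, "reader_b": 0.9}
extractions = [("S1", "reader_a")] * 5 + [("S2", "reader_b")]

scores = score_statements(extractions, weights)
print(sorted(scores, key=scores.get, reverse=True))
```

In this toy case, statement S2 ranks above S1 even though S1 was extracted five times, because ranking depends on the weight of the reading system rather than on the number of identical extractions.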
Although the RLIMS-P reading system demonstrated higher statement extraction performance, it is specifically designed to extract protein phosphorylation information. Favoring RLIMS-P because of its high extraction performance and, consequently, its large volume of phosphorylation statements should be reconsidered for each disease of interest individually. Phosphorylation is the most common post-translational protein modification and a key component of signal transduction. However, statements about phosphorylation in HCM overshadowed other reaction types in the INDRA DB. Although we cannot pinpoint the exact contribution of phosphorylation to HCM mechanisms, especially in terms of understudied (“dark”) kinases, our suggestion is that phosphorylation statements should be dosed according to the model purpose. When models are built to enable hypothesis generation, an abundance of phosphorylation statements is useful; when the purpose is to find key elements, they could produce an imbalance in the analysis.
Query constraints in machine-curated models
In the HCM query by MeSH, the average year of publication is 10 years removed from current research, which makes a difference in the overall representation of HCM, as more recent HCM research has contributed a substantial body of additional knowledge. Moreover, the query by MeSH returned many animal studies, which mostly add noise to models for diseases like HCM, where animal models do not fully replicate human HCM. For those reasons, we suggest that, for machine-curated models, the best approach to finding elements for HCM models is to query by keywords. Relying on MeSH, either fully or partially, should be avoided. HCM research tagged with MeSH is usually basic research, whereas applied HCM research is usually easier to find using keywords.
Interactive HCM map
The interactive HCM Map is both human- and machine-readable and represents a platform for sharing and gathering molecular mechanisms of HCM and a standalone basis for in silico exploration. It also serves as a template for uploading and visualizing multiple datasets. It is the only publicly available knowledge resource dedicated to HCM.
To the best of our knowledge, this is the first attempt to compare human and machine-curated disease models and examine how the choice of different query constraints in machine approaches can affect disease modeling.
Hoyt et al. (2019) manually evaluated 2989 statements generated by INDRA using REACH and Sparser readers containing studied genes from MEDLINE abstracts and PubMed Central full-text articles, following which 30.7% of statements were marked as correct, 48.6% required manual correction, and 20.7% could not be corrected. The criterion for correctness was that “all” aspects of the statement, including the subject and object entities, relationship type, phosphorylation, and other post-translational modifications, were extracted to the same extent as careful manual curation could. The authors identified errors in BEL statements extracted from INDRA. The most common error was incorrect named entity recognition. Other common errors were the improper assignment of the subject and object, semantic incorrectness due to the presence of a negation word, and errors arising from evidence that did not actually include relations between the subject and object entities.
Allen et al. (2015) showed that the DRUM system (Deep Reader for Understanding Mechanisms, a version of the general-purpose TRIPS NLP system customized for extraction of molecular mechanisms from biomedical text) has performance (precision and recall) close to that of human experts in extracting molecular mechanisms from text, and it was the best-performing system among those evaluated. The same authors also found high precision among human biologists, but considerable non-overlap in the answers they provided. That accounted for the approximately 0.50 recall for either of the human teams they observed, using the pooled answers of the two teams as the gold standard.
Cohen et al. (2015) carried out a test with two expert human biologists and reading systems. Their task was to identify as many relationships as possible between six text passages and a prior model. Four kinds of relationships between texts and prior models were probed: the text might corroborate or contradict something in the model; it might introduce a new mechanism or a new relationship between entities in the model. Before the test began, biology experts on the evaluation team prepared a gold standard, a list of assertions. Recall was defined as the fraction of relationships that should have been found that were actually found, and precision as the fraction of the relationships found that were in the gold standard. The two expert human biologists’ recall scores were less than 0.5 (they failed to notice roughly half of the relationships between the texts and the prior models). However, their precision was very high: 0.86–1.00. They noticed different relationships and disagreed with each other. They also noticed some relationships that the evaluation team had not. For the same task, the best recall score for a reading system was 0.4, with an associated precision score of 0.67. The least effective system achieved 0.03 recall at 0.33 precision. The authors assumed that human expertise probably includes an ability to not notice assertions that are “obvious” or “unimportant”.
Allen et al. (2018) studied how different extensions and customizations of the TRIPS parser affected performance. Bose et al. (2020) used decisions from a statistical word sense disambiguation system, SupWSD, to advise the logical semantic parser TRIPS. Significant improvement across all metrics was found using this approach, with roughly 14% improvement to raw accuracy, although the research was not conducted on biomedical literature specifically.
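The definitions of recall and precision quoted above can be written down directly. The sketch below treats the relationships found and the gold standard as sets of identifiers; the function name and the toy identifiers are hypothetical and do not reproduce the data of the cited study.

```python
def precision_recall(found, gold):
    """Precision and recall of a set of found relationships against a gold standard."""
    found, gold = set(found), set(gold)
    hits = found & gold
    # Precision: fraction of the relationships found that are in the gold standard.
    precision = len(hits) / len(found) if found else 0.0
    # Recall: fraction of the gold-standard relationships that were found.
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# Toy case: a reader finds 4 of 10 gold-standard relationships and nothing else.
gold = {f"r{i}" for i in range(10)}
found = {"r0", "r1", "r2", "r3"}
print(precision_recall(found, gold))
```

The toy case shows how high precision can coexist with low recall: everything that was found is correct, yet more than half of the gold standard was missed.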
While other authors have focused on reading systems’ performance as parsers (precision, recall, and F1 score—often defined differently), we focused on their potential to build models that would be equal to the models built by humans: containing reliable information (accuracy of extracted statements, based on human estimation) and providing complete information (extraction performance). We believe that the reliability of the information is the principal aspect of any reading system for biomedical knowledge curation.
Interactive disease maps have so far been generated for Alzheimer’s disease, cancer, Parkinson’s disease, the influenza A virus replication cycle, rheumatoid arthritis, asthma [61, 62], inflammation, and others.
The CellDesigner XML format should not be used for network analysis in Cytoscape. A higher level of interoperability between CellDesigner XML (and related) and INDRA-generated formats and platforms would be useful because only in that case would direct comparison or better complementation of human- and machine-curated models be possible.
In machine-curated models, query constraints strongly affect the final disease models, so they should be chosen carefully and according to the purpose, with complete information about the advantages and disadvantages that each approach brings. Although we have shown that the PubMed database is a reliable source of information for human reading and that the REACH reading system is equally or more accurate than other reading systems, and although we suggest that a period of the “last 10 years” is optimal for HCM research, the strategy that unites all these components produced a suboptimal HCM model (noisy and containing blurred key pathways). More research is required on the advantages and disadvantages of particular query constraints and their combinations for machine-curated models.
There is an urgent need for quality control criteria for disease models. Owing to the many techniques available for generating disease models, the formalization of minimal requirements for adequate quality of disease models, or the definition of methods for estimating their quality, is necessary. Such an approach could also accelerate and direct the development of more sophisticated techniques for building useful and representative disease models.
The Interactive HCM Map represents the body of knowledge available today, a summary of all major molecular pathways involved in HCM. Since some molecular mechanisms underlying HCM are still unknown, more interactions have yet to be identified. The HCM map will be constantly updated and improved, involving the community of HCM signaling experts.
Although our goal was a comprehensive comparison of models produced by different approaches (as a whole, by the most central and important elements, by the reliability of interactions and the level of noise they contain, as well as by cooperatively working elements), there is no single correct way to compare models and their quality. Moreover, since the molecular mechanisms underlying HCM are still only partially understood, we cannot claim that some interactions are more important or less possible—we can only assess the extent to which results are in line with the literature. Our analysis covered only the first phase of biomedical knowledge curation (and not the subsequent manual, semi-automatic, or automatic re-curation), so as to isolate only the effects of the selections made in this phase. Since we studied only one disease, we cannot generalize our findings to all diseases and models. In manual disease modeling, different persons cannot produce completely consistent results. Consequently, our results show the features of a single manual model made by a particular person rather than features of manual disease modeling itself. Currently there are no criteria for the diverse characteristics of different models.
The rapid growth and accumulation of biomedical knowledge demands its structuring so that computers can assist in its interpretation and comprehensive understanding. Disease models still need plenty of human input in the curation or re-curation phases, although semi-automatic or automatic re-curation options are emerging and can reduce time-consuming manual effort. Our results show how better performance can be attained even without the development of highly complex technologies. Selections made in the first phase of biomedical knowledge curation can affect overall performance. Our results show the effect of different strategies (techniques, query constraints, and reading systems) that should be considered in this phase. This evaluation also identified approaches that could be combined in order to achieve a specific goal of disease modeling. We anticipate that these results could be helpful for developers of reading systems and model assemblers and may improve performance.
Manual curation represents the gold standard for information extraction in biomedical research and is most suitable for models that will be used as a basis for generating mathematical models, because only high-quality elements will be incorporated into the model. On the other hand, manual curation is time- and effort-consuming. Automated curation is useful in situations where more elements are better, such as new hypothesis generation, because it provides more substance.
INDRA’s BioPAX API for the Pathway Commons database query is useful in the automatic approach when paths between sets of genes are important, and especially when microRNAs should be included in the model. INDRA’s PubMed literature client is favorable when the focus is on available biomedical literature. The INDRA Database is preferable when all available information is needed. All automated approaches generate a high level of noise. Although we expected the best results from combining two approaches, namely the use of the INDRA Database (expected to provide a high volume of information) followed by human intervention (expected to rigorously remove the clutter), in our case the model generated was too disconnected to be useful. In our case, the best automated approach for finding molecular mechanisms from clinical research was to query by keywords, while for finding elements from preclinical research, query by MeSH was better. The PE-measure tool demonstrated useful noise reduction in networks.