Humans and machines in biomedical knowledge curation: hypertrophic cardiomyopathy molecular mechanisms’ representation

Glavaški, Mila; Velicki, Lazar

doi:10.1186/s13040-021-00279-2

Research
Open access
Published: 02 October 2021

BioData Mining volume 14, Article number: 45 (2021)

Abstract

Background

Biomedical knowledge is dispersed in scientific literature and is growing constantly. Curation is the extraction of knowledge from unstructured data into a computable form and can be done manually or automatically. Hypertrophic cardiomyopathy (HCM) is the most common inherited cardiac disease, with genotype–phenotype associations still incompletely understood. We compared human- and machine-curated models of HCM molecular mechanisms and examined the performance of different machine approaches for that task.

Results

We created six models representing HCM molecular mechanisms using different approaches and made them publicly available, analyzed them as networks, and tried to explain the models’ differences by the analysis of factors that affect the quality of machine-curated models (query constraints and reading systems’ performance). A result of this work is also the Interactive HCM map, the only publicly available knowledge resource dedicated to HCM. Sizes and topological parameters of the networks differed notably, and a low consensus was found in terms of centrality measures between networks. Consensus about the most important nodes was achieved only with respect to one element (calcium). Models with a reduced level of noise were generated and cooperatively working elements were detected. REACH and TRIPS reading systems showed much higher accuracy than Sparser, but at the cost of extraction performance. TRIPS proved to be the best single reading system for text segments about HCM, in terms of the compromise between accuracy and extraction performance.

Conclusions

Different approaches in curation can produce models of the same disease with diverse characteristics, and they give rise to utterly different conclusions in subsequent analysis. The final purpose of the model should direct the choice of curation techniques. Manual curation represents the gold standard for information extraction in biomedical research and is most suitable when only high-quality elements for models are required. Automated curation provides more substance, but a high level of noise is expected. Different curation strategies can reduce the level of human input needed. Given its rapid growth, biomedical knowledge would benefit greatly if computers could assist in its analysis on a larger scale.

Background

Biomedical knowledge is dispersed across scientific papers and databases and is growing constantly. Biomedical literature can be seen as a large, unstructured data repository [1]. PubMed is a biomedical literature database and supports the search and retrieval of the literature [2]. Filters are used to narrow the search by different criteria (publication date, species, etc.). Each publication in the database has a unique PubMed Identifier (PMID). Medical Subject Headings (MeSH) is a vocabulary thesaurus used for indexing articles for PubMed [3]. Combinations of these and other approaches (e.g., using keywords and key phrases) can be used to constrain database queries. There are also other biomedical databases such as Pathway Commons [4], DrugBank [5], ChEMBL [6], CTDbase [7], miRTarBase [8], and many more.
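
How such constraints combine can be illustrated with a minimal sketch that builds a PubMed query string from MeSH headings, keywords, and filters. The helper name `build_pubmed_query`, the chosen heading and keyword, and the date range are assumptions made for illustration; they are not the queries used in this study. The field tags `[MeSH Terms]`, `[Title/Abstract]`, and `[dp]` follow PubMed's search syntax.

```python
def build_pubmed_query(mesh_terms=(), keywords=(), year_range=None, humans_only=False):
    """Combine MeSH headings, keywords, and filters into one PubMed query string."""
    # MeSH headings and keywords are alternative routes to the same topic: OR them.
    topic = [f'"{term}"[MeSH Terms]' for term in mesh_terms]
    topic += [f'"{word}"[Title/Abstract]' for word in keywords]
    if not topic:
        raise ValueError("at least one MeSH term or keyword is required")
    clauses = ["(" + " OR ".join(topic) + ")"]

    # Filters narrow the result set: AND them onto the topic clause.
    if year_range is not None:
        start, end = year_range
        clauses.append(f"{start}:{end}[dp]")
    if humans_only:
        clauses.append('"humans"[MeSH Terms]')
    return " AND ".join(clauses)


# Illustrative values only (hypothetical, not the study's query).
query = build_pubmed_query(
    mesh_terms=["Cardiomyopathy, Hypertrophic"],
    keywords=["hypertrophic cardiomyopathy"],
    year_range=(2000, 2021),
    humans_only=True,
)
print(query)
```

Dropping the keyword clause leaves a MeSH-only query, which depends on articles having been indexed with that heading.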

Curation is the extraction of knowledge from unstructured data into a structured, computable form [9]. Molecular mechanisms can be extracted from biomedical knowledge resources by manual or automated curation [10, 11]. Manual curation consists of the synthesis and integration of information from the literature, large-scale projects, and databases [9] and represents the gold standard for information extraction in biomedical research [12]. The extracted information about molecular mechanisms can be subsequently visually represented using visual pathway editors such as CellDesigner [10]. One example of an automated approach is the “Integrated Network and Dynamical Reasoning Assembler” (INDRA), which extracts molecular mechanisms from text and biomedical databases and assembles them into executable models [13]. It contains a number of clients for accessing and using resources from biomedical databases (e.g., the Pathway Commons database) and literature clients for retrieving the literature. For the extraction of molecular mechanisms from text, INDRA uses reading systems such as REACH [14], TRIPS [15], Sparser [16], ISI [17], RLIMS-P [18], and Eidos [19]. They extract INDRA statements, intermediate knowledge representations of extracted molecular mechanisms [13]. INDRA statements are then assembled into models [13]. The INDRA Database is built with INDRA, combining content from numerous readers and databases [20].
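
The idea of an intermediate statement representation that is later assembled into a model can be sketched as follows. This is a simplified illustration, not INDRA's actual API: the `Statement` and `Evidence` classes, the `assemble` function, and all element names and identifiers below are hypothetical stand-ins, and INDRA's own statement classes are considerably richer.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    reader: str  # reading system that produced the extraction
    pmid: str    # identifier of the source publication (placeholder values below)
    text: str    # sentence the extraction came from


@dataclass
class Statement:
    kind: str     # interaction type, e.g. "Phosphorylation"
    subject: str  # acting element
    obj: str      # element acted upon
    evidence: list = field(default_factory=list)

    def key(self):
        """Two statements with the same key describe the same mechanism."""
        return (self.kind, self.subject, self.obj)


def assemble(statements):
    """Merge statements describing the same mechanism and pool their evidence."""
    merged = {}
    for st in statements:
        if st.key() not in merged:
            merged[st.key()] = Statement(st.kind, st.subject, st.obj, [])
        merged[st.key()].evidence.extend(st.evidence)
    return list(merged.values())


# Hypothetical extractions: two readers report the same mechanism.
raw = [
    Statement("Phosphorylation", "KinaseA", "ProteinB",
              [Evidence("REACH", "PMID_1", "KinaseA phosphorylates ProteinB.")]),
    Statement("Phosphorylation", "KinaseA", "ProteinB",
              [Evidence("TRIPS", "PMID_2", "ProteinB is phosphorylated by KinaseA.")]),
    Statement("Activation", "ProteinB", "ProteinC",
              [Evidence("Sparser", "PMID_3", "ProteinB activates ProteinC.")]),
]
model = assemble(raw)
for st in model:
    print(st.kind, st.subject, st.obj, sorted(e.reader for e in st.evidence))
```

Keeping the evidence attached to each merged statement makes it possible to ask later which reading system supports which interaction.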

When information is combined, its value increases [9]. Disease maps are comprehensive, knowledge-based representations of disease mechanisms [21]. Representing biomedical knowledge as graphs facilitates the study of complex processes, both as a visual, and thereby more intuitive, representation and as a standardized data structure that is human- and computer-readable [22].

Hypertrophic cardiomyopathy (HCM) is the most common genetic cardiac disease [23,24,25], with a prevalence of 1 in 500 people worldwide [23, 26,27,28,29]. It is characterized by marked variability in expression, ranging from asymptomatic to sudden cardiac death or heart failure [30]. In addition to the direct effects of underlying mutations, gene expression is altered by micro and small noncoding RNAs, and secondary molecular changes occur in many signaling pathways [31]. Many studies have been conducted to decipher the molecular mechanisms underlying HCM; however, genotype–phenotype associations remain incompletely understood [32].

Models made exclusively by manual curation have never been compared with models made exclusively by automated curation. It is also still not known which automated biomedical knowledge curation policies produce disease models of higher quality.

Our aims were to compare human- and machine-curated HCM models, as well as to examine the performance of different machine approaches for the same task.

Results

Constructed models

We created six models representing HCM molecular mechanisms using different approaches and made them publicly available (Table 1). The Manual HCM model was constructed by a human, based on an extensive literature search in PubMed, using CellDesigner. The Tabular manual HCM model was created by manual transcription of species and reactions from the original Manual HCM model CellDesigner XML file to nodes and interactions of a network table in XLSX format. The INDRA-assembled PubMed HCM model was assembled automatically, using INDRA’s PubMed literature client. The INDRA-assembled PubMed+PathwayCommons HCM model was assembled automatically, using INDRA’s PubMed literature client and the Pathway Commons database via INDRA’s BioPAX API. The Truncated INDRA DB HCM model was created using the INDRA Database. Only statements that had been extracted from the text entirely correctly were incorporated into it; after applying the criteria for correctness, 9.27% of statements remained for inclusion. The INDRA DB HCM model was also created using the INDRA Database, but all statements returned by the query were incorporated into it.

Table 1 Constructed models

The numbers of elements and interactions in the models differ markedly, even though all models represent the same disease (HCM). Models created by automated curation contain no compartments (Table 1).

Network analysis of the generated models

Topological analysis

Topological parameters for the networks (Table 2) and network diameter per element (Table 3) were computed.
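
For readers unfamiliar with such parameters, the following minimal sketch computes a few common ones for an undirected network given as an edge list. It is an illustration in plain Python, not the Network Analyzer implementation; the function names and the toy edge list are assumptions. The diameter is taken as the longest shortest path found within any connected component.

```python
from collections import deque


def build_adjacency(edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def eccentricity(adj, source):
    """Longest shortest-path distance from source to any reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return max(dist.values())


def topology_summary(edges):
    adj = build_adjacency(edges)
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2  # each edge is counted twice
    return {
        "nodes": n,
        "edges": m,
        "average_degree": 2 * m / n if n else 0.0,
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "diameter": max((eccentricity(adj, v) for v in adj), default=0),
    }


# Toy network with hypothetical node names.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")]
summary = topology_summary(toy_edges)
print(summary)
```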

Table 2 Topological parameters for HCM models obtained with Network Analyzer

Table 3 Network diameter per element

Nodes’ centrality scores

The intersections of the sets containing the top 10% of elements by centrality measures for each network showed low consensus between networks (Fig. 1). The elements ranked in the top 10% by different centrality measures for each network were visualized (Table 4). Network centrality scores could not be determined for the CellDesigner XML file.
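
The comparison can be sketched as follows: score every node in each network, keep the top 10%, and intersect the resulting sets. The sketch uses degree centrality only, breaks ties at the cutoff alphabetically, and runs on toy networks with hypothetical node names; all of these are assumptions made for illustration and do not reproduce the centrality measures or data of this study.

```python
import math


def degree_centrality(edges):
    """Fraction of the other nodes each node is directly connected to."""
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    n = len(neighbours)
    return {node: len(nb) / (n - 1) for node, nb in neighbours.items()}


def top_fraction(scores, fraction=0.10):
    """Nodes ranked in the top `fraction` by score (at least one node)."""
    k = max(1, math.ceil(fraction * len(scores)))
    # Sort by descending score; ties are broken by node name (an assumption).
    ranked = sorted(scores, key=lambda node: (-scores[node], node))
    return set(ranked[:k])


def consensus(networks, fraction=0.10):
    """Nodes that are in the top fraction of every network."""
    tops = [top_fraction(degree_centrality(edges), fraction)
            for edges in networks.values()]
    return set.intersection(*tops)


# Three toy networks of ten nodes each (hypothetical).
star = [("N1", f"N{i}") for i in range(2, 11)]
mixed = [("N1", f"N{i}") for i in range(2, 7)] + [
    ("N6", "N7"), ("N7", "N8"), ("N8", "N9"), ("N9", "N10")]
other = [("N5", f"N{i}") for i in range(1, 11) if i != 5]

print(consensus({"star": star, "mixed": mixed}))
print(consensus({"star": star, "mixed": mixed, "other": other}))
```

Adding a network whose hub differs empties the intersection, which is the kind of low consensus described above.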

Table 4 Elements ranked as top 10% by centrality measures for each network

The most important nodes

Consensus about the most important nodes was achieved only with respect to one element (calcium), while consensus for other most and least important nodes was lacking (Fig. 2).

Each network was represented as a packed concentric ring sorted by k-shell, with a node color gradient applied according to k-shell (Fig. 3, Additional file 1). Rank and k-shell for each node of each network were calculated (Additional file 2). Cytoscape Wk-decomposition [33] could not be performed on the CellDesigner XML file.
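
As an illustration of what a k-shell assignment means, the sketch below implements the plain, unweighted k-shell decomposition: nodes are peeled off in order of increasing degree, and each node receives the value of k at which it is removed. This is a simplified stand-in and not the weighted procedure of the Cytoscape app used here; the function name and the toy network are assumptions.

```python
def k_shell(edges):
    """Assign each node its k-shell index by iterative peeling (unweighted)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {node: len(nb) for node, nb in adj.items()}
    remaining = set(adj)
    shell = {}
    while remaining:
        # The next shell index is the smallest degree still present.
        k = min(degree[node] for node in remaining)
        peel = [node for node in remaining if degree[node] <= k]
        while peel:
            node = peel.pop()
            if node not in remaining:
                continue  # already removed via another path
            remaining.discard(node)
            shell[node] = k
            # Removing a node lowers its neighbours' degrees; they may now peel too.
            for neighbour in adj[node]:
                if neighbour in remaining:
                    degree[neighbour] -= 1
                    if degree[neighbour] <= k:
                        peel.append(neighbour)
    return shell


# Toy network: a triangle A-B-C with a two-node tail A-D-E (hypothetical names).
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D"), ("D", "E")]
shells = k_shell(toy_edges)
print(sorted(shells.items()))
```

Nodes in the tail end up in the outer shell and the triangle forms the inner shell, which corresponds to the rings of a concentric layout sorted by k-shell.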

Reliability of interactions

A separate reliability threshold was estimated and applied for each model, yielding models with reduced levels of noise (Table 5).

Table 5 Estimated best reliability threshold for each network and models with reduced level of noise

Cooperatively working elements

The number of detected cooperatively working elements (functional modules) differed vastly between networks (Table 6). Models made by machines without later human intervention contained ambiguous and exogenous elements in the detected functional modules (Table 6, Additional file 3). We have proposed likely implications for the detected functional modules in HCM (Additional file 3). The Manual HCM model could not be analyzed using the NCMine app [34].

Table 6 Functional modules

Factors that affect the quality of machine-curated models

Query constraints in machine-curated models

A query based on keywords returns considerably more results than a query by MeSH (Table 7).

Table 7 Number of results as a consequence of different query constraints

The mean year of publication of the papers found by the MeSH-based INDRA Database [20] query, which was used for the INDRA DB HCM model, was x̅ = 2010.27. Of these papers, 43.75% described research conducted on human material, 15.97% on material from humans and other species, and the rest were animal studies.

Reading systems’ performance

The most dominant reading system for the extraction of statements for the INDRA DB HCM model was Sparser, followed by RLIMS-P, REACH, and TRIPS/DRUM (Fig. 4). Reading systems’ extraction performance differed markedly for different reaction types (Table 8). Most extractions per statement were found for different versions of phosphorylation and translocation (Fig. 5).

Table 8 Percent of reading systems’ extractions by different reaction types in INDRA DB HCM model

For all reading systems, the most common problem was that a single extracted statement contained two or more critical issues (a combination of wrong element, misleading element label, wrong interaction, or wrong direction of the interaction). In the case of the Sparser and TRIPS reading systems, this was followed by wrong element and wrong direction of interaction (Fig. 6).

REACH and TRIPS showed much higher accuracy than Sparser (Table 9) but at the cost of extraction performance (Fig. 4, Table 9). The TRIPS reading system proved to be the best single reading system for text segments about HCM when considering a compromise between accuracy and extraction performance (Fig. 4, Table 9).
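
The trade-off can be made concrete with a minimal sketch that computes, per reading system, the share of correct statements among those evaluated and the share of all extractions it contributed. The definitions, the function name, the reader labels, and every count below are hypothetical and chosen only for illustration; they are not the values or the exact accuracy definition behind Table 9.

```python
def reader_summary(counts):
    """Per-reader accuracy and share of all extractions.

    counts maps a reader name to a dict with:
      extracted - number of extractions contributed by the reader
      evaluated - number of its statements that were manually checked
      correct   - number of checked statements judged correct
    """
    total_extracted = sum(c["extracted"] for c in counts.values())
    summary = {}
    for reader, c in counts.items():
        summary[reader] = {
            "accuracy": c["correct"] / c["evaluated"],
            "extraction_share": c["extracted"] / total_extracted,
        }
    return summary


# Hypothetical counts for three unnamed readers.
example_counts = {
    "reader_x": {"extracted": 600, "evaluated": 100, "correct": 30},
    "reader_y": {"extracted": 300, "evaluated": 100, "correct": 70},
    "reader_z": {"extracted": 100, "evaluated": 50, "correct": 40},
}
result = reader_summary(example_counts)
for reader, stats in result.items():
    print(reader, round(stats["accuracy"], 2), round(stats["extraction_share"], 2))
```

In this toy data the reader that contributes most extractions is the least accurate, so choosing a single reader means weighing the two columns against each other.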

Table 9 Accuracy of Sparser, REACH, and TRIPS reading systems

For the INDRA DB HCM model, 44.19% of the statements extracted by the Eidos reading system (the result of 20.65% of total extractions by Eidos) were meaningless and inapplicable (Additional file 4). These were structurally complex statements that brought puzzling noise to the model. For statements representing simple interactions (consisting of one subject, one object, and an interaction between them), Eidos extracted possible and applicable statements.

Interactive HCM map

The Interactive HCM map is available at https://silicofcm.eu/interactive-map/. It is hosted on the MINERVA (Molecular Interaction NEtwoRks VisuAlization) platform [35,36,37] which interfaces with DrugBank [5], ChEMBL [6], CTDbase [7], and miRTarBase [8]. The majority of the proteins that have a 3D structure already resolved and available in the Protein Data Bank can be directly visualized and explored using MolArt [38], a built-in MINERVA platform visualization tool.

Plugins enable additional onsite analysis. In maps with defined pathway areas, the Gene set enrichment analysis (GSEA) plugin [37] retrieves active data overlays and performs enrichment analysis, highlighting pathways significantly enriched for data overlays. These data can be user-provided. The Adverse drug reactions plugin [37] links an external data file to the corresponding map elements. Targets of drugs with identified adverse reactions are shown in the map and can be filtered. The Disease-variant associations plugin [37] indicates genes with variants associated with a given disease. The Map exploration plugin [37] enables focused molecular interaction network exploration (e.g., of the neighborhood of a molecule appearing multiple times in a network). The Centrality plugin [39] calculates network topology values. The Overlays plugin [39] automatically creates, displays, or removes multiple overlays from uploaded data files.