Developing a modular architecture for creation of rule-based clinical diagnostic criteria

Background With recent advances in computerized patient records system, there is an urgent need for producing computable and standards-based clinical diagnostic criteria. Notably, constructing rule-based clinical diagnosis criteria has become one of the goals in the International Classification of Diseases (ICD)-11 revision. However, few studies have been done in building a unified architecture to support the need for diagnostic criteria computerization. In this study, we present a modular architecture for enabling the creation of rule-based clinical diagnostic criteria leveraging Semantic Web technologies. Methods and results The architecture consists of two modules: an authoring module that utilizes a standards-based information model and a translation module that leverages Semantic Web Rule Language (SWRL). In a prototype implementation, we created a diagnostic criteria upper ontology (DCUO) that integrates ICD-11 content model with the Quality Data Model (QDM). Using the DCUO, we developed a transformation tool that converts QDM-based diagnostic criteria into Semantic Web Rule Language (SWRL) representation. We evaluated the domain coverage of the upper ontology model using randomly selected diagnostic criteria from broad domains (n = 20). We also tested the transformation algorithms using 6 QDM templates for ontology population and 15 QDM-based criteria data for rule generation. As the results, the first draft of DCUO contains 14 root classes, 21 subclasses, 6 object properties and 1 data property. Investigation Findings, and Signs and Symptoms are the two most commonly used element types. All 6 HQMF templates are successfully parsed and populated into their corresponding domain specific ontologies and 14 rules (93.3 %) passed the rule validation. Conclusion Our efforts in developing and prototyping a modular architecture provide useful insight into how to build a scalable solution to support diagnostic criteria representation and computerization. Electronic supplementary material The online version of this article (doi:10.1186/s13040-016-0113-5) contains supplementary material, which is available to authorized users.


Introduction
Diagnostic criteria are one of the most valuable knowledge sources for supporting clinical decision-making and improving patient care [1][2][3][4]. The clinical informatics research community has been seeking a solution to standardize and computerize clinical diagnosis criteria for all clinical domains. Diagnostic criteria are usually scattered over different media such as medical textbooks, literatures and clinical practice guidelines mostly in textual format. Furthermore, they are usually described in different narrative style, granularity, term usage and inner logic. There is an urgent need to develop a standard information model specification and a unified architecture to support the diagnostic criteria modeling and representation, and thereby enabling computerization and machine interpretability. To achieve a unified architecture, the following aspects should be considered: a) an information model that supports standard representation of diagnostic criteria; b) semantic interoperability and expressivity of a knowledge representation language; c) rule-based reasoning capability over factual knowledge; and d) a standard exchange format for different layers of the architecture.
The content model of the International Classification of Diseases (ICD)-11 developed by the World Health Organization (WHO) is a structured framework that defines "a classification unit" in ICD in a standard way in terms of its components that allows computerization [5]. Under the definition of the content model, each ICD entity can be seen from different dimensions and there are currently 13 main dimensions in the content model. One purpose of the ICD-11 content model is to use different settings of these dimensions or parameters to construct different sets of diagnostic criteria, so different elements in the content model come together to define the diagnosis criteria of a particular ICD category. The ICD-11 content model entails a diagnostic criteria computerization at a high level and it has achieved consensus among the members in the ICD Revision Group, thus we regard it as a feasible framework for constructing a unified architecture for computable diagnostic criteria creation.
The Quality Data Model (QDM) [6] is an information model that describes clinical concepts in a standardized format to enable electronic quality-performance measurement in support of the Meaningful Use Program of the Health Information Technology for Economic and Clinical Health Act [7,8]. It allows electronic clinical quality measure (eCQM) developers and many clinical researchers or performers to describe clearly and unambiguously the data required to calculate the performance measure. QDM enables electronic health records (EHR) and other clinical electronic system to share a common understanding and interpretation of the clinical data. To describe different parts of the clinical care process, QDM defines many datatypes to specify the context in which each category is used. As a QDM serialization format, the Health Quality Measure Format (HQMF) [9] is a HL7 standard that formally represents a eCQM (data elements, logic, definitions, etc.) as an electronic document to support consistent and unambiguous interpretation.
Semantic Web technologies provide a homogeneous framework that enables an ontology-based modeling with the Web Ontology Language (OWL) [10] and supports rule-based reasoning with the Semantic Web Rule Language (SWRL) [11]. In a semantic web environment, OWL is a W3C recommendation for ontology description and modeling and SWRL is a rule language to formalize and represent rules to support knowledge representation and reasoning. In the present study, we utilize and evaluate OWL and SWRL-based representation languages for formalizing complex inner logic for diagnostic criteria.
The objective of our study is to develop and evaluate a modular architecture for enabling the creation of rule-based clinical diagnostic criteria leveraging Semantic Web technologies. We prototype and evaluate a number of key components of the architecture, including an upper ontology and a transformation tool. We select a collection of QDM datatypes that are commonly used in describing diagnostic criteria and integrated them into ICD-11 Content Model for building an upper ontology. We perform the rule-based criteria translation and interaction following the HQMF standard format and propose extensions where needed.

Related work
Previous studies have been conducted in integrating and formally expressing diagnostic rules from different perspectives. These rules are usually extracted from free-text-based clinical guidelines or diagnostic criteria, and integrated into computerized decision support systems to improve clinical performance and patient outcomes [12,13]. The related studies mainly include as follows.
(1)Clinical guideline computerization and Computer Interpretable Guideline (CIG) Systems. Various computerized clinical guidelines and decision support systems that incorporate clinical guidelines have been developed. Researchers have tried different approaches on computerization of clinical practice guidelines [12,[14][15][16][17][18]. Since guidelines cover many complex medical procedures, the application of computerized guideline in real-world practice is still very limited. However, the methods used to computerize guidelines are valuable in tackling the issues in diagnostic criteria computerization. (2)Formalization method studies on clinical research data. Previous studies investigated the eligibility criteria in clinical trial protocol and developed approaches for eligibility criteria extraction and semantic representation, and used hierarchical clustering for dynamic categorization of such criteria [19]. For example, EliXR provided a corpus-based knowledge acquisition framework that used the Unified Medical Language System (UMLS) to standardize eligibility-concept encoding and to enrich eligibility-concept relations for clinical research eligibility criteria from text [20]. QDM-based phenotyping methods used for identification of patient cohorts from EHR data also provide valuable reference for our work [21].
However, few studies are directly related to building a unified architecture to support the goal of diagnostic criteria formalization. In particular, the lack of a standards-based information model has been recognized as a major barrier for achieving computable diagnostic criteria [22]. Fortunately, current efforts in the development of international recommendation standard models in clinical domains provide valuable references for modeling and representing computable diagnostic criteria. The notable examples include the ICD-11 content model [5,23] and the National Quality Forum (NQF) QDM [21,24,25].

WHO ICD-11 content model
WHO developed a content model to represent the knowledge that underlies the definition of an ICD entity [23]. The content model is composed of three layers: a foundation layer, a linearization layer, and an ontological layer. The foundation layer is the core product of the ICD-11 revision that stores the full range of knowledge of all classification units in ICD. Each ICD entity can be seen from different dimensions. The content model represents each one of these dimensions as a parameter. Currently, there are 13 main parameters defined in the content model to describe a category in ICD-11, as shown in Table 1 [5]. "Diagnostic Criteria" is one of the main parameters for describing an ICD category.

NQF QDM
QDM consists of two modules: a data-model module and a logic module. The datamodel module includes the notions of category (e.g., Medication), datatype (e.g., "Medication, Administered"), attribute (e.g., information about dosage, route, strength, and duration of a medication), and value set comprising concept codes from one or more terminologies. The logic module includes logic operators, functions, comparison operators, temporal operators, subset operators. As mentioned above, the HQMF provides a standard format to render the QDM-based criteria (i.e., instance data) in XML format using a collection of templates. In a previous study [26], we evaluated the feasibility of using QDM for representing diagnostic criteria through a data-driven approach and suggested that common patterns informed by QDM are useful and feasible in building a standards-based information model for computable diagnostic criteria. In this study, we used the common patterns and selected a collection of QDM datatypes and attributes for developing an upper ontology.

System architecture
The overall system architecture for the creation of rule-based clinical diagnosis criteria is shown in Fig. 1. The system architecture contains two major modules: one is an authoring module that utilizes a standards-based information model and the other is a translation module that leverages SWRL. The first module of the architecture contains an upper ontology that supports the element organization of diagnostic criteria. We integrated a collection of manually selected ICD-11 content model elements and QDM elements informed by the analysis of real-world diagnostic criteria [26]. The first module also contains a unified web user interface that supports collecting and authoring diagnostic criteria by clinicians or subject matter experts. Standard QDM model serves as a foundation layer for the downstream translation and reasoning. All collected data elements, value sets and logic expressions of diagnostic criteria are formalized using QDM-based HQMF templates. The second module of the architecture contains a rule translation engine that converts diagnostic criteria represented in QDM/HQMF format into a domain-specific diagnostic criteria ontology and a set of rules using SWRL.

Developing a standard-based diagnostic criteria upper ontology
The purpose here is to integrate existing standard information models relevant to modeling of diagnostic criteria by expert review and manual editing. As previously mentioned, we choose the ICD-11 content model and NQF QDM as reference information models. Our work in this stage is to create the diagnostic criteria upper ontology (DCUO) through the integration of ICD-11 content model with those QDM elements commonly used in diagnostic criteria. We evaluated the distribution of the QDM elements using a collection of textual diagnostic criteria. The selection of these QDM elements was informed by the results from a previous study [26]. We selected 10 QDM datatypes and 4 QDM attributes and integrated them with ICD-11 content model-based ontology schema. Table 2 shows a list of the QDM datatypes and attributes used for the integration. We used Protégé ontology editing environment [27] for manually integrating these two standard information models into a diagnostic criteria upper ontology.
We merged these two information models manually by conducting both concept and property analysis. Specifically, ICD-11 content model contributes a well-defined concept schema with general concepts that provide a systematic view of disease, whereas the QDM contributes more specific elements of disease diagnosis. Therefore, we chose all classes defined in the ICD-11 content model as our fundamental classes of DCUO, and named them by a prefix "ICD:" in conjunction with its original class name. QDM datatypes (with a prefix "QDM") as the subclasses merged with ICD-11 content model classes. The mappings of the merged classes and integrated properties are shown  Table 3. In total, four QDM attributes are integrated into the DCUO as object properties [26]. In addition, we also defined a number of new classes in the DCUO so that they can be used for supporting the SWRL rule construction and reasoning. Currently DCUO includes 3 newly defined classes (with a prefix "DCUO"): Patient, Unit, and Evidence.

Transforming QDM templates into domain-specific diagnostic criteria ontology
The DCUO provides a schema of general diagnostic criteria representation. To build a scalable diagnostic rule translation environment, it is important to dynamically populate the DCUO with domain-specific elements and produce a Diagnostic Criteria Domain Ontology (DCDO) for a specific disease or condition, e.g. the DCDO for AMI (Acute Myocardial Infarction). DCDO extends the DCUO by adding the disease-specific sub-classes, attributes and instances, and the number of elements incorporated in a DCDO depends on the complexity of the diagnosis criteria. For example, the QDM elements extracted from the AMI diagnostic criteria are shown in an Supplementary File [see Additional file 1]. Each QDM element can be represented in a structured format using one of the HQMF templates [see Additional file 2]  To transform each QDM/HQMF template into a DCDO, we designed rule-based pipelines that support data extraction from diagnostic criteria encapsulated by a template. The parsing tool decomposed a HQMF template into different parts and populated the parsed elements into the DCDO that supports for the representation of a particular ICD disease. Therefore, to create a domain-specific DCDO, it requires three kinds of input data: the DCUO, a group of disease diagnosis criteria rendered in the QDM/HQMF templates, and a group of parsing rules for these templates. As previously mentioned, the DCUO is an upper-ontology that we created for general use and is applicable for all diagnostic criteria. And the rule-based HQMF XML template parsing and populating are core tasks in this transforming implementation. For example, Table 4 shows the QDM/HQMF templates we used to develop an AMI DCDO.
Specifically, we analyzed the structure of HQMF templates, and developed a parsing tool to extract elements of each template. Figure 2 shows two HQMF template examples and their parsing rules. The upper part of Fig. 2  And then, we created mapping rules between the extracted XML elements and the elements in the DCUO ontology.  3]. The dynamics of DCDO are reflected by the variables of a template, such as "$valueSetOID", and "displayName". There are also a number of constants that maintain the metadata of each template.

Automatic rule composition and validation
After having a DCDO ontology produced, we developed JAVA-based algorithms using Protégé OWL API and SWRL API for conducting automatic rule composition and rule validation. These APIs are also responsible for rule assembling and rule grammar checking. The SWRL syntax contains two parts: Body and Head. The Body part is also called the antecedent and the Head part is the consequent of the rule. There are 7 atom types that can be used as the components of the Body and Head: class, individual property, same individual, different individual, data valued property, built-in atom and data range [28].
Adhering to the SWRL structure and grammar, we designed a collection of translation algorithms that automatically extract the SWRL rule elements from the logic components of an HQMF XML template and then assemble these rule elements into the SWRL syntax.
For example, Table 6 shows the HQMF XML representation of a QDM-based criterion "Laboratory Test, Result: LDL-c (result < 100 mg/dL)". The criterion is composed by two HQMF templates: the template "Laboratory Test, Result" (hqmf r1 template -2. 16 1.1019.3). Our translation algorithms then automatically extract the SWRL rule elements from the logic components of the two templates and then assemble these rule elements into the SWRL syntax. The translation processing is shown in Table 7. Each rule is generated from an individual criterion. After each rule is composed, our algorithms could aggregate these individual rules with logic operators for the diagnosis reasoning.
We then tested the rule generation algorithms using 15 QDM-based criteria represented in HQMF XML format. All the 15 criteria are selected from existing clinical quality measures [29] that use the HQMF template -"Laboratory Test, Result" (hqmf r1 template -2.16.840.1.113883.3.560.1.12). We used Protégé SWRL API to validate the syntactical correctness of the generated SWRL rules. The first author assessed the semantic correctness of the generated SWRL rules through comparing the HQMF XML-based logic with SWRL rule logic and the assessment results were verified by the other three co-authors. We also evaluated the domain coverage of ICD-11 content model elements in annotating diagnostic criteria. Table 9 shows distribution of annotations based on ICD-11 content model elements. The results showed that Investigation Findings, and Signs and Symptoms are the two most commonly used element types in diagnostic criteria description. The results are consistent with the analysis we did for the QDM elements in a previous study [26]. Step 1: Insert the element "LDL-c" into DCDO as a subclass of the class "Laboratory Test, Result".

DCUO development and evaluation
Step 2: Insert the element "mg/dL" as an individual of the class "Unit", and insert the element "ev1" as an individual of the class "Evidence".
Step 3: The elements extracted from the templates are represented into the following 4 types of the SWRL atom:

Translation algorithms evaluation
To extract elements in each HQMF template, we developed a collection of JAVA-based XML parsing and mapping algorithms. The algorithms automatically extract elements from HQMF templates and convert them into corresponding DCDO elements. As the evaluation results, all 6 HQMF templates are successfully parsed and populated into their corresponding DCDO ontologies. Human-based review confirmed that the elements in the templates are correctly represented in the target ontology.
For the rule generation algorithm evaluation, in total, 15 SWRL rules were generated from 15 QDM/HQMF-based individual criteria. Table 10 shows a list of the 15 QDM/ HQMF-based criteria and the syntactic validation results using the Protégé SWRL validation tool. Of them, 14 rules (93.3 %) passed the rule validation whereas one rule (6.7 %) failed to pass. Human-based review analysis found that the failure was caused by an invalid expression '[copies]/mL' that contains special characters '[' and ']'. Human-based review also confirmed the semantic correctness of all 15 generated rules.

Implementation of a translation pipeline prototype
For implementing our rule generation algorithms, we assembled a computational pipeline (See Fig. 4). A prototype of a web-based interface is developed for the pipeline  as shown in Fig. 5. The prototype allows users to upload a QDM/HQMF file and the DCUO file, and generate a collection of corresponding SWRL rules.

Discussion
There are three main contributions in this study. First, we created two types of ontologies (DCUO and DCDO) to build a flexible knowledge representation framework for representing clinical diagnostic criteria. Second, we developed automatic rule translation methods that are built on standard data model and data structure. This will ensure that the system is extensible and can be used to enable extensive support for representation and computation of diversified diagnostic criteria. Third, system architecture supports reuse of existing standards from the perspectives of information model, terminology services and technical interface.
In this study, we developed a modular architecture to support the authoring and formalization of diagnostic criteria knowledge leveraging Semantic Web OWL and SWRL technologies. Semantic Web technologies provide a scalable framework for standards-based knowledge representation and reasoning and could support the diagnostic criteria representation in a machine-readable and interpretable way. The diagnostic criteria upper ontology and domain-specific ontology are all represented in OWL which provides a foundation for coding SWRL rules. And the rules generated from QDM HQMF-based criteria are formalized and represented in SWRL that leverages the full reasoning power of OWL DL when invoking a rule engine. For the DCUO building module, we integrated the elements of the ICD-11 content model and the QDM model based on the investigation of the concept and attribute distribution of real-world diagnostic criteria. The elements used to build DCUO are evaluated by our analysis on a number of randomly selected diagnostic criteria collected from a variety of domains. Through a manual integration, we built a fundamental knowledge organization schema that can be used to support representing diagnostic criteria for a particular domain. Current editing work is performed in the Protégé editor environment. We plan to release the DCUO ontology through the National Center for Biomedical Ontology BioPortal [30] for the community-based review and feedback.
For the rule translation engine module, the QDM-based HQMF templates are used as an intermediate standard interface that supports the communication between DCUO and those elements (e.g., individuals, standard codes and logic rules) that exist in the HQMF XML diagnostic criteria. All the rule atoms are parsed and extracted from an HQMF XML file and composed into the SWRL rules, and a DCDO is populated from the instance data encapsulated in the HQMF XML file. These generated rules and DCUO could be aggregated and used for reasoning against patient data for a particular disease in the future. Currently, our rules are evaluated on a particular QDM datatype (i.e., Laboratory Test, Result) that involves 6 HQMF templates, and the results show that the accuracy of the translation is above 90 %. Using the same principle, we consider that the HQMF parsing algorithms could achieve similar performance on the QDM datatypes in other QDM categories such as Symptom, Patient Characteristic, Diagnostic Study.
From the perspective of data standardization, following the rationale of the ICD-11 content model, the full range of different values for a given parameter would be predefined using standard terminologies and ontologies. In this study, the QDM-based criteria extracted from the CQMs took advantage of the predefined value sets in the National Library of Medicine Value Set Authority Center (VSAC) 1 . During our translation work, the value set metadata are automatically parsed and extracted from the HQMF XML file. Considering current VSAC does not include all the value sets associated with diagnostic criteria, our system architecture will support the extension of value Fig. 5 A Prototype Implementation of a HQMF to SWRL Translation Pipeline with a Web Interface set definitions in the future. In addition, we also plan to extend the web-based application functionalities as follows: 1) Ontology display and update; 2) Diagnostic criteria authoring by clinicians and domain experts, including value set services invoking and semi-automatic workflow for criteria editing.
There are a number of limitations in this study since the present study is mainly focused the feasibility of our proposed architecture. First, the DCUO (Diagnostic Criteria Upper Ontology) was reviewed for consensus and quality assurance only by a relatively small group (i.e., four authors). In the future, a rigorous ontology evaluation by a panel of experts from relevant domains will be useful in achieving consensus in terms of the vocabulary, syntax, structure, semantics, representation and context of the DCUO. We plan to use ontology evaluation methods as described by Vrandečić [31]. Second, we have not considered all complex conditions and details in the modeling of diagnostic criteria. For instance, the following problems need to be further considered.
In the QDM model, the semantics of some templates are not expressed explicitly. For example, the QDM element 'Patient Characteristic Birth Date' is used to represent the numeric value comparison of the variable "Patient Age" (e.g. <low value='18' unit='a' inclusive='true'/>), assuming the value of the variable "Patient Age" could be derived from the 'Patient Characteristic, Birth Date'. In the present study, we have implemented translation algorithms only on a limited number (n = 6) of HQMF templates and the preliminary evaluation demonstrated that the translation performed is reasonably well. However, in total, there are 186 HQMF templates from diverse domains and these HQMF templates are updated continuously, so maintaining the transportability and reusability of the translation algorithms will be a challenge. For the diagnostic criteria rule generation using SWRL, the inclusion criteria are well supported by the built-in rule grammars, such as: comparison, mathematical functions, booleans, string and date/time. We understand that some of exclusion criteria could not be explicitly expressed in SWRL due to that negated atoms or disjunctions are not supported in SWRL.
In the future, we plan to improve our work from the following aspects: 1) to improve the system functions such as the DCDO enrichment, rule generation and computerized criteria display and execution; 2) to enhance diagnostic criteria annotations using the natural language processing (NLP) so that large amount number of diagnostic criteria text can be transformed into QDM-based structured format in a semi-automatic approach; 3) to execute diagnostic criteria against large scale patient data for target diseases.

Conclusions
In this study, we developed a modular architecture for the creation of rule-based diagnostic criteria. We demonstrated the feasibility of prototyping a number of key components of our proposed architecture for diagnostic criteria knowledge modeling and reasoning. It remains a very complex field to explore and more semantic and syntactic features dealing with complexity of diagnostic criteria need to be further studied. Our efforts provide useful insight into how to develop a scalable, semantic-oriented and standards-based solution to support diagnostic criteria formalization and computerization.