We aim to learn rules similar in form to (1), except each condition on the left-hand side will correspond to an assumed ontological term. Thus the logical conjunction will simply correspond to a set of terms. Rules will be searched only for the positive class, as any example not classified as positive is deemed negative by default (we work in the binary classification setting). Thus the class symbol in all rules will indicate the positive class, and we can drop the right-hand side of rules. Therefore, a rule in our context is simply a set of terms.
Our goal is to find a set of rules that fit a supplied training set well, as described above in the context of the CN2 algorithm. To this end, we introduce a special refinement operator that, owing to the taxonomic nature of the assumed conditions, significantly reduces the search space of rules and consequently reduces the run time of the rule learner compared to the traditional refinement operator, without a loss of accuracy. For example, if term t1 is in the rule and the ontology prescribes that t2 is more general than t1, then adding t2 to the rule is obviously useless. We can thus safely prune from the search space all rules combining t1 and t2.
Technically, the proposed ontology-based refinement operator uses two reduction procedures: Redundant Generalization, which omits candidate rules based on the generalization-specialization relation, and Redundant Non-potential, which omits candidate rules that cannot improve classification accuracy.
Problem formalization
To describe our rule-learning algorithm in detail, we first define a few formal concepts. We are given
-
Two sets E+,E− of positive and negative (respectively) examples.
-
A set T of ontological terms with a partial order ≽ which encodes the “more general than" relation. For example, with t1=biological process and t2=developmental process, we have t1≽t2.
-
An annotation function M which maps each example to a subset of T, i.e. M:E+∪E−→2T.
From M, we can derive a reverse mapping M′:T→2E producing the set of examples annotated with a given term, i.e. M′(t)={e∈E:t∈M(e)}. It is also useful to define the transitive closure S(t) of M′(t) as the set of all examples annotated by t or any term less general than t, i.e.
$$ S(t) = \bigcup_{t' \in T,\; t \succeq t'} M'(t') $$
(2)
If t is the only term in a rule, then S(t) is the set of all examples for which the rule predicts the positive class. S(t) is also called the cover of the rule. More generally, for a rule conjoining an arbitrary set R⊆T of terms, we define the cover function as
$$ \Theta(R) = \bigcap_{t \in R} S(t) $$
(3)
Finally, we define a generality relation ≽r on rules. Let R1,R2⊆T, then R1≽rR2 if and only if Θ(R1)⊇Θ(R2).
Example 1
Consider three hypothetical examples and seven ontology terms as shown in Fig. 1. The term generality relation ≽ corresponds to the direction of the edges, from more to less general. Here we have M(e1)={t4}, M(e2)={t5,t6}, M(e3)={t2}. M′(t) is shown above each term's box. Finally, S(t)=M′(t) for t∈{t4,t5,t6}, but e.g. S(t1)=M′(t1)∪M′(t4)={e1}.
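The mappings above can be sketched in a few lines of Python. The annotations are those of Example 1; since Fig. 1 itself is not reproduced here, only the t1 ≽ t4 edge stated in the example is encoded, so the `children` table is a deliberately partial, assumed fragment of the ontology.

```python
# Sketch of the mappings M, M' and S, and the cover function Θ (Eqs. 2-3).
# Annotations come from Example 1; the `children` table is an assumed,
# partial fragment (only the t1 ≽ t4 edge is stated in the text).
M = {"e1": {"t4"}, "e2": {"t5", "t6"}, "e3": {"t2"}}
children = {"t1": {"t4"}}  # strictly-more-specific successors of each term

def descendants(t):
    """All terms t' with t ≽ t', including t itself (reflexive closure)."""
    seen, stack = set(), [t]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(children.get(u, ()))
    return seen

def M_rev(t):
    """M'(t): examples annotated directly with t."""
    return {e for e, terms in M.items() if t in terms}

def S(t):
    """Transitive cover: examples annotated by t or any less general term."""
    return set().union(*(M_rev(u) for u in descendants(t)))

def cover(R):
    """Θ(R): intersection of S(t) over all terms of the rule."""
    sets = [S(t) for t in R]
    return set.intersection(*sets) if sets else set()
```

With these definitions, S("t1") reproduces the S(t1)={e1} value worked out in Example 1.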
Proposed algorithm
The algorithm proposed in this work induces a hypothesis from data in the form of a set of rules. To induce a hypothesis consisting of multiple rules, we apply a covering algorithm that originates in the AQ family of algorithms [29] and is also used in CN2. The covering algorithm consists of two steps: (1) induce a single rule from the current set of examples, (2) exclude the examples covered by this rule from the current set of examples. These two steps are applied iteratively, starting with the set of all examples, until all positive examples are covered or a given number of induced rules is reached. This process is described in Algorithm 1, which we refer to as sem1R. As input, the following data are required: a set of positive examples E+ and negative examples E−, a set of ontologies \(\mathcal {O}\), and the maximal size k of the set of induced rules. The output is a set of induced rules. The induceSingleRule function returns the best rule according to the selected evaluation function; it is described in Algorithm 2, and all evaluation functions can be found in the “Evaluation criteria” section.
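The covering loop of Algorithm 1 can be sketched as follows; `induce_single_rule` is a hypothetical stand-in for Algorithm 2 that is assumed to return a rule together with its cover.

```python
# Minimal sketch of the covering loop of Algorithm 1 (sem1R).
# `induce_single_rule` stands in for Algorithm 2 and is assumed to
# return a (rule, covered_examples) pair for the current example sets.
def sem1R(E_pos, E_neg, induce_single_rule, k):
    rules, remaining = [], set(E_pos)
    while remaining and len(rules) < k:
        rule, covered = induce_single_rule(remaining, E_neg)
        if not covered & remaining:      # no progress: stop early
            break
        rules.append(rule)
        remaining -= covered             # step (2): drop covered positives
    return rules
```

A toy `induce_single_rule` that covers one positive example per call terminates after one rule per remaining positive example.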
Contrary to CN2, the sem1R algorithm has access to relations over terms that are explicitly specified in the provided ontologies. Intuitively, exploiting this kind of knowledge should bring benefits during rule induction because the structure of the terms is known. The main benefits are speeding up rule induction and removing obvious redundancy between the terms in rules. This was the main motivation for the following reduction procedures.

Reduction procedures
In this section, we formulate two procedures that significantly reduce the rule space in comparison with traditional rule learning methods such as CN2.
Redundant generalization
This reduction method eliminates terms occurring in a rule that are more general than some other term of the rule. Such terms do not affect the set of examples covered by the rule and consequently do not change its predictions. Evidently, the set of covered examples is determined only by the most specific terms' example sets according to the mapping S.
Theorem 1
Let R1 be a rule and suppose that terms t1,t2∈R1 where t1 is more general than t2. Then the rule R1 covers the same set of examples as the rule \(\overline {R1} = R1 \backslash \{t1\}\) that does not contain t1:
$$\Theta(\overline{R1}) = \Theta(R1)$$
and the rule R1 is called a redundant generalization of \(\overline {R1}\).
Proof
For simplicity, we take into consideration only rules of cardinality 1. In this case, the mapping S can be seen as the cover operator Θ, because Θ only intersects the sets of examples given by S. A rule of cardinality 1 will also be denoted by its term, so that we need not distinguish between the relation over terms and the relation over rules: for such rules, the ≽ relation over terms is equivalent to the ≽r relation over rules. This simplification does not lose generality.
A term cannot be associated with more examples than any of its more general counterparts; the examples associated with a more specific term form a subset of the examples associated with a more general term, written as t1≽t2⇒S(t2)⊆S(t1) for t1,t2∈T. Now, let rule R1={t1,t2} consist of two terms such that t1≽t2, and let rule \(\overline {R1} = \{t2\}\) consist of only the term t2. Then R1 covers the same set of examples as \(\overline {R1}\):
$$\Theta(R1) = S(t1) \cap S(t2) = S(t2) = \Theta(\overline{R1}),$$
where the middle equality holds because S(t2)⊆S(t1).
□
Example 2
Consider the ontology O and mappings M,M′,S from Example 1. Let rule R1={t0,t2}; term t0 is more general than t2 (t0≽t2), and this rule covers examples e1,e2,e3 because Θ(R1)=Θ({t0,t2})=S(t0)∩S(t2)={e1,e2,e3}. Now consider the rule \(\overline {R1} = \{t2\}\), which also covers examples e1,e2,e3 since \(\Theta (\overline {R1}) = S(t2) = \{e1, e2, e3\}\). As we can see, the term t0 in rule R1 does not influence the set of covered examples, so rule R1 covers the same set of examples as rule \(\overline {R1}\). For this reason, R1 is a Redundant Generalization and \(\overline {R1}\) is not.
To obtain a rule that is not a Redundant Generalization, i.e. a rule where the relation ≽ does not hold between any two of its terms, we apply the Redundant Generalization procedure until no ≽ relation between terms of the rule can be found. As Example 2 shows, this reduction procedure decreases the cardinality (length) of rules.
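Assuming the generality relation is available as an ancestor table, the Redundant Generalization reduction can be sketched as follows; the `ancestors` entries are illustrative and cover only the edges discussed in the examples, not the full Fig. 1 ontology.

```python
# Sketch of the Redundant Generalization reduction (Theorem 1): drop
# every term of a rule that is more general than some other term of the
# same rule. The `ancestors` table is an assumed, partial fragment.
ancestors = {"t2": {"t0"}, "t4": {"t1", "t0"}}

def is_more_general(a, b):
    """True iff a ≽ b with a ≠ b, per the ancestor table."""
    return a in ancestors.get(b, set())

def remove_redundant_generalizations(rule):
    """Keep only terms not strictly more general than another rule term."""
    return {t for t in rule
            if not any(is_more_general(t, other) for other in rule if other != t)}
```

Applied to the rule {t0, t2} of Example 2, the function removes t0 and leaves the non-redundant rule {t2}.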
Redundant non-potential
In the previous case, the Redundant Generalization method reduces the rule space through its ability to decrease the cardinality of rules. Specifically, this reduction is applied within the refinement operator that gradually extends rules by adding new terms: the Redundant Generalization method generates fewer candidate rules because terms that are in a generality relation with another term of the rule are not appended to the refined rule.
Contrary to the previous method, the Redundant Non-potential method does not utilize relations among terms to reduce the rule space; instead, it compares rules with each other and removes rules that cannot reach a higher quality value than the current best rule. The ability to recognize non-potential rules can be used both for a direct reduction of the rule space and for eliminating candidate rules during rule refinement. First, we define two types of evaluation function: Q, evaluating the quality of a rule based on the numbers of covered/uncovered examples, and Qp, evaluating the maximum quality the rule could possibly achieve over its future refinements. Examples of Q functions are given in Eqs. 11, 13, and 15; the corresponding Qp functions are given in Eqs. 12, 14, and 17. For the moment, it suffices to say that Qp expresses an upper bound on rule quality. This bound can be computed because rule refinements can only shrink the set of examples the rule covers; the best potential refinement therefore keeps all positive examples of the current cover while ceasing to cover all of its negative examples. A Redundant Non-potential rule and all its more specific rules can be safely disregarded in the single rule induction process, because these rules are guaranteed not to exceed the upper bound on rule quality represented by Qp.
To illustrate, consider an arbitrary rule R1 and a more specific rule R2 (R1≽rR2) created by an application of the refinement operator; R2 covers a subset of the examples covered by R1 (Θ(R2)⊆Θ(R1)). Unfortunately, ACC and F1-score are not monotone functions, meaning that it is not guaranteed that R2 always has a higher ACC or F1-score than R1. Nor can R2 be safely pruned from the rule space merely for scoring worse than R1, because further refinements of R2, which are more specific than R2, could still achieve a higher score than R1. To prune the rule space safely, we maintain the upper bound of rule quality Qp. Given this, if rule R2 (a refinement of R1) has a lower Qp value than R1's value of Q, then R2 is a Redundant Non-potential and this rule, along with all its more specific extensions/refinements, can be safely pruned from the rule space.
Theorem 2
Let \(\mathcal {R} = <R, \succeq _{r}>\) be a quasi-ordered set representing a rule space, where R={R1,R2,Rbest}. The binary relation ≽r is defined on R1 and R2 as ≽r={(R1,R2)}, meaning that R2 is more specific than R1; the relation of Rbest is disregarded and may be arbitrary. If the potential quality Qp of the rule R1 is smaller than the quality Q of the rule Rbest, then the rule R1 and all its more specific rules, i.e. R2, can be pruned from the set of rules R and thus from the rule space \(\mathcal {R}\). The rules R1 and R2 are then called Redundant Non-potentials.
Proof
First, suppose that the target class is represented by the positive examples. Second, suppose an evaluation function whose highest value is attained when all positive examples and none of the negative examples are covered; ACC and F1-score are examples of such functions. Note that ACC is given by (TP+TN)/(TP+TN+FP+FN) (see the “Evaluation criteria” section). The reason why refinement affects only TP and not TN is simple: an example classified as TP has to be covered by the rule, whereas an example classified as TN does not. Since we focus on the target class, a refined rule reaches the highest possible score when it covers the same set of positive examples as the original rule and no longer covers any negative example; Qp evaluates the rule in exactly this best case, so no refinement can exceed it. Consequently, if Qp(R1)<Q(Rbest), neither R1 nor any of its refinements can surpass Rbest. □
Example 3
Consider the ontology O and mappings M,M′,S from Example 1, and two rules R1={t2} and R2={t3}. Further, we define a set of positive examples E+={e1,e3} and a set of negative examples E−={e2}. Firstly, we evaluate the quality of the rules according to ACC measure (see Eq. 11)
$$\begin{array}{@{}rcl@{}} {Q_{ACC}(R1) = \frac{TP + TN}{TP + TN + FP + FN} = \frac{2 + 0}{2 + 0 + 1 + 0} = \frac{2}{3}} \end{array} $$
(4)
$$\begin{array}{@{}rcl@{}} Q_{ACC}(R2) = \frac{TP + TN}{TP + TN + FP + FN} = \frac{0 + 0}{0 + 0 + 1 + 2} = 0 \end{array} $$
(5)
Now, we compute a potential quality score of R2 (see Eq. 12):
$$\begin{array}{@{}rcl@{}} Q_{p\_ACC}(R2) = \frac{TP + TN + FP}{TP + TN + FP + FN}= \frac{0 + 0 + 1}{0 + 0 + 1 + 2} = \frac{1}{3} \end{array} $$
(6)
Evidently, the potential quality of R2 is smaller than the quality of R1 so we can exclude the rule R2 and all its more specific rules (e.g. {t5,t6}) from the rule space. Note that an example of how to compute evaluation measures can be found in the next section.
To achieve the most effective pruning of the rule space, we store the value of the highest-quality rule discovered during the learning process in the \(\mathbbm {R}_{BEST\_SCORE}\) variable, see Algorithm 2. If the potential quality Qp(R) of the currently examined rule R is less than \(\mathbbm {R}_{BEST\_SCORE}\), then the rule R and all its more specific rules are Redundant Non-potential and can be excluded from the rule space.
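For the ACC criterion, this pruning test can be sketched directly from Eqs. 11 and 12:

```python
# Sketch of the Redundant Non-potential test (Theorem 2) under ACC:
# a rule is pruned when its upper bound Q_p falls below the best score.
def q_acc(tp, tn, fp, fn):
    """Eq. 11: plain accuracy of the rule."""
    return (tp + tn) / (tp + tn + fp + fn)

def qp_acc(tp, tn, fp, fn):
    """Eq. 12: best reachable score, every FP re-classified as TN."""
    return (tp + tn + fp) / (tp + tn + fp + fn)

def is_redundant_non_potential(tp, tn, fp, fn, best_score):
    return qp_acc(tp, tn, fp, fn) < best_score
```

With the counts of Example 3, R2 (TP=0, TN=0, FP=1, FN=2) has an upper bound of 1/3, below R1's quality of 2/3, so R2 is pruned.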
Evaluation criteria
It is necessary to know the quality of each rule because the rule with the highest value is needed for the final hypothesis. We define three evaluation functions: accuracy (ACC), F1-score (F1), and area under the ROC curve (AUC), together with their adjusted versions that evaluate the potentially best result the current rule can achieve after future refinements. Accuracy works well for balanced problems, where the number of positive examples is similar to the number of negative ones and both classes are equally important. F1 and AUC help when dealing with imbalanced classes; F1 puts more emphasis on the positive class.
First of all, we define four elements of confusion matrix: number of true positives (TP), number of false positives (FP), number of false negatives (FN), and number of true negatives (TN) examples that are covered by an arbitrary rule R, see Fig. 2.
TP is given as the cardinality of the intersection of the set of examples covered by the rule R and the set of positive examples E+. FP is given as the cardinality of the intersection of the set of examples covered by the rule R and the set of negative examples E−. TN is given as the cardinality of the set difference of the set of negative examples E− and the set of examples covered by the rule R. Finally, FN is given as the cardinality of the set difference of the set of positive examples E+ and the set of examples covered by the rule R. All equations are shown below.
$$\begin{array}{@{}rcl@{}} TP = |\Theta(R) \cap E^{+}| \end{array} $$
(7)
$$\begin{array}{@{}rcl@{}} FP = |\Theta(R) \cap E^{-}| \end{array} $$
(8)
$$\begin{array}{@{}rcl@{}} TN = |E^{-} \backslash \Theta(R)| \end{array} $$
(9)
$$\begin{array}{@{}rcl@{}} FN = |E^{+} \backslash \Theta(R)| \end{array} $$
(10)
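Eqs. 7-10 translate directly into set operations; a minimal sketch:

```python
# Confusion-matrix entries of Eqs. 7-10, computed from the cover Θ(R)
# and the example sets; the set operators mirror the definitions exactly.
def confusion(cover, E_pos, E_neg):
    tp = len(cover & E_pos)   # Eq. 7: covered positives
    fp = len(cover & E_neg)   # Eq. 8: covered negatives
    tn = len(E_neg - cover)   # Eq. 9: uncovered negatives
    fn = len(E_pos - cover)   # Eq. 10: uncovered positives
    return tp, fp, tn, fn
```

With the data of Example 3 (cover {e1, e2, e3}, E+={e1, e3}, E−={e2}) this yields TP=2, FP=1, TN=0, FN=0.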
Corresponding accuracy (ACC) of an arbitrary rule R can be computed by the widely known equation below:
$$\begin{array}{@{}rcl@{}} Q_{ACC}(R) = \frac{TP + TN}{TP + TN + FP + FN} \end{array} $$
(11)
However, the potentially highest accuracy of a rule refined from R is computed differently. In Eq. 11, the achieved accuracy is driven by the numerator (TP and TN), whereas the denominator only normalizes the value. A refinement can improve rule quality at most by re-classifying the examples counted as FP into TN, i.e. the numerator of \(Q_{p\_ACC}\) may at best be the sum of TP, TN, and FP. The equation for the potentially highest quality reachable through refinement follows:
$$\begin{array}{@{}rcl@{}} Q_{p\_ACC}(R) = \frac{TP + TN + FP}{TP + TN + FP + FN} \end{array} $$
(12)
The computation of \(Q_{p\_ACC}\) in Eq. 12 assumes that the rule R aims to cover positive examples rather than negative ones. In other words, examples that are covered by the rule R are classified as positive. Secondly, we propose another evaluation measure based on the F1-score, which implicitly does not take the number of TNs into account. Its common form is depicted in Eq. 13.
$$\begin{array}{@{}rcl@{}} Q_{F1}(R) = \frac{2 \times TP}{2 \times TP + FP + FN} \end{array} $$
(13)
The corresponding potential version of the F1 measure, i.e. the best F1-score of a rule created by applying the refinement operator to rule R, takes the following form:
$$\begin{array}{@{}rcl@{}} Q_{p\_F1}(R) = \frac{2 \times TP}{2 \times TP + FN} \end{array} $$
(14)
where, in comparison with Eq. 13, all negative examples covered by rule R (FP) are excluded from the denominator, since there is still the possibility of finding a rule which covers all examples counted as TP and none of the FPs.
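Eqs. 11-14 can be written as plain functions of the confusion-matrix counts; a minimal sketch:

```python
# The four evaluation functions of Eqs. 11-14; each Q_p variant is the
# upper bound reachable by further refinement of the rule.
def q_acc(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)          # Eq. 11

def qp_acc(tp, tn, fp, fn):
    return (tp + tn + fp) / (tp + tn + fp + fn)     # Eq. 12: FP -> TN

def q_f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)              # Eq. 13

def qp_f1(tp, fn):
    return 2 * tp / (2 * tp + fn)                   # Eq. 14: FP dropped
```

For the counts TP=2, TN=0, FP=1, FN=0 of the running example, q_acc gives 2/3 while both upper bounds reach 1.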
Example 4
Consider the ontology O and mappings M,M′,S from Example 1, and a set of positive (E+) and negative (E−) examples from Example 3. Further, we define a rule R={t2}. First of all, we find examples that are covered by the rule using Θ operator, i.e. Θ({t2})=S(t2)={e1,e2,e3}. Secondly, we compute TP, FP, FN and TN:
$$TP = |\Theta(R) \cap E^{+}| = |\{e1, e2, e3\} \cap \{e1, e3\}| = 2$$
$$FP = |\Theta(R) \cap E^{-}| = |\{e1, e2, e3\} \cap \{e2\}| = 1$$
$$TN = |E^{-} \backslash \Theta(R)| = |\{e2\} \backslash \{e1, e2, e3\}| = 0$$
$$FN = |E^{+} \backslash \Theta(R)| = |\{e1, e3\} \backslash \{e1, e2, e3\}| = 0$$
Finally, we substitute these numbers in Eqs. 11 and 12:
$$Q_{ACC}(R) = \frac{TP + TN}{TP + TN + FP + FN} = \frac{2 + 0}{2 + 0 + 1 + 0} = \frac{2}{3}$$
$$Q_{p\_ACC}(R) = \frac{TP + TN + FP}{TP + TN + FP + FN}= \frac{2 + 0 + 1}{2 + 0 + 1 + 0} = 1$$
The final ACC of the rule R over the set of positive and negative examples is \(\frac {2}{3}\), and the potentially best ACC for this rule and this set of examples is 1, since the single false positive may still be re-classified as a true negative by a future refinement.
Finally, let us express rule quality in terms of AUC. The area under the curve can be computed easily: since only a single rule is taken into consideration, its quality is determined by a single point in the ROC plot, and the area can be computed as a sum of the areas of two triangles and one rectangle using Eq. 15.
$$\begin{array}{@{}rcl@{}} Q_{AUC}(R) = \frac{FPR \times TPR}{2} + (1-FPR) \times TPR + \frac{(1-FPR) \times (1-TPR)}{2} \end{array} $$
(15)
TPR (true positive rate) and FPR (false positive rate) are calculated as follows:
$$\begin{array}{@{}rcl@{}} TPR=\frac{TP}{TP+FN}, FPR=\frac{FP}{FP+TN} \end{array} $$
(16)
$$\begin{array}{@{}rcl@{}} Q_{p\_AUC}(R) = TPR + \frac{(1-TPR)}{2} \end{array} $$
(17)
The adjusted version of AUC, computing the potentially best AUC a rule can achieve, is shown in Eq. 17. In contrast to Eq. 15, \(Q_{p\_AUC}\) supposes that FPR goes to zero.
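Both AUC variants can be sketched directly: `q_auc` sums the triangle below the single ROC point, the rectangle beside it, and the triangle above it, and `qp_auc` is the FPR → 0 limit of that area.

```python
# Single-point ROC area (the "two triangles and one rectangle" view)
# and its potential version with FPR driven to zero.
def q_auc(tpr, fpr):
    """Area under the ROC path (0,0) -> (fpr,tpr) -> (1,1)."""
    return fpr * tpr / 2 + (1 - fpr) * tpr + (1 - fpr) * (1 - tpr) / 2

def qp_auc(tpr):
    """Best reachable AUC when refinement removes all false positives."""
    return tpr + (1 - tpr) / 2
```

As a sanity check, any point on the diagonal (TPR = FPR) yields the random-classifier area 0.5, and `q_auc(tpr, 0.0)` coincides with `qp_auc(tpr)`.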
Feature construction
In the “Problem formalization” section, we defined the rule space \(\mathcal {R}\) as a quasi-ordered set expressed as a pair of a set of rules and the relation ≽r between rules. The form of rules is determined by propositional logic; more precisely, a rule is restricted to a conjunction of positive terms, i.e.
$$R=t1 \wedge t2 = \{t1,t2\}, \quad t1,t2 \in T.$$
The first step in the rule learning process is feature construction, because rule learning employs features as its basic building blocks. In this work, features are constructed trivially from the set of terms T of the ontology O: each ontology term corresponds to one feature.
Feature selection
Oftentimes, the constructed feature set is extremely large and redundant, since it contains many features that are not associated with any example. For this reason, a feature selection method is highly recommended. We propose three feature selection methods.
FS_atLeastOne
The first feature selection method excludes from the constructed feature set those terms that are not associated with at least one example from the set E+∪E−. In other words, it removes terms that are too specific or do not cover any example. This method guarantees that the removed terms cannot positively affect the final evaluation score of a rule, because these terms cover an empty set of examples; if such a term appeared in a rule, the rule would cover an empty set of examples.
FS_onlySig
The second feature selection method preserves only features whose terms are significant. P-values are calculated using the Likelihood Ratio Statistic (LRS) as presented in [20]. For the two-class problem, the LRS measures the difference between two distributions: the positive and negative class probability distribution within the set of covered examples, and the distribution over the whole example set. It is computed as follows:
$$\begin{array}{@{}rcl@{}} LRS(r) = 2 \times \left(TP \times log_{2} \frac{\frac{TP}{TP+FP}}{\frac{TP+FN}{|E|}} + FP \times log_{2} \frac{\frac{FP}{TP+FP}}{\frac{FP+TN}{|E|}} \right) \end{array} $$
(18)
This measure is distributed approximately as the χ2 distribution with 1 degree of freedom for two classes. If the LRS is above a specified significance threshold, the term is considered significant.
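A sketch of the LRS computation, following the textual definition (class distribution within the covered examples, i.e. over TP and FP, versus the whole example set); treating zero counts as contributing nothing is a common convention that the text leaves implicit.

```python
from math import log2

# LRS sketch: compares the class distribution inside the covered
# examples (TP, FP) with the overall class distribution over |E|.
# Zero-count terms are taken to contribute 0 (assumed convention).
def lrs(tp, fp, fn, tn):
    n = tp + fp + fn + tn          # |E|
    cov = tp + fp                  # number of covered examples
    def part(obs, cls_total):
        if obs == 0:
            return 0.0
        return obs * log2((obs / cov) / (cls_total / n))
    return 2 * (part(tp, tp + fn) + part(fp, fp + tn))
```

When the covered examples mirror the overall class distribution, the statistic is zero; any divergence makes it positive.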
FS_sigAtLeastOne
The third feature selection method combines the two previous ones. A term that belongs to the feature set has to satisfy two conditions: 1) the term covers at least one example, and 2) the term is significant as measured by the LRS, or it is a generalization of some significant term. Its selectivity will be experimentally verified later.
Rule construction
Rule construction is the second step which aims to find a rule that optimizes a given quality criterion in the search space of rules.
The algorithm for single rule learning is depicted in Algorithm 2, whose input is a set of positive examples E+, a set of negative examples E−, a set of ontologies \(\mathcal {O}\), a function buildMapping that creates a link between the ontologies and the set of examples E (E=E+∪E−), and a parameter k that represents the maximal length of induced rules. Note that the buildMapping function is defined manually by the user. The first step in Algorithm 2 is to find all features. This operation is represented by the function featureConstruction at line 4, which assigns all terms from the set of ontologies \(\mathcal {O}\) to a set of features \(\mathbbm {F}\). To remove irrelevant features from \(\mathbbm {F}\), we apply the function featureSelection at line 5; the three feature selection methods described in the “Feature selection” section are provided, i.e. FS_atLeastOne, FS_onlySig, and FS_sigAtLeastOne.
The main part of this algorithm is presented in lines 8-24. In this while loop, candidate rules are gradually refined until the maximal length of the rule is reached (the variable l represents the current rule length) or there is nothing left to refine, i.e. the algorithm did not create any new rule in the previous iteration. In the for loop (lines 11-21), new candidate rules are generated by applying the refinement operator to the corresponding parental rules. The algorithm iterates over each rule in the set of rules \(\mathbbm {R}\). To this rule, we apply the new ontology-based refinement operator, represented at line 12 by the function refineRule, which uses the Redundant Generalization and Redundant Non-potential reduction procedures. Similar to the traditional CN2 refinement operator, the ontology-based refinement operator appends a feature to the refined rule. For example, in the case of a conjunction of terms R={t1,t2,t3}, a new rule is created as the union of the term t4 and the terms of rule R, i.e. R_new={t1,t2,t3}∪{t4}. The new refinement operator requires the following inputs: the rule r to refine, a set of features \(\mathbbm {F}\), an ontology \(\mathcal {O}\) providing information about relationships, the score of the best rule \(\mathbbm {R}_{BEST\_SCORE}\) discovered so far, the sets of positive and negative examples E, and the mapping M′ that represents the connection between ontologies and examples. The operator returns the set of all refined rules that are neither Redundant Generalizations nor Redundant Non-potentials and assigns them to the newCandidates set.
The refineRule function described in Algorithm 3 starts with an empty set \(\mathbbm {S}\) whose content is returned at the end of the function at line 10. The cycle from lines 3 to 6 appends every feature to the rule to be refined; up to this point, the algorithm is identical to the traditional refinement operator. Then, all rules that are Redundant Generalizations are excluded from the set \(\mathbbm {S}\), using the ontology \(\mathcal {O}\) that provides relationships among terms; this is done by calling the function removeRedundGeneralizations at line 8. The function removeRedundNonPotentials removes the rules that satisfy the definition of Redundant Non-potential rules. It continuously checks the following: 1) \(R \succeq _{r} s\) for every \(s \in \mathbbm {S}\), which holds since each element s is created as a refinement of the rule R; 2) for each s, if its potential quality Qp(s) is less than the quality \(Q(\mathbbm {R}_{BEST})\), then s and all its more specific rules are removed from the set \(\mathbbm {S}\). In other words, only the rules in \(\mathbbm {S}\) whose potential quality can exceed the quality of the best rule \(\mathbbm {R}_{BEST}\) are preserved.
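A condensed sketch of refineRule with both reductions folded into the refinement loop; `more_general` and `qp` are assumed helpers standing in for the ontology and the evaluation machinery.

```python
# Condensed sketch of refineRule (Algorithm 3): extend rule R with each
# feature, then drop Redundant Generalizations and Redundant
# Non-potentials. `more_general(a, b)` encodes a ≽ b (a ≠ b) and
# `qp(rule)` the quality upper bound; both are assumed callbacks.
def refine_rule(R, features, more_general, qp, best_score):
    out = []
    for f in features:
        if f in R:
            continue
        cand = R | {f}
        # Redundant Generalization: skip if one term generalizes another
        if any(more_general(a, b) for a in cand for b in cand if a != b):
            continue
        # Redundant Non-potential: skip if the bound cannot beat the best
        if qp(cand) < best_score:
            continue
        out.append(cand)
    return out
```

With toy callbacks, a feature that generalizes an existing term and a feature whose bound is below the best score are both filtered out, and only viable refinements survive.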
All candidate rules generated by the refineRule function are assigned to the set of new rules \(\mathbbm {R}_{new}\). In addition, all newCandidates are evaluated by the function evaluateCandidate, and their quality scores are compared to the highest quality stored in \(\mathbbm {R}_{BEST\_SCORE}\). If a compared rule has a better quality, and is also significant (significance is computed with the LRS, as in feature selection), then this rule is assigned to the \(\mathbbm {R}_{BEST}\) variable and its score is stored in the \(\mathbbm {R}_{BEST\_SCORE}\) variable.
At the end of the algorithm, the best of all rules discovered is returned. If the function filterRules at line 22 were omitted, Algorithm 2 would become a brute-force exhaustive search exploring the whole search space, which leads to a combinatorial explosion. For this reason, an appropriate heuristic should be provided for reducing the search space. In this work, we use beam search, which expands only the most promising rules based on the evaluation function; other rules are disregarded.
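The filterRules step amounts to keeping the b highest-scoring candidates for the next refinement round; a minimal beam-filter sketch, with `score` an assumed rule-quality callback:

```python
# Beam-search filter sketch for filterRules: keep only the beam_width
# best-scoring candidate rules; all others are disregarded.
def filter_rules(candidates, score, beam_width):
    return sorted(candidates, key=score, reverse=True)[:beam_width]
```

Widening the beam trades run time for a more exhaustive exploration; beam width 1 degenerates to greedy hill climbing, while an unbounded beam recovers the exhaustive search discussed above.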
