RASMA: A Reverse Search Algorithm for Mining Frequent Subgraphs

Background: Mining frequent co-expression networks enables the discovery of interesting network motifs that elucidate important interactions among genes. Such interaction subnetworks have been shown to enhance the discovery of biological modules and subnetwork signatures for gene expression and disease classification. Results: We propose a reverse search algorithm for mining frequent and maximal subgraphs over a collection of graphs. We develop an approach for enumerating connected edge-induced subgraphs of an undirected graph by using a reverse-search algorithm, and then use this enumeration strategy for mining all maximal frequent subgraphs. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for maximal subgraphs, the proposed algorithm employs several pruning strategies, which substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs. Conclusion: Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that subnetworks that are more frequent are more likely to be enriched with biological ontologies.


Background
Advances in genome technologies allow for the probing of thousands of genes at the same time through the use of mRNA sequencing and gene expression microarrays. Gene expression analysis aims at discovering gene clusters that have similar expression profiles and dysregulated genes that can be used as markers for solving various disease classification tasks. However, research has revealed that genes do not work in isolation and that a single gene often does not have an independent effect on a phenotype; rather, multiple genes interact to control that phenotype. To capture such correlation among genes, gene coexpression networks have been proposed [1]. In coexpression network analysis, a gene expression dataset is represented as a coexpression network where nodes represent genes and a link exists between a pair of genes if they exhibit significant correlation in the microarray analysis [2,3]. Traditionally, a single gene expression dataset was analyzed independently. However, functional annotation and biological inference based on a single gene coexpression dataset have limitations due to experimental noise [2]. Learning from multiple gene expression datasets alleviates the noise problem, so recent research has focused on mining biologically interesting gene coexpression subnetworks from multiple heterogeneous gene expression datasets. A set of genes that have similar expression profiles in multiple experiments is more likely to represent a biological module [1,2]. The integrative analysis of multiple gene expression datasets enables the discovery of significant interactions involved in complex biological processes, and has been employed for functional annotation [1], active module discovery [4], and biomarker discovery [5]. One approach to identifying these co-expression subnetworks is to mine frequent subgraphs over multiple gene expression networks.
Careful study of these frequent subgraphs can lead to the identification of functional modules and the discovery of significant genes interactions playing key roles in complex diseases [6].
Existing algorithms for mining significant patterns from coexpression networks mainly follow network clustering, pattern enumeration approaches, or a combination of both. Lee et al. [1] proposed an approach that constructs an aggregate summary network by combining all the gene coexpression networks and removing coexpression links that appear in a small number of networks. Dense subnetworks are then extracted from the aggregate summary network using a traditional module finding algorithm. Mining dense modules from the summary network results in false positive modules that are highly dense in the summary network but not dense in the individual networks, as the links are scattered across the networks.
Other approaches followed a two-step approach for mining modules from multiple networks. The authors of [2] proposed the CODENSE algorithm, which mines dense modules from the aggregate network and then uses the similarity between edges' occurrences to extract dense coherent modules. The output of the CODENSE algorithm is dense modules whose edges appear in similar graphs. The authors of [7] proposed an approach that first clusters edges that appear in similar graphs. The subnetwork induced by this edge cluster can be disconnected and sparse, and thus the subnetwork is then partitioned into dense subnetworks. Since the edges in the cluster appear in similar sets of graphs, the dense subnetworks extracted from the edge cluster are coherent. In an earlier work, we proposed a similar approach that first mines maximal frequent edgesets and then extracts k-cliques and percolated k-cliques from these edgesets in the aggregate network [8]. Other approaches have been proposed to extract modules from a summary graph constructed by combining the edge topological similarity in the aggregate graph and edge occurrence similarity [9,10].
A key limitation of all two-step methods is the appearance of false-positive or false-negative edges in the network motifs, so frequent subgraph based algorithms are increasingly being used for mining co-expression networks. However, co-expression networks are generally large, sometimes having tens of thousands of vertices, and traditional frequent subgraph mining algorithms [11,12,13,14], which solve the subgraph isomorphism problem exactly, are computationally prohibitive for mining from a collection of co-expression networks. A key observation, that the vertex labels of a co-expression network are distinct because each vertex represents a distinct gene, leads to efficient subgraph mining algorithms for co-expression networks; this was first exploited by Koyuturk et al. [15]. A major challenge for subgraph enumeration algorithms is the efficient traversal of the frequent subgraph enumeration tree while avoiding the generation of duplicate frequent subgraphs. A reverse search algorithm defines a set of rules for generating frequent subgraphs from a parent frequent subgraph. These rules are employed for constructing the subgraph enumeration tree starting from single edges and following a depth-first pattern growth approach. In this paper, we propose a novel reverse search algorithm for enumerating all edge-induced connected subgraphs of a graph. The reverse search utilizes the shortest distance between edges to check for valid subgraph extensions. Building on this enumeration approach, we propose an algorithm for mining all frequent and maximal frequent subgraphs from a graph database. To efficiently mine all maximal frequent subgraphs, we propose two pruning techniques that eliminate futile search subtrees in the frequent subgraph enumeration tree. These pruning strategies result in significant improvement in the running time of the algorithm.
We demonstrate the effectiveness of the proposed algorithms with the pruning strategies on gene coexpression graphs, and show that the proposed algorithm is orders-of-magnitude faster than existing algorithms.

Related Work
Several algorithms have been proposed for mining frequent subgraphs from non-uniquely labeled graphs, where multiple nodes in the same graph can be assigned the same label, e.g., atom labels in chemical compounds. These mining algorithms include mining all frequent subgraphs from multiple graphs [11,12,13,14] and algorithms for mining a summary of all frequent subgraphs, such as mining closed frequent subgraphs [16] and maximal frequent subgraphs [17]. Mining the complete set of frequent subgraphs is computationally intractable for large graph sets and small support thresholds, and sampling methods for mining representative subgraphs have been proposed [18]. Mining subgraphs with certain topological properties has received attention, and several algorithms have been proposed for mining subgraphs that are highly connected in multiple graphs [19,20,21].
A special class of graphs is the graphs with unique-label nodes, e.g., gene coexpression networks, where no two nodes in the same graph have the same label. The problem of mining frequent subgraphs from graphs with unique vertex labels has received less attention. Though possible, it is not computationally efficient to employ existing subgraph mining algorithms designed for the general case to mine uniquely-labeled graphs. The computationally-intensive procedures of subgraph and graph isomorphism are not required for mining uniquely-labeled graphs. Moreover, the tasks of candidate generation and pruning are much simpler for uniquely-labeled graphs.
The backbone of mining subgraph algorithms is the enumeration strategy employed for enumerating all connected subgraphs as potentially all connected subgraphs could be frequent. Frequency and topological constraints are then enforced while enumerating the subgraphs. In sparse graphs, the number of connected subgraphs is much smaller than the number of all subgraphs. Moreover, the number of subgraphs that satisfy the frequency or topological constraints is much smaller than the number of connected subgraphs.
Koyuturk et al. [15] proposed the MULE (Mining Uniquely Labeled Edgesets) algorithm for mining frequent subgraphs. Moreover, an extension to the MULE algorithm was proposed to mine the closed and maximal subgraphs. At the core of the MULE algorithm is an enumeration approach for visiting all connected edge-induced subgraphs of a graph. The enumeration approach in the MULE algorithm visits each subgraph in the enumeration tree only once. It does so by defining the set of candidates to be explored at each step based on the set of edges visited and the current pattern. In the MULE algorithm, at each search node in the search space, the set of subgraphs generated from a given subgraph is not always the set of all supergraphs of that subgraph, because the missing supergraphs would be visited from other subgraphs. Because the frequency constraint satisfies the downward closure (Apriori) property, the minimum support constraint is enforced while traversing the subgraph lattice and a futile search branch is pruned once an infrequent subgraph is encountered. The number of frequent subgraphs in a graph dataset is very large, especially for small support thresholds. This is mainly because all the subgraphs of a frequent subgraph are also frequent (downward closure property). For downstream analysis of these frequent subgraphs, it is often desired to mine a representative set of these frequent subgraphs. Moreover, it can be computationally feasible to mine these representative patterns when mining all frequent subgraphs is not possible.
Several approaches have been proposed to mine a succinct set of frequent subgraphs, including maximal frequent and closed frequent subgraphs [22]. A frequent subgraph that does not have any frequent supergraph is a maximal frequent subgraph. For maximal frequent subgraphs, if a frequent subgraph does not have any frequent supergraph in the enumeration tree, then it is locally maximal frequent. The MULE algorithm checks if the locally maximal subgraph is a subgraph of an already mined maximal frequent subgraph to ensure that it is a maximal subgraph. The set of discovered maximal subgraphs that has to be kept in memory can be very large, and thus checking if a subgraph is a subgraph of an already discovered maximal subgraph can be computationally expensive. Another limitation of the MULE algorithm is that it does not have pruning strategies that eliminate the traversal of search branches that would result in locally maximal subgraphs that are not globally maximal frequent subgraphs. For the special case when the graph dataset has a single graph and a minimum support of 1, the MULE algorithm enumerates all connected edge-induced subgraphs of the single graph, while in fact there is only one maximal subgraph.
Another approach for enumerating all connected subgraphs was proposed in [23]. The main idea of the approach is that for a given edge, the set of all connected subgraphs can be partitioned into two groups: the subgraphs that have the edge, and the subgraphs that do not have the edge. The recursive algorithm has an amortized computation time of O(1) for each subgraph.
Reverse Search is a recent search approach for enumeration problems [24]. The basic idea of reverse search is to arrange all objects to be enumerated in a tree, where each search node has a unique parent node. A major task of a reverse search algorithm is the definition of a parent operation on the sets being enumerated that reduces a node in the tree to its unique parent node [25]. All the objects to be enumerated form an enumeration tree in which tree nodes represent objects and tree edges represent the connections between objects and their corresponding parents. A child operation, defined by inverting the parent operation, determines if an object is a valid child of a given parent object. The enumeration tree is constructed by applying a depth-first traversal, starting from a canonical root and employing the child operation to generate objects. In reverse search, at each object the algorithm identifies a superset of all children of the object, tests the parent operation to confirm whether a member object is a valid child of the parent object, and recursively searches each child [25]. Several reverse search-based algorithms have been proposed for solving traditional enumeration problems, including all induced connected subgraphs, all spanning trees of a graph, all topological orderings of an acyclic graph, all dense subgraphs of a graph, and all maximal independent sets of a graph [24,25].
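The generic scheme described above can be sketched in a few lines of Python. This is a minimal illustration rather than the formulation in [24,25]: the names reverse_search, candidate_children, and parent are hypothetical, and the subset example is a toy chosen only because its parent operation (drop the largest element) is easy to check by hand.

```python
def reverse_search(roots, candidate_children, parent):
    """Visit every object exactly once by inverting a parent operation.

    roots              -- objects that have no parent (assumed given)
    candidate_children -- returns a superset of the children of an object
    parent             -- maps a non-root object to its unique parent
    """
    stack = list(roots)
    while stack:
        obj = stack.pop()
        yield obj
        for child in candidate_children(obj):
            # Keep the candidate only if the parent operation maps it back.
            if parent(child) == obj:
                stack.append(child)


def subsets(n):
    """Toy use: all subsets of {0, ..., n-1}; parent = remove the largest element."""
    def candidates(s):
        return (s | {x} for x in range(n) if x not in s)

    def parent(s):
        return s - {max(s)}

    return list(reverse_search([frozenset()], candidates, parent))
```

Because every non-root object has exactly one parent, no global list of visited objects is needed to avoid duplicates.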
A reverse search algorithm, RS-MST, for enumerating all vertex-induced connected subgraphs has been introduced in [24], where the parent subgraph of a subgraph G is obtained by removing the vertex with the minimum degree in the spanning tree of G. A subgraph resulting from extending a subgraph G with a vertex v is a valid child of G if vertex v is a vertex with the minimum degree in the subgraph formed by adding the vertex v to the subgraph G. A similar approach can be applied for mining all edge-induced subgraphs. For an undirected graph G(V, E) and a connected edge set E_s ⊆ E, a valid extension edge e is an edge connecting to a leaf node in the minimum spanning tree of the subgraph formed by extending G[E_s] with the edge e, i.e., G[E_s ∪ {e}]. For these two reverse search algorithms, finding the MST to check for a valid subgraph extension is a costly operation, considering that some extensions (invalid ones) will not be pursued in constructing the enumeration tree and will not be reported.
In [26], we proposed a reverse search algorithm for enumerating all vertex-induced connected subgraphs of a graph. The parent operation is based on the shortest distance of the newly added vertex to the anchor vertex of the subgraph. The algorithm outperformed existing methods for vertex-induced subgraph enumeration. Moreover, we employed the enumeration approach to mine all maximal cohesive subgraphs from vertex-attributed graphs. Pruning strategies were proposed to prune branches of the enumeration tree that will not result in maximal cohesive subgraphs. The proposed method takes an edge-growth approach to mine all connected frequent edgesets.

Methods
The backbone of the proposed frequent subgraphs mining algorithm is an approach to enumerate all connected edge-induced subgraphs of a single graph. We first explain our enumeration approach for all connected edge-induced subgraphs and then extend this approach to mine all frequent and maximal connected subgraphs.

Enumerating all edge-induced Subgraphs
Let $G = (V, E)$ be an undirected graph, where $V = \{v_1, ..., v_n\}$ denotes the set of vertices, and $E \subseteq V \times V$ is the set of edges. For a vertex $v_i \in V$, $i$ is a unique identifier of that vertex, which is fixed but arbitrarily assigned. For an edge set $E' \subseteq E$, $G[E']$ denotes the subgraph of $G$ induced by $E'$, whose nodes $V'$ include all the endpoints of the edges in $E'$. We call $E'$ a connected edge set if its corresponding edge-induced subgraph $G[E']$ is connected. A connected edge-induced subgraph can be uniquely identified by its corresponding connected edge set, and thus the two terms are used interchangeably. To maintain an edge ordering, an edge between vertices $v_i$ and $v_j$ is denoted as $(i, j)$ where $i < j$. We define a total order relation $\preceq$ on the set of edges in the graph such that $(i, j) \preceq (k, l)$ if $i < k$, or $i = k$ and $j \le l$. The distance between two edges, denoted $d(e_i, e_j)$, in a connected graph is the number of non-terminal vertices in a shortest path between the edges. Using this definition, adjacent edges that share an endpoint have a distance of 1. For an edge $e$, the set of all adjacent edges of $e$ is referred to as the neighborhood of $e$ and is denoted $N(e)$; it is the set of edges at distance 1 from $e$, $N(e) = \{e_i \in E : d(e_i, e) = 1\}$. For an edgeset $S \subseteq E$, the set of neighboring edges in a graph $G(V, E)$, denoted $N(S)$, contains the edges not in $S$ that have at least one neighboring edge in $S$.

The set of all connected edge-induced subgraphs of $G$ is $CEIS(G) = \{S \mid S \subseteq E \text{ and } G[S] \text{ is connected}\}$.
In this paper, we propose a reverse search algorithm for enumerating all connected edge sets of an undirected graph.
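The definitions of this section can be made concrete with a small Python sketch. It assumes edges are stored as tuples (i, j) with i < j, so that Python's built-in tuple comparison coincides with the edge order defined above, and it computes the edge distance by breadth-first search over adjacent edges. All function names are hypothetical helpers introduced for illustration.

```python
from collections import deque


def normalize(u, v):
    """Store an undirected edge as (i, j) with i < j."""
    return (u, v) if u < v else (v, u)


def adjacent(e, f):
    """Two distinct edges are adjacent when they share an endpoint."""
    return e != f and bool(set(e) & set(f))


def neighborhood(e, edges):
    """N(e): all edges at distance 1 from e."""
    return {f for f in edges if adjacent(e, f)}


def edge_distance(src, dst, edges):
    """d(src, dst): BFS where one step moves to an adjacent edge."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        e = queue.popleft()
        if e == dst:
            return dist[e]
        for f in neighborhood(e, edges):
            if f not in dist:
                dist[f] = dist[e] + 1
                queue.append(f)
    return None  # dst is not reachable from src


def set_neighborhood(S, edges):
    """N(S): edges outside S with at least one neighboring edge in S."""
    return {f for f in edges if f not in S and any(adjacent(f, e) for e in S)}
```

On the path 1-2-3-4, for example, edges (1, 2) and (3, 4) are at distance 2, which matches the two non-terminal vertices (2 and 3) on the path between them.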

Search Graph
For a single connected graph, the enumeration of the set of connected edgesets can be represented by a directed search graph in which nodes represent connected edgesets and there is a directed edge between two edgesets, (X, Y ) if Y = X ∪ {e} and the deletion of e from Y keeps X connected. In the search graph, a search node (say, Y ) can have multiple incoming edges as multiple connected edgesets can lead to the same connected edgeset. A naive approach to traverse the entire set of all connected edgesets is to grow an edgeset by extending it with one of its neighbor edges and checking in a global list whether the edgeset has been enumerated before to avoid duplicate listings. Given the combinatorial nature of connected edgesets, this approach is inefficient as it enumerates each edgeset many times, and the number of distinct edgesets grows exponentially with the size of the graph.
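For contrast with the reverse search developed next, the naive traversal can be sketched as follows. This is an illustrative baseline only, with a hypothetical function name: it grows edgesets by one neighboring edge at a time and relies on a global set of already-seen edgesets to discard duplicates.

```python
def naive_connected_edgesets(edges):
    """Enumerate connected edgesets using a global 'seen' set to drop duplicates."""
    edges = set(edges)
    seen = set()
    stack = [frozenset([e]) for e in edges]
    while stack:
        S = stack.pop()
        if S in seen:
            continue  # the same edgeset was reached along another path
        seen.add(S)
        for f in edges - S:
            # f is a neighbor edge of S if it shares an endpoint with some edge in S
            if any(set(f) & set(e) for e in S):
                stack.append(S | {f})
    return seen
```

The memory held by the seen set grows with the number of connected edgesets, which is the cost that a unique-parent enumeration avoids.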

Reverse Search
The algorithm builds and traverses the connected subgraphs search tree wherein nodes in the tree correspond to connected subgraphs and arcs correspond to the parent-children relations between these subgraphs. The arcs in the search tree are defined by a neighborhood function that defines a set of search nodes that can be generated from a search node; this set is referred to as the valid children of a search node. Each edgeset appears only once in the search tree, and there is only one incoming link to each edgeset from its parent edgeset. We enumerate the set of connected edgesets by depth-first traversal of the search tree. In this section, we define the parent operation and a data structure that allows for efficient parent/child operations.

Parent child relationship
A search node Y corresponding to a connected edgeset can be obtained from a unique search node (say X). Then X is called the parent node and Y is called the child node. The edgeset X can be obtained by deleting a specific edge from the edgeset Y .
Lemma 0.1 For a connected edgeset $U$, let $s$ be the smallest edge in $U$, denoted $anchor(U)$, and let $e \in U$ be one of the edges with the longest shortest-path distance from $s$. Then $G[U - e]$ is also connected.
Proof We prove this claim by contradiction. For a connected edgeset $U$, let $e$ be an edge with the longest shortest-path distance from $s$ and, for contradiction, assume that deleting $e$ results in a disconnected graph. This means that there exists at least one edge $e'$ such that every path between $s$ and $e'$, and in particular every shortest path, goes through $e$. Let $p_{se'} = s, \cdots, e, \cdots, e'$ be a shortest path from $s$ to $e'$ and let $w(p_{se'})$ denote the length of the path. Then the shortest distance between $s$ and $e'$, $w(p_{se}) + w(p_{ee'})$, is greater than the shortest distance between $s$ and $e$, i.e., $w(p_{se'}) > w(p_{se})$. This contradicts our assumption that $e$ is the edge with the longest shortest-path distance from $s$ in $U$. Thus, $G[U - e]$ is connected.
For defining the child/parent relation as stated above, we need to designate an edge of the subgraph as the anchor edge. We denote the smallest edge $s \in U$ as $anchor(U)$, i.e., $anchor(U) = s$ such that $s \preceq e_i$, $\forall e_i \in U \setminus s$. Moreover, let $e = utmost(U)$ denote the edge in $U$ with the longest distance to the anchor edge $s = anchor(U)$. If more than one edge is at the longest distance, we take the largest such edge according to the order relation, which is the choice consistent with the formal definition below and with the valid-child rule. For a connected edgeset $U$ with anchor $s$, the utmost edge is defined as follows: $utmost(U) = e$ such that $e \in U \setminus s$ and, $\forall e_i \in U \setminus e$, either $d(e_i, s) < d(e, s)$, or $d(e_i, s) = d(e, s)$ and $e_i \preceq e$.
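The anchor, utmost, and parent operations can be sketched directly from these definitions. The code below is a hedged illustration with hypothetical function names; it recomputes distances by BFS inside G[U] on every call, which is simpler but slower than the array-based bookkeeping described later. Ties at the longest distance are broken toward the larger edge, which is an assumption taken from the valid-child rule and the isValidExtension pseudo-code.

```python
from collections import deque


def distances_from(s, U):
    """BFS distances from edge s to every edge of U, measured inside G[U]."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        e = queue.popleft()
        for f in U:
            if f not in dist and set(e) & set(f):
                dist[f] = dist[e] + 1
                queue.append(f)
    return dist


def anchor(U):
    """Smallest edge of U under the (i, j) tuple order."""
    return min(U)


def utmost(U):
    """Edge farthest from the anchor; ties go to the larger edge (assumption).

    Requires |U| >= 2, since the anchor itself is excluded.
    """
    s = anchor(U)
    dist = distances_from(s, U)
    return max((e for e in U if e != s), key=lambda e: (dist[e], e))


def parent(U):
    """Parent edgeset: remove the utmost edge (stays connected by Lemma 0.1)."""
    return frozenset(U) - {utmost(U)}
```

For the path edgeset {(1, 2), (2, 3), (3, 4)}, the anchor is (1, 2), the utmost edge is (3, 4), and the parent is {(1, 2), (2, 3)}.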

Valid Children
Building on Lemma 0.1, in the search tree, we can expand a parent node $X$ to construct one of its child nodes $Y$. In a given graph $G(V, E)$, the search tree node $X$ corresponds to the connected edgeset $U$ with $s = anchor(U)$ and $e = utmost(U)$. Let $e' \in E \setminus U$ be a neighbor edge of $U$ such that $s \preceq e'$. Then $Y$ is a child of $X$ corresponding to the connected edgeset $U^* = U \cup \{e'\}$ if and only if one of the following conditions holds: (1) the distance from $s$ to $e'$ is greater than the distance from $s$ to $e$, or (2) the distance from $s$ to $e'$ equals the distance from $s$ to $e$ and $e'$ is greater than $e$ in the edge order. This definition of a valid child ensures that the newly added edge of the child node has the longest distance from the anchor and, if multiple edges have the same longest distance to the anchor, the newly added edge is the largest considering the order relation among the edges. The proposed reverse search parent-child relation is the backbone of the enumeration tree neighborhood function, $N : CEIS(G) \to 2^{CEIS(G)}$. The parent-child relation guarantees that a search node appears only once in the range of the neighborhood function.
If U and U ∪ {e′} are edgesets corresponding to a parent and a child node, respectively, we call e′ a valid candidate of U; otherwise, we call it an invalid candidate of U. For an edgeset U, the set of neighboring edges constitutes the candidate edges (valid and invalid). Figure 1(a) shows a sample graph, and Figure 1(b) shows the enumeration tree of the set of all connected edgesets of this graph. Edges are uniquely labeled starting from 1. The edges of an edgeset are written inside the oval shape and the set of candidate edges is written adjacent to the oval shape. Figure 1(b) shows that edgeset U = {1, 3} has {2, 4, 5} as the candidate edges; anchor(U) = 1 and utmost(U) = 3. Edge 2 is not a valid candidate because its distance to edge 1 is the same as the distance of the utmost edge 3 to edge 1 but 2 is less than 3 in the order of the edges; thus the branch corresponding to {1, 3, 2} will not be explored. Edge 4 has the same distance as edge 3 to edge 1, but since 4 is greater than 3, edge 4 is a valid candidate. The distance of edge 5 to edge 1 is larger than the distance of the utmost edge, and thus it is a valid candidate. For the edgeset {2, 5} with candidate set {1, 3, 4}, neither edge 1 nor edge 3 is a valid candidate: edge 1 is less than the anchor 2, and edge 3 has the same distance as edge 5 to edge 2 but edge 3 is less than 5. Edge 4 is a valid candidate because its distance to edge 2 is larger than the distance of the utmost edge 5. For single-edge search nodes in level 1, if the candidate edge is smaller than the anchor edge, then it is an invalid edge.
Enumerating all subgraphs of a single graph
Algorithm 1 shows the pseudo-code for our algorithm. For each edge in the graph, we call EnumerateCEIS, a recursive procedure; the set of neighbors for each edge constitutes the candidate edges. The procedure takes a connected edgeset U, the set of candidate edges, and the last edge added to the edgeset. For each edge e_j in the candidate set (line 7), the procedure checks if the edge is a valid candidate (line 8) for extending U. If so, it updates the candidate set and recursively calls the EnumerateCEIS procedure (lines 9-10). The candidate set can be updated by using the current candidate set and the neighbors of the last added edge, N(e_j).
The isValidExtension procedure (line 15) checks if the edge e j is a valid candidate for the edgeset in E s following the rules in the valid children section.
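The enumeration described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the names enumerate_ceis, build_adjacency, and grow are hypothetical, edges are (i, j) tuples with i < j, and the candidate set is recomputed at each node for brevity instead of being updated incrementally. The distance of a candidate edge to the anchor is taken as one more than the smallest distance among its neighbors already in the edgeset.

```python
def build_adjacency(edges):
    """Map each edge to the set of edges that share an endpoint with it."""
    edges = sorted(edges)
    return {e: {f for f in edges if f != e and set(e) & set(f)} for e in edges}


def enumerate_ceis(edges):
    """Yield every connected edgeset of the graph exactly once (sketch)."""
    adj = build_adjacency(edges)

    def is_valid_extension(dist, s, last, e, d_e):
        if e < s:                      # candidate smaller than the anchor
            return False
        if d_e > dist[last]:           # strictly farther than the utmost edge
            return True
        return d_e == dist[last] and e > last  # tie: must be larger

    def grow(U, dist, s, last):
        yield frozenset(U)
        candidates = set().union(*(adj[x] for x in U)) - U
        for e in sorted(candidates):
            d_e = 1 + min(dist[x] for x in adj[e] if x in U)
            if is_valid_extension(dist, s, last, e, d_e):
                U.add(e)
                dist[e] = d_e
                yield from grow(U, dist, s, e)
                U.remove(e)            # backtrack
                del dist[e]

    for s in adj:                      # each edge is the anchor of one subtree
        yield from grow({s}, {s: 0}, s, s)
```

On a triangle, the sketch reports 7 connected edgesets (three single edges, three pairs, and the full triangle), each once and without any global duplicate check.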

Complexity Analysis
Since the number of reported patterns can be exponential in the number of edges of the graph, we analyze the time the algorithm takes to report a pattern after it has generated the previous pattern. For a polynomial delay algorithm, it takes polynomial time to output an element after generating the previous element [24]. We use an array-based implementation in which we maintain the set of edges of an edgeset, the candidate edges and the distance of the candidate edges to the anchor edge. Using this data structure, the anchor edge, utmost edges, and the distance of an edge to the anchor edge can be accessed in constant time. The algorithm checks if an edge is a valid candidate of the edgeset in a constant time O(1) (Algorithm 1 line 8). If the edge is a valid candidate, then the candidate set can be updated in O(|N (e j )|) time for the last added edge e j . Once a valid candidate is encountered, the recursive procedure is called. Therefore, each connected edgeset can be enumerated with linear delay O(|E|).
An algorithm is output polynomial if it outputs all the elements to be enumerated in time polynomial in the number of elements. Since the proposed algorithm takes linear time for each connected edgeset, it is output (or total) polynomial in the number of connected edgesets.
The algorithm explores the search tree in a depth first manner, which ensures that the space used is bounded by the depth of the search tree, which is at most |E|. We use three arrays, each of size |E| to keep track of which edges are in the connected edgeset, their neighbors, and their distances to the anchor edge. So, the depth first search of the enumeration tree can be done with linear space in the depth of the enumeration tree which is O(|E|).

Mining Frequent Subgraphs
In many applications, we have a dataset of graphs and the goal is to extract significant subgraphs. In the frequent subgraph mining problem, the goal is to mine subgraphs that appear in at least a user-defined minimum threshold of the graphs. In this work, we are only concerned with connected frequent subgraphs.
Graph Dataset Let $\mathcal{G} = \{G_1, G_2, \cdots, G_n\}$ denote a set of $n$ undirected graphs. An undirected graph $G_i = (V, E_i)$ is a tuple where $V = \{v_1, v_2, \cdots, v_k\}$ denotes the set of vertices and $E_i \subseteq V \times V$ denotes the set of edges of the corresponding graph. All the graphs are defined over the same set of vertices.

Algorithm 1 Mining All Connected Edge-Induced Subgraphs
[...]
7: for ej ∈ C do
8:   if isValidExtension(Es, ej, el) then
[...]
17: if ej < s then
18:   return False
19: end if
20: if distance(ej, s) > distance(x, s) then
21:   return True
22: end if
23: return distance(ej, s) = distance(x, s) and ej > x
24: end function
In this work, we represent the dataset $\mathcal{G}$ of $n$ graphs as an edge-attributed graph, $G(V, E, f)$, where $V$ is the set of vertices, $E$ is the set of all the edges in the graph dataset, and $f$ is an edge attribute function. The edge attribute function maps each edge to the set of graphs in which it appears. The set of all edges is the union of the sets of edges in each graph, $E = \bigcup_{i=1}^{n} E_i$,
where $E_i$ is the set of edges in $G_i \in \mathcal{G}$. We label the edges in the edge-attributed graph with unique identifiers $\{1, 2, \cdots, |E|\}$. Figure 2 shows a toy graph dataset of four graphs in (a) and the corresponding edge-attributed graph in (b). The supporting graphs of a subgraph $G_s$ are the graphs in the dataset that contain it, $sup(\mathcal{G}, G_s) = \{G_i \mid G_s \subseteq G_i, G_i \in \mathcal{G}\}$. When the graph dataset is clear from the context, we refer to the supporting graphs as $sup(G_s)$. The cardinality of the supporting graphs is referred to as the support of the subgraph, i.e., $|sup(G_s)|$.
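A minimal Python sketch of the edge-attributed representation follows. Because vertex labels are unique and shared across graphs, an edgeset occurs in a graph exactly when every one of its edges does, so the supporting graphs are the intersection of the per-edge graph sets. The function names and the toy data in the usage note are hypothetical and are not the dataset of Figure 2.

```python
def edge_attributed_graph(graphs):
    """Build f: edge -> set of indices of the graphs that contain the edge."""
    f = {}
    for i, edges in enumerate(graphs):
        for u, v in edges:
            e = (u, v) if u < v else (v, u)   # undirected: store endpoints in order
            f.setdefault(e, set()).add(i)
    return f


def supporting_graphs(edgeset, f, n):
    """sup(E_s): indices of graphs containing every edge of the edgeset."""
    sup = set(range(n))
    for e in edgeset:
        sup &= f.get(e, set())
    return sup
```

For three toy graphs [[A-B, B-C], [B-A], [A-B, B-C]], edge (A, B) is supported by all three graphs, while the edgeset {(A, B), (B, C)} is supported only by the first and third. Adding an edge can only shrink the intersection, which is the anti-monotone behavior the mining algorithm relies on.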

Mining Frequent Subgraphs
Frequent Subgraph Given a graph dataset $\mathcal{G}$ and a user-specified support threshold $S_{min}$, a subgraph $G_s$ is called frequent if its support is equal to or greater than the support threshold, i.e., $G_s$ is a frequent subgraph if $|sup(\mathcal{G}, G_s)| \ge S_{min}$.
Since an edge-induced subgraph is uniquely identified by the edgeset, we use frequent subgraphs and frequent edgesets interchangeably.
Problem Definition: Given a graph dataset $\mathcal{G}$ and a support threshold $S_{min}$, the problem of mining the set of frequent subgraphs is to enumerate the set $F = \{G_{s_1}, G_{s_2}, \cdots, G_{s_m}\}$ such that every $G_{s_i} \in F$ is a frequent connected subgraph, i.e., $|sup(G_{s_i})| \ge S_{min}$. For the graph dataset in Figure 2(a), the set of frequent subgraphs for a minimum support of 3 is shown in Figure 2(c). Given a minimum support threshold $S_{min}$, the anti-monotone support constraint guarantees that if a subgraph $G_s$ is frequent, then each of its subgraphs is also frequent, i.e., $|sup(G_s)| \ge S_{min} \implies |sup(G^*)| \ge S_{min}$ for all $G^* \subset G_s$.
Our proposed algorithm for mining all frequent subgraphs employs the reverse search enumeration approach in Algorithm 1 to enumerate all connected subgraphs while enforcing the support constraint. The anti-monotone property of the support of a subgraph is employed in the mining algorithm to prune search branches when an infrequent subgraph is encountered. If an infrequent subgraph is encountered, then the recursive procedure EnumerateFCIS is not called and the search subtree rooted at this infrequent subgraph is not enumerated. The enumeration tree for the set of frequent subgraphs is shown in Figure 3(b).
The algorithm for mining frequent subgraphs is shown in Algorithm 2. In line 1, infrequent edges are pruned, and the recursive EnumerateFCIS procedure is called for each frequent edge (Line 3). The recursive procedure follows the same steps as the enumeration approach in algorithm 1, except for the if statement in line 9 to ensure that only search branches rooted at frequent subgraphs are explored. The recursive procedure is called only from frequent children (line 11). Therefore, only frequent subgraphs will be added to the set of frequent subgraphs in line 6.
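The steps just described can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: mine_frequent and grow are hypothetical names, graphs are given as lists of vertex pairs over shared unique labels, and candidates are recomputed at each node for brevity. The support of a child is obtained by intersecting the parent's supporting graphs with those of the added edge.

```python
def mine_frequent(graphs, s_min):
    """Return {frozenset(edges): support} for all frequent connected edgesets."""
    f = {}                                   # edge -> set of graph indices
    for i, edges in enumerate(graphs):
        for u, v in edges:
            f.setdefault((min(u, v), max(u, v)), set()).add(i)
    # Prune infrequent edges before the search starts.
    f = {e: g for e, g in f.items() if len(g) >= s_min}
    adj = {e: {x for x in f if x != e and set(x) & set(e)} for e in f}
    result = {}

    def grow(U, dist, sup, s, last):
        result[frozenset(U)] = len(sup)
        for e in sorted(set().union(*(adj[x] for x in U)) - U):
            d_e = 1 + min(dist[x] for x in adj[e] if x in U)
            # Valid child: not below the anchor, and farther than (or tied with
            # but larger than) the last added edge.
            valid = e > s and (d_e, e) > (dist[last], last)
            child_sup = sup & f[e]
            # Recurse only into valid children that are still frequent.
            if valid and len(child_sup) >= s_min:
                U.add(e)
                dist[e] = d_e
                grow(U, dist, child_sup, s, e)
                U.remove(e)
                del dist[e]

    for s in sorted(f):
        grow({s}, {s: 0}, f[s], s, s)
    return result
```

With the hypothetical dataset [[A-B, B-C, C-D], [A-B, B-C], [A-B, B-C, C-D]] and a minimum support of 2, the sketch reports six frequent connected edgesets; raising the threshold to 3 leaves only {A-B}, {B-C}, and {A-B, B-C}.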

Algorithm 2 Mining All Connected Frequent Subgraphs
Input: a graph dataset, G, and a minimum support threshold, Smin
Output: F, all frequent subgraphs
[...]
7: for e ∈ C do
8:   if isValidExtension(Es, e, el) then
9:     if |sup(Es)| ≥ δ then
[...]

Mining Maximal Frequent Subgraphs
Because of the downward closure property of frequent subgraphs, where all the subgraphs of a frequent subgraph are frequent, there is high overlap between frequent subgraphs. A representative set of all frequent subgraphs is a concise summarization of the frequent patterns in the dataset. We thus propose an algorithm for mining maximal frequent subgraphs. A maximal frequent subgraph is a frequent subgraph that does not have any frequent supergraph, i.e., $G[E_s]$ is maximal frequent if there is no subgraph $G[E^*] \supset G[E_s]$ such that $|sup(G[E^*])| \ge S_{min}$. All frequent subgraphs can be extracted from the set of maximal frequent subgraphs, since all subgraphs of a maximal frequent subgraph are frequent. However, the exact frequency (support) of the frequent subgraphs cannot be obtained from the maximal subgraphs. Due to the combinatorial nature of frequent subgraphs, the set of maximal frequent subgraphs is much smaller than the set of all frequent subgraphs.
Problem Definition: Given a graph dataset $\mathcal{G}$ and a support threshold $S_{min}$, the problem of mining the set of maximal frequent subgraphs is to enumerate the set $M = \{G_{s_1}, G_{s_2}, \cdots, G_{s_m}\}$ such that every $G_{s_i} \in M$ is a maximal frequent connected subgraph.
For the graph dataset of four graphs shown in Figure 2(a) and a minimum support S_min = 3, there are two maximal frequent subgraphs; they are drawn inside dotted circles in Figure 2(c).
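The definition of maximality can be made concrete with a short sketch. It assumes that the complete collection of frequent connected subgraphs is already available and that each subgraph is represented by its edge set; the function name and the toy collection are hypothetical and do not reproduce Figure 2.

```python
def maximal_only(frequent_sets):
    """Keep the edge sets that have no proper superset in the collection."""
    sets = [frozenset(s) for s in frequent_sets]
    # `s < other` is Python's proper-subset test on frozensets.
    return {s for s in sets if not any(s < other for other in sets)}

# Hypothetical collection of frequent connected edge sets.
frequent = [
    {("A", "B")},
    {("B", "C")},
    {("A", "B"), ("B", "C")},
    {("C", "D")},
]
maximal = maximal_only(frequent)
```

Here the two single edges (A, B) and (B, C) are absorbed by their union, while (C, D) has no frequent supergraph and is therefore maximal on its own. The supports of the absorbed subgraphs are not recoverable from the result, which matches the limitation noted above.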
In the enumeration tree for mining frequent subgraphs, every leaf search node is potentially a maximal frequent subgraph. A leaf may fail to be maximal because it can have a frequent supergraph that was not explored from it, since the corresponding extension is not a valid extension at that stage of the enumeration tree. An algorithm for mining all maximal frequent subgraphs is therefore to traverse the frequent subgraph enumeration tree and to report the subgraphs that do not have any frequent extension, valid or invalid. This algorithm is a straightforward extension of Algorithm 2. Following this mining approach, the enumeration tree for maximal frequent subgraphs would look like the tree in Figure 3(b). We would need to enumerate all 20 frequent subgraphs to get the two maximal subgraphs. Enumerating the search tree of frequent subgraphs is computationally expensive, especially for low minimum support thresholds, when the search tree becomes very large. A more efficient approach is to mine the set of all maximal frequent subgraphs without enumerating the whole frequent subgraph enumeration tree. In the following subsections, we develop pruning strategies that eliminate the need to traverse some search branches without sacrificing the completeness of the results. In the experiments section, we demonstrate how the proposed pruning strategies result in a significant performance improvement.
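The unpruned approach described above can be sketched as follows. This is only an illustration under simplifying assumptions: nodes carry unique labels across graphs, a visited set stands in for the reverse-search validity test, and every adjacent frequent edge is tried as an extension. The function names and the toy dataset are hypothetical. The sketch also returns the number of frequent subgraphs visited, to show that all of them are enumerated even when few are maximal.

```python
from itertools import chain

def support_size(edge_set, graphs):
    """Number of graphs that contain every edge in edge_set."""
    return sum(1 for g in graphs if edge_set <= g)

def mine_maximal_naive(graphs, s_min):
    """Visit every frequent connected edge set; report those with no frequent extension."""
    edges = sorted(set().union(*graphs))
    frequent_edges = [e for e in edges if support_size({e}, graphs) >= s_min]
    visited, maximal = set(), set()
    stack = [frozenset([e]) for e in frequent_edges]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        nodes = set(chain.from_iterable(current))
        is_maximal = True
        for e in frequent_edges:
            # Skip edges already present or not adjacent to the subgraph.
            if e in current or not (e[0] in nodes or e[1] in nodes):
                continue
            child = current | {e}
            if support_size(child, graphs) >= s_min:
                is_maximal = False  # a frequent extension exists
                stack.append(child)
        if is_maximal:
            maximal.add(current)
    return maximal, len(visited)

# Hypothetical toy dataset: four graphs over the nodes A, B, C, D.
graphs = [
    {("A", "B"), ("B", "C"), ("C", "D")},
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("B", "C"), ("A", "C")},
    {("A", "B"), ("C", "D")},
]
maximal, visited_count = mine_maximal_naive(graphs, s_min=2)
```

Extensions by infrequent edges are never tried, because such an extension cannot be frequent. Even on this small input, every frequent subgraph is visited in order to report the maximal ones, which is the cost the pruning strategies aim to avoid.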

Consumed by a sibling
In a given graph G(V, E) and a connected frequent edgeset S ⊆ E and the corresponding edge-induced frequent subgraph G[S], let e i and e j be two valid candidate edges of G[S] such that e i is closer to anchor(S) than e j (e i ≺ S e j ), and these two extensions generate two frequent subgraphs, G

Level One Pruning
Pruning at level one (single edges) is similar to pruning at any other search node in the search tree. For any two edges e_i and e_j that share a common endpoint, where e_i is smaller than e_j in the order of edges, if the set of supporting graphs of e_j is a subset of the supporting graphs of e_i (sup(e_j) ⊆ sup(e_i)), then the search tree rooted at e_j can be safely pruned. The proof follows the same steps as in Lemma 0.2. In Figure 3, the search subtree rooted at (A, C) is pruned because the set of supporting graphs of (A, B) is a superset of the supporting graphs of (A, C). Similarly, the three search subtrees rooted at (A, D), (B, C), and (B, D) are all safely pruned.
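The level-one rule can be written as a short filter over the frequent edges. The sketch below is an illustration only: it assumes that Python's tuple ordering stands in for the order of edges, and the support sets are hypothetical values that do not reproduce Figure 2 or Figure 3.

```python
def level_one_roots(edge_support):
    """Return the frequent edges whose search subtrees still need to be explored.

    edge_support maps an edge (u, v) to the set of ids of the graphs that
    contain it. Edges are compared with tuple order (an assumption).
    """
    edges = sorted(edge_support)
    roots = []
    for j, e_j in enumerate(edges):
        covered = any(
            set(e_i) & set(e_j)                          # common endpoint
            and edge_support[e_j] <= edge_support[e_i]   # sup(e_j) is a subset of sup(e_i)
            for e_i in edges[:j]                         # only smaller edges e_i
        )
        if not covered:
            roots.append(e_j)
    return roots

# Hypothetical supports over four graphs with ids 0..3.
edge_support = {
    ("A", "B"): {0, 1, 2, 3},
    ("A", "C"): {0, 1, 2},
    ("B", "C"): {1, 2, 3},
    ("C", "D"): {0, 1, 3},
}
roots = level_one_roots(edge_support)
```

In this toy input, (A, C) and (B, C) are covered by (A, B), which shares an endpoint with each and appears in every graph that contains them. The edge (C, D) remains a root because no smaller adjacent edge has a support set that contains its own.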

Algorithm
Algorithm 3 shows the pseudocode for the proposed algorithm. The algorithm follows the enumeration approach for mining frequent subgraphs and employs the pruning strategies to avoid visiting subtree branches that will not result in maximal frequent subgraphs. In line 1, frequent edges are extracted, and in lines 3-7 a search subtree is expanded from each frequent edge. Frequent edges that are covered by a smaller neighboring edge are not explored (line 4). The neighbors of edge e constitute the candidate edges. In the MineMaximalSubgraph procedure, for each edge in the candidate edges C, if the extension would generate a frequent subgraph, then we set the maximal flag to false, indicating that the current subgraph is not a maximal frequent subgraph. Next, for each valid candidate, we check whether the extension is covered by a previous extension according to the two pruning strategies, and we recursively call the procedure only for valid children that are not covered (lines 14-19). We add the current pattern to the set of maximal frequent subgraphs if the maximal flag is still true (line 24).
maximal ← true
10: for

Results
We tested the performance of the proposed algorithm on mining frequent and maximal subgraphs from gene coexpression networks. Moreover, to investigate the impact of the pruning techniques, we compared the running time of the algorithm with and without the pruning techniques. All experiments were performed on a machine with an Intel Xeon 2.40 GHz processor and 16 GB of main memory, running the Linux operating system. The algorithms were implemented in C++, and the MULE implementation was in C.

Performance on real data
We tested the proposed algorithm on 35 tissue gene coexpression networks constructed by the Gene Genetic Network Analysis Tool [27]. The coexpression networks were inferred from Genotype-Tissue Expression (GTEx) data [1]. Each coexpression network is constructed from the gene expression of non-diseased tissue samples. On average, there are 14,415 coexpression links (edges) in each network among 9,998 genes. In total, there are 1,548,622 unique coexpression edges that appear in at least one coexpression network. Among these edges, 4,127 appear in at least 20 networks, and on average each edge appears in 3.28 networks.
Table 1 shows how the numbers of frequent and maximal subgraphs (|F| and |M|) and the running times of MULE and RASMA for mining the frequent and maximal subgraphs vary with the minimum support threshold. For mining the maximal subgraphs, the proposed algorithm is orders of magnitude faster than the MULE algorithm for low support thresholds. The MULE algorithm is much slower for mining maximal frequent subgraphs since it has to enumerate the same enumeration tree as for mining all frequent subgraphs. Moreover, for each potential maximal subgraph, the MULE algorithm checks whether it has a supergraph in a global list. For mining all the frequent subgraphs, both algorithms have similar running times, and for a support threshold of 15 neither finished the mining task in two days.
Table 2 shows the topological properties of the reported subgraphs and the running times of RASMA for lower support thresholds. The number of maximal subgraphs (|M*|) increases for lower support, and so do the average numbers of edges (|E|) and nodes (|V|) and the density (Density). For calculating the topological properties of the maximal frequent subgraphs, only subgraphs with at least three edges (denoted |M*|) are considered, since a large percentage of the maximal frequent subgraphs have only one or two edges.

Effectiveness of Pruning Techniques
The pruning techniques aim at reducing the number of search nodes explored to report the maximal patterns. The closer to the root the pruning occurs, the more search nodes are eliminated. To investigate how the proposed pruning techniques improve the performance of the algorithm, we show a comparison with the pruning techniques disabled. Table 4 shows the impact of the pruning techniques on the running time and on the number of frequent subgraphs explored. Although the frequent subgraph algorithm and the maximal subgraph algorithm without pruning enumerate the same frequent subgraph search tree, it is important to notice that the maximal subgraph algorithm without pruning is much slower than mining all frequent subgraphs. This is because, to mark a subgraph as maximal, all immediate potential children have to be checked for frequency (line 12 in Algorithm 3), regardless of whether the extension is a valid child. Therefore, the number of frequency checks is much larger than the number of frequent search nodes explored. In contrast, the algorithm for frequent subgraphs checks whether an extension is frequent only for the valid children (lines 8-9 in Algorithm 2). The algorithm with pruning strategies traverses only a very small fraction (0.000028 for support = 16) of the frequent subgraphs while maintaining the completeness of the maximal frequent subgraphs.

Analysis of Maximal Frequent Subgraphs
We performed a biological enrichment analysis for the gene sets (nodes) of the maximal frequent subgraphs. A biological annotation is said to be enriched in a gene set if a significant subset of the genes in the gene set are annotated with the given annotation. We tested the overrepresentation of genes with a specific annotation in a gene set using the hypergeometric test (with p-value = 0.01). We used multiple annotation databases from the Molecular Signatures Database (MSigDB) [28,29] for assessing the enrichment of the genes in the reported patterns with these annotations. Table 3 shows the percentage of subgraphs that are biologically enriched for each of the three biological signatures. Some patterns are enriched with several signatures, and some signatures are enriched in the genes of multiple subgraphs. Moreover, the reported subgraphs are enriched with a large number of biological annotations for each of the databases. Only maximal frequent subgraphs with at least three edges were considered for the analysis (denoted as |M*| in the table). The enrichment analysis shows that frequent subgraphs are highly enriched with known biological annotations.
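For reference, the overrepresentation test can be sketched with the standard upper-tail hypergeometric probability, using only the Python standard library. This is a generic formulation, not the exact analysis pipeline used here; the function name, parameter names, and example counts are hypothetical.

```python
from math import comb

def hypergeom_pvalue(population, annotated, sample, overlap):
    """Upper-tail probability P(X >= overlap) of the hypergeometric distribution.

    population: number of background genes
    annotated:  background genes carrying the annotation
    sample:     size of the gene set being tested
    overlap:    genes in the gene set carrying the annotation
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 4 of 10 genes in a subgraph carry an annotation
# that is present on 50 of 1000 background genes.
p = hypergeom_pvalue(population=1000, annotated=50, sample=10, overlap=4)
is_enriched = p <= 0.01
```

An annotation would be called enriched in the gene set when the resulting probability does not exceed the chosen threshold, 0.01 in the setting above.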
An annotation can be enriched in many reported gene sets. We sorted the annotations by the number of subgraphs in which they are enriched. Table 5 shows the biological signatures that were enriched in the largest number of reported patterns for S_min = 20.

Conclusion
Frequent coexpression subnetworks have been shown to be effective in functional annotation and subnetwork biomarker discovery. We proposed a reverse search algorithm for mining maximal frequent subgraphs. We first proposed a reverse search strategy for enumerating all edge-induced subgraphs from a single graph. The enumeration approach is then employed for mining frequent and maximal subgraphs. To eliminate search branches that will not result in maximal frequent patterns, we proposed pruning strategies that exploit the order in which branches are enumerated. The pruning strategies are possible because the reverse search enforces a strict order in which search nodes are enumerated. Experiments on gene coexpression datasets demonstrate the effectiveness of the proposed approaches. The proposed approach is thousands of times faster than the existing algorithm. Enrichment analysis of the gene sets in the maximal frequent subgraphs reveals that maximal frequent coexpression subnetworks are enriched with known biological annotations.

Ethics approval and consent to participate
Not applicable

Consent for publication
Not applicable

Availability of data and material
The dataset and the implementation of the algorithm are available at http://www.cs.ndsu.nodak.edu/~ssalem/multirelation.html.