RASMA: a reverse search algorithm for mining maximal frequent subgraphs

Background Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs. Maximal frequent subgraphs are a representative set of frequent subgraphs; a frequent subgraph is maximal if it does not have a supergraph that is frequent. In the bioinformatics discipline, methodologies for mining frequent and/or maximal frequent subgraphs can be used to discover interesting network motifs that elucidate complex interactions among genes, reflected through the edges of the frequent subnetworks. Further study of frequent coexpression subnetworks enhances the discovery of biological modules and biological signatures for gene expression and disease classification.

Results We propose a reverse search algorithm, called RASMA, for mining frequent and maximal frequent subgraphs in a given collection of graphs. A key innovation in RASMA is a connected subgraph enumerator that uses a reverse-search strategy to enumerate connected subgraphs of an undirected graph. Using this enumeration strategy, RASMA obtains all maximal frequent subgraphs very efficiently. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for the maximal frequent subgraphs, RASMA employs several pruning strategies that substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs.

Conclusion Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that subnetworks that are frequent are highly enriched with known biological ontologies.


Background
Advances in genome technologies allow thousands of genes to be probed at the same time through mRNA sequencing and gene expression microarrays. Gene expression analysis on such data is then used to discover gene clusters that have similar expression profiles. Such analysis can also be used to identify dysregulated genes that serve as markers in various disease classification tasks.
However, research has revealed that genes do not work in isolation: a single gene often does not have an independent effect on a phenotype; rather, multiple genes interact to control that phenotype. Gene coexpression networks can be used to capture such correlation among genes [1]. Given a gene expression dataset, a coexpression network is built in which the nodes represent genes and a link exists between a pair of genes if the corresponding genes exhibit significant correlation in the microarray analysis [2,3]. Traditionally, gene expression datasets are analyzed independently. However, functional annotation and biological inference based on a single gene coexpression dataset are limited by experimental noise [2]. To alleviate experimental noise, multiple gene expression datasets can be analyzed concurrently in a single study. Recent research has therefore focused on mining biologically interesting gene coexpression subnetworks from multiple heterogeneous gene expression datasets.
A set of genes that have similar expression profiles in multiple experiments is more likely to represent a biological module [1,2]. The integrative analysis of multiple gene expression datasets enables the discovery of significant interactions involved in complex biological processes, and has been employed for functional annotation [1], active module discovery [4], and biomarker discovery [5]. One approach to identifying these coexpression subnetworks is to mine significant subgraphs over multiple gene expression networks. Careful study of these significant subgraphs can lead to the identification of functional modules and the discovery of interesting gene interactions that play key roles in complex diseases [6].
Existing algorithms for mining significant subgraphs from coexpression networks mainly follow network clustering [1], approximate and frequent subgraph enumeration approaches [2,7], or a combination of both. A subgraph that appears in at least a user-defined threshold number of the graphs is called a frequent subgraph. A frequent subgraph that is not a subgraph of any larger frequent subgraph is called a maximal frequent subgraph. Mining all frequent and maximal frequent subgraphs is challenging because coexpression networks are generally large, sometimes having tens of thousands of vertices. On such large graphs, various algorithmic steps of traditional frequent subgraph mining algorithms [8][9][10], such as candidate generation and pruning and graph and subgraph isomorphism testing, are not efficient.
A special class of graphs is the class of graphs with unique-label nodes, e.g., gene coexpression networks, where no two nodes in the same graph have the same label. For such uniquely-labeled graphs, the computationally-intensive procedures of subgraph and graph isomorphism are not required. Moreover, the tasks of candidate generation and pruning are much simpler for uniquely-labeled graphs. The problem of mining frequent subgraphs from graphs with unique vertex labels has received less attention. One of the early algorithms for mining frequent subgraphs from graphs with unique labels is MULE (Mining Uniquely Labeled Edgesets) by Koyuturk et al. [11]. In the experiments section, we compare our proposed algorithm with MULE.
In this paper, we propose a novel reverse search algorithm for enumerating all edge-induced connected subgraphs of a graph. The reverse search utilizes the shortest distance between edges to check for valid subgraph extensions. Building on this enumeration approach, we propose an algorithm for mining all frequent and maximal frequent subgraphs from a graph database in which the vertices of each graph have distinct labels. To efficiently mine all maximal frequent subgraphs, we propose two pruning rules that eliminate futile search subtrees in the frequent subgraph enumeration tree. These pruning strategies result in significant improvement in the running time of the algorithm. We demonstrate the effectiveness of the proposed algorithms with the pruning strategies on gene coexpression graphs, and show that the proposed algorithm is orders of magnitude faster than existing algorithms.

Related work
The backbone of frequent subgraph mining algorithms is the enumeration strategy employed for enumerating all connected subgraphs as potentially all connected subgraphs could be frequent. Frequency and feature constraints (e.g., similar node labels) are then enforced while enumerating the subgraphs. In sparse graphs, the number of connected subgraphs is much smaller than the number of all subgraphs. Moreover, the number of subgraphs that satisfy the frequency or feature constraints is much smaller than the number of connected subgraphs.
Koyuturk et al. [11] proposed the MULE (Mining Uniquely Labeled Edgesets) algorithm for mining frequent subgraphs of a given collection of graphs, G. A subgraph is frequent if the number of graphs it appears in, referred to as its support, is at least a user-specified minimum number of graphs. Moreover, an extension to the MULE algorithm was proposed to mine the closed and maximal frequent subgraphs. A closed frequent subgraph is a frequent subgraph that does not have a supergraph with the same supporting graphs. A maximal frequent subgraph does not have any frequent supergraph. At the core of the MULE algorithm is a depth-first enumeration approach based on backtracking for visiting all connected edge-induced subgraphs of a graph. The enumeration approach in the MULE algorithm visits each subgraph in the enumeration tree only once. A subgraph is only extended with edges in the candidate edgeset. The set of candidate edges for a subgraph is defined based on the set of edges visited and the current edges in the subgraph. In the MULE algorithm, the set of subgraphs generated from a given subgraph at a search node is not always the set of all its supergraphs, because the missing supergraphs are visited from other subgraphs.
Because the frequency constraint satisfies the downward closure property, the minimum support constraint is enforced while traversing the subgraph lattice and a futile search branch is pruned once an infrequent subgraph is encountered. The downward closure property guarantees that all supergraphs of an infrequent subgraph are infrequent. The number of frequent subgraphs in a graph dataset is very large, especially for small support thresholds. For downstream analysis of these frequent subgraphs, it is often desirable to mine a representative set of these frequent subgraphs. A representative set is a subset of the frequent subgraphs such that every frequent subgraph not in the representative set is similar to (has high overlap with) at least one subgraph in the representative set. Mining a set of representative subgraphs is suitable when it is computationally infeasible to mine all frequent subgraphs. Several approaches have been proposed to mine a succinct set of frequent subgraphs, including maximal frequent and closed frequent subgraphs [7]. To highlight the challenges of mining all frequent subgraphs, we ran the MULE algorithm on a dataset of 35 graphs, used in the experiments (10,000 nodes, average number of edges 145,114). The MULE algorithm takes hours to generate all frequent subgraphs, depending on the minimum support threshold employed. Moreover, the MULE algorithm generates millions of frequent subgraphs, while the number of maximal subgraphs is in the thousands.
For maximal frequent subgraphs, if a frequent subgraph does not have any frequent supergraph in the enumeration tree, then it is locally maximal frequent. The MULE algorithm checks if the locally maximal frequent subgraph is a subgraph of an already mined maximal frequent subgraph to ensure that the locally maximal frequent subgraph is indeed a maximal frequent subgraph. The set of discovered maximal frequent subgraphs that has to be kept in memory can be very large and thus checking if a subgraph is a subgraph of an already discovered maximal frequent subgraph can be computationally expensive. Another limitation of the MULE algorithm is that it does not have pruning strategies that eliminate the traversal of search branches that would result in locally maximal frequent subgraphs that are not globally maximal frequent subgraphs. For the special case when the graph dataset has a single graph, and minimum support of 1, the MULE algorithm enumerates all frequent subgraphs of the single graph while in fact there is only one maximal frequent subgraph that is the graph itself.
Another approach for enumerating all connected subgraphs was proposed in [12]. The main idea of the approach is that for a given vertex, the set of all connected induced subgraphs can be partitioned into two groups: the subgraphs that have the vertex, and the subgraphs that do not have the vertex. The recursive algorithm has an amortized computation time of O(1) for each vertex-induced subgraph. The bound for the algorithm in [12] is an amortized one, whereas our proposed algorithm has a linear delay. The algorithm in [12] can be adapted to solve the edge-induced subgraph enumeration problem; however, it is not clear whether the adapted algorithm would retain an amortized computation time of O(1).
Reverse search is a search approach for enumeration problems [13]. The basic idea of reverse search is to arrange all objects to be enumerated in a tree, where each search node has a unique parent node. A major task of a reverse search algorithm is the definition of a parent operation on the sets being enumerated that reduces a node in the tree to its unique parent node [14]. All the objects to be enumerated form an enumeration tree in which the tree nodes represent the objects and the tree edges represent the connections between objects and their corresponding parents. A child operation, defined by inverting the parent operation, determines if an object is a valid child of a given parent object. The enumeration tree is constructed by applying a depth-first traversal, starting from a canonical root and employing the child operation to generate objects. Several reverse search-based algorithms have been proposed for solving traditional enumeration problems, including all induced connected subgraphs, all spanning trees of a graph, all topological orderings of an acyclic graph, all dense subgraphs of a graph, and all maximal independent sets of a graph [13,14].
A reverse search algorithm, RS-MST, for enumerating all vertex-induced connected subgraphs has been introduced in [13], where the parent subgraph of a subgraph G is obtained by removing the vertex with the minimum degree in the spanning tree of G. A subgraph resulting from extending a subgraph G with a vertex v is a valid child of G if vertex v is a vertex with the minimum degree in the subgraph formed by adding the vertex v to the subgraph G. A similar approach can be applied for mining all edge-induced subgraphs. For these two reverse search algorithms, finding the MST to check for valid subgraph extension is a costly operation, considering that some extensions (invalid ones) will not be pursued in constructing the enumeration tree and will not be reported. The delay for the RS-MST algorithm is cubic in the number of nodes since we have to extract the minimum spanning tree for each extension and in the worst case none of the extensions is a valid child. A related problem to the enumeration of all connected induced subgraphs is the problem of enumerating all connected induced subgraphs of size at most k. Several algorithms have been proposed for solving this problem [15,16]. When k equals the number of nodes in the graph, the enumeration of all connected induced subgraphs of size at most k and the enumeration of all connected induced subgraphs become identical. An article on the algorithms for mining all connected induced subgraphs of size at most k has recently been published [17]. In [18], we proposed a reverse search algorithm for enumerating all vertex-induced connected subgraphs of a graph. The parent operation is based on the shortest distance of the newly added vertex to the first vertex that was added to the subgraph. The algorithm outperformed existing methods for vertex-induced subgraph enumeration. Moreover, we employed the enumeration approach to mine all maximal cohesive subgraphs from vertex-attributed graphs.
The proposed method takes an edge-growth approach to mine all connected frequent edgesets.

Methods
The backbone of the proposed frequent subgraph mining algorithm is an approach to enumerate all connected edge-induced subgraphs of a single graph. We first explain our enumeration approach for all connected edge-induced subgraphs and then extend this approach to mine all frequent and maximal frequent connected subgraphs.

Preliminaries
Let $G = (V, E)$ be an undirected graph, where $V = \{v_1, \cdots, v_n\}$ denotes the set of vertices and $E \subseteq V^2$ is the set of edges. For a vertex $v_i \in V$, $i$ is a unique identifier of that vertex, which is fixed but arbitrarily assigned.
Vertex-induced subgraph For a graph $G = (V, E)$ and a set of vertices $U \subseteq V$, the vertex-induced subgraph (induced subgraph), denoted as $G[U]$, is the subgraph $G[U] = (U, E_U)$ whose vertexset is $U$ and whose edgeset $E_U$ includes all edges whose endpoints are in $U$.
Edge-induced subgraph For a graph $G = (V, E)$ and an edgeset $S \subseteq E$, the edge-induced subgraph, denoted as $G[S]$, is the subgraph $G[S] = (V_S, S)$ whose edgeset is $S$ and whose vertexset $V_S$ includes all the endpoints of edges in $S$.

We call $S \subseteq E$ a connected edgeset if its corresponding edge-induced subgraph $G[S]$ is connected. A connected edge-induced subgraph can be uniquely identified by its corresponding connected edgeset and thus the two terms are used interchangeably.
Edge ordering: To maintain an edge ordering, an edge between vertices $v_i$ and $v_j$ is denoted as $(i, j)$ where $i < j$. We define a total order relation on the set of edges in the graph such that $(i, j) \preceq (k, \ell)$ if $i < k$, or $i = k$ and $j \leq \ell$.
Edge distance: The distance between two edges, denoted $d(e_i, e_j)$, in a connected graph is the number of non-terminal vertices (the vertices that connect consecutive edges) on a shortest path between the two edges. Under this definition, adjacent edges, which share an endpoint, have a distance of 1.
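The edge distance defined above can be computed with a breadth-first search that steps from an edge to any edge sharing an endpoint with it. The following is a minimal sketch, not the authors' implementation; the function name `edge_distance` and the representation of edges as pairs `(i, j)` with `i < j` are assumptions made for illustration.

```python
from collections import deque

def edge_distance(edges, a, b):
    """Distance between edges a and b, where two edges are one step apart
    when they share an endpoint. Returns None if b is unreachable from a."""
    edges = set(edges)
    dist = {a: 0}
    queue = deque([a])
    while queue:
        cur = queue.popleft()
        if cur == b:
            return dist[cur]
        for other in edges:
            # Visit every not-yet-reached edge that shares an endpoint with cur.
            if other not in dist and set(cur) & set(other):
                dist[other] = dist[cur] + 1
                queue.append(other)
    return None

# Path 1-2-3-4: consecutive edges share exactly one vertex.
path = [(1, 2), (2, 3), (3, 4)]
print(edge_distance(path, (1, 2), (2, 3)))  # 1 (adjacent edges)
print(edge_distance(path, (1, 2), (3, 4)))  # 2
```

The scan over all edges in each step keeps the sketch short; an adjacency index keyed by vertex would be the natural choice for large graphs.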

Edge neighborhood: For an edge $e$, the set of all adjacent edges of $e$ is referred to as the neighborhood of $e$, and is denoted as $N(e)$. The neighborhood of $e$ is defined as the set of edges with a distance of 1 to $e$.
Subgraph neighborhood: For an edgeset U ⊆ E, the set of neighboring edges in a graph G = (V , E), denoted as N(U), contains the set of edges not in U that have at least one neighboring edge in U.
Anchor edge: The smallest edge in an edgeset $U$ is denoted $anchor(U)$, i.e., $anchor(U) = s$ such that $s \preceq e_i$, $\forall e_i \in U \setminus \{s\}$.
Closer to anchor: For a connected edgeset $U \subseteq E$ with $s = anchor(U)$, and any two edges $e_i, e_j \in E \setminus U$, we say that $e_i$ is 'closer' to $U$ than $e_j$, denoted as $e_i \prec_U e_j$, if $d(e_i, s) < d(e_j, s)$, or $d(e_i, s) = d(e_j, s)$ and $e_i \prec e_j$.
Utmost edge: For a connected edgeset $U$ with $s = anchor(U)$, the largest edge in $U$ with the longest distance to the anchor edge is called the utmost edge and is denoted as $utmost(U)$. If there is more than one edge whose distance equals the longest distance, we take the largest edge according to the order relation, i.e., $utmost(U) = e$ such that $e \in U \setminus \{s\}$ and, for every other edge $e_i \in U \setminus \{s, e\}$, either $d(e_i, s) < d(e, s)$, or $d(e_i, s) = d(e, s)$ and $e_i \prec e$.
Connected edge-induced subgraphs: The set of all connected edgesets of a graph $G$ is denoted as

$CEIS(G) = \{S \mid S \subseteq E \text{ and } G[S] \text{ is connected}\}$

In this paper, we propose a reverse search algorithm for enumerating all connected edgesets of an undirected graph.

Search graph
For a single connected graph, the enumeration of the set of connected edgesets can be represented by a directed search graph in which nodes represent connected edgesets and there is a directed edge between two edgesets, (X, Y ) if Y = X ∪ {e} and the deletion of e from Y keeps X connected. In the search graph, a search node (say, Y ) can have multiple incoming edges as multiple connected edgesets can lead to the same connected edgeset. A naive approach to traverse the entire set of all connected edgesets is to grow an edgeset by extending it with one of its neighbor edges and checking in a global list whether the edgeset has been enumerated before to avoid duplicate listings. Given the combinatorial nature of connected edgesets, this approach is inefficient as it enumerates each edgeset many times, and the number of distinct edgesets grows exponentially with the size of the graph.
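For comparison, the naive approach described above can be sketched as follows. This is an illustrative baseline only, with a hypothetical function name; it keeps every edgeset it has seen in a global set, which is exactly the memory and duplicate-generation cost that reverse search avoids.

```python
def naive_connected_edgesets(edges):
    """Enumerate all connected edgesets by growing each edgeset with a
    neighboring edge and discarding edgesets already seen."""
    edges = list(edges)
    seen = set()
    stack = [frozenset([e]) for e in edges]
    while stack:
        U = stack.pop()
        if U in seen:
            continue  # duplicate: already generated from another sub-edgeset
        seen.add(U)
        touched = {v for e in U for v in e}  # vertices covered by U
        for c in edges:
            # c is a neighboring edge if it is outside U and touches U.
            if c not in U and touched & set(c):
                stack.append(U | {c})
    return seen

triangle = [(1, 2), (1, 3), (2, 3)]
print(len(naive_connected_edgesets(triangle)))  # 7
```

On the triangle, the full edgeset is pushed onto the stack once from each of the three two-edge sets, and only the first copy survives the membership test.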

Reverse search
The algorithm builds and traverses the connected edgesets search tree wherein nodes in the tree correspond to connected edgesets and arcs correspond to the parent-children relations between these edgesets. The arcs in the search tree are defined by a neighborhood function that defines a set of search nodes that can be generated from a search node; this set is referred to as the valid children of a search node. The outgoing nodes of a search node constitute the valid search nodes that can be obtained from the search node. Each edgeset appears only once in the search tree, and there is only one incoming link to each edgeset from its unique parent edgeset. We enumerate the set of connected edgesets by depth-first traversal of the search tree. In this section, we define the parent operation and a data structure that allows for efficient parent/child operations.

Parent child relationship
If a search node Y corresponding to a connected edgeset can be obtained from a unique search node (say X), then X is called the parent node and Y is called the child node. The edgeset X can be obtained by deleting a specific edge from the edgeset Y.

Lemma 1 Let $U$ be a connected edgeset with $s = anchor(U)$ and $e = utmost(U)$, then $G[U \setminus \{e\}]$ is also connected.
Proof We prove this claim by contradiction. For a connected edgeset $U$, let $e$ be an edge with the longest shortest-path distance from $s$ and, for contradiction, assume that deleting $e$ results in a disconnected graph. This means that there exists at least one edge $e'$ such that every path between $s$ and $e'$, and in particular every shortest path, goes through $e$. Let $p_{ab}$ denote a shortest path between two edges $a$ and $b$ and $w(p_{ab})$ denote the length of the path. Moreover, let $p_{se'} = \langle s, \cdots, e, \cdots, e' \rangle$ be a shortest path from $s$ to $e'$. Then the shortest distance between $s$ and $e'$, $w(p_{se'}) = w(p_{se}) + w(p_{ee'})$, is greater than the shortest distance between $s$ and $e$, i.e., $w(p_{se'}) > w(p_{se})$. This contradicts our assumption that $e$ is an edge with the longest shortest-path distance from $s$ in $U$. Thus, $G[U \setminus \{e\}]$ is connected.

Valid children
Building on Lemma 1, we can expand a node $U$ in the search tree to construct one of its child nodes, $U^*$. For a connected edgeset $U$ with $s = anchor(U)$, $e = utmost(U)$ and a neighboring edge $e' \in N(U)$ such that $s \prec e'$, the edgeset $U^* = U \cup \{e'\}$ is a valid child of $U$ if and only if one of the following conditions holds:
1 The edge $e'$ is farther from $s$ than $e$, i.e., $d(e', s) > d(e, s)$.
2 Both $e$ and $e'$ have the same distance to $s$, but $e \prec e'$.
The definition of valid children ensures that the newly added edge to the child node has the longest distance from the anchor, and if multiple edges have the same longest distance to the anchor, the newly added edge is the largest. The proposed reverse search parent-child relation is the backbone of the enumeration tree neighborhood function, $N : CEIS(G) \rightarrow 2^{CEIS(G)}$. The parent-child relation guarantees that a search node only appears once in the range of the neighborhood function. We build a directed search graph whose nodes correspond to connected edgesets and there is a directed edge from node $X$ to $Y$ if edgeset $Y$ is a child of $X$. For a child $Y$, let $X = P(Y)$ denote its parent. The children of a connected edgeset $X \in CEIS(G)$ in the search graph are the edgesets whose parent is $X$, i.e., $N(X) = \{Y \in CEIS(G) \mid P(Y) = X\}$. If $U$ and $U \cup \{e'\}$ are edgesets corresponding to a parent and a child node, respectively, we call $e'$ a valid candidate of $U$; otherwise, we call it an invalid candidate of $U$. For an edgeset $U$, the set of neighboring edges, $N(U)$, constitutes the candidate edges and can be partitioned into valid and invalid candidates.

Figure 1a shows a sample graph, and Fig. 1b shows the enumeration tree of the set of all connected edgesets of this graph. Edges are uniquely labeled starting from 1. The edges of an edgeset are written inside the oval shape and the set of candidate edges is written adjacent to the oval shape. Figure 1b shows that edgeset U = {1, 3} has {2, 4, 5} as the candidate edges; anchor(U) = 1 and utmost(U) = 3. Edge 2 is not a valid candidate because its distance to edge 1 is the same as the distance of the utmost edge 3 to edge 1 but 2 is less than 3 in the order of the edges; thus the branch corresponding to {1, 3, 2} will not be explored. Edge 4 has the same distance as edge 3 to edge 1, but since 4 is greater than 3, edge 4 is a valid candidate. The distance of edge 5 to edge 1 is larger than the distance of the utmost edge, and thus it is a valid candidate.
For the edgeset {2, 5} with a candidate set {1, 3, 4}, neither edge 1 nor edge 3 is a valid candidate: edge 1 is less than the anchor 2, and edge 3 has the same distance as edge 5 to edge 2 but edge 3 is less than 5. Edge 4 is a valid candidate because its distance to edge 2 is larger than the distance of the utmost edge 5 to edge 2. For single-edge search nodes in level 1, if the candidate edge is smaller than the anchor edge, then it is an invalid edge.
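The valid-candidate rules can be written as a small predicate. The sketch below follows the three checks described above; the function signature and the `dist` dictionary are assumptions for illustration, and the distance values are hypothetical numbers chosen only to agree with the textual description of Fig. 1b, since the figure itself is not reproduced here.

```python
def is_valid_extension(e_j, s, x, dist):
    """Return True if adding candidate edge e_j to an edgeset with anchor s
    and utmost edge x gives a valid child. dist[e] is the distance of e to s."""
    if e_j < s:                      # candidates smaller than the anchor are rejected
        return False
    if dist[e_j] > dist[x]:          # strictly farther from the anchor than the utmost edge
        return True
    return dist[e_j] == dist[x] and e_j > x   # tie on distance: the larger edge wins

# Edgeset U = {1, 3}: anchor 1, utmost 3, candidates {2, 4, 5}.
# Assumed distances to edge 1, consistent with the description in the text.
dist = {3: 1, 2: 1, 4: 1, 5: 2}
for c in (2, 4, 5):
    print(c, is_valid_extension(c, s=1, x=3, dist=dist))
# 2 False
# 4 True
# 5 True
```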

Enumerating all subgraphs of a single graph
Algorithm 1 shows the pseudo-code for our algorithm. For each edge in the graph, we call EnumerateCEIS, a recursive procedure. The procedure takes a connected edgeset $E_s$, the set of candidate edges $C$ and the utmost edge of the edgeset, $utmost(E_s)$. For each edge $e_j$ in the candidate set (line 6), the procedure checks if the edge is a valid candidate (line 7) for extending $E_s$. If so, it updates the candidate set and recursively calls the EnumerateCEIS procedure (lines 8-9). The candidate set can be updated by using the current candidate set and the neighbors of the last added edge, $N(e_j)$. To update the candidate set, we add the neighbors of the current edge $e_j$ that are not already in the candidate set $C$ or in the current edgeset $E_s$ (line 10). The isValidExtension procedure (line 14) checks if the edge $e_j$ is a valid candidate for the edgeset $E_s$ following the rules in the valid children section.

Proof Correctness means that each initial recursive call in Algorithm 1 (line 2) with $s$ as the anchor edge will generate all the edge-induced subgraphs whose anchor is $s$ under the enumeration tree rooted at $s$. First, all single-edge subgraphs will be enumerated because we output the single edge the first time we call the recursive procedure (lines 2 and 5). We now prove that all connected edge-induced subgraphs with anchor $s$ and at least two edges will be enumerated. For any connected edgeset $U \subseteq E$ with $k = |U| \geq 2$ and $s = anchor(U)$, we show how $G[U]$ is constructed. Let $U_s$ denote the sorted edges in $U$, $U_s = \{e_1, e_2, \cdots, e_k\}$, such that $e_1 = s$ and for all $1 < i < k$, either $d(e_i, s) < d(e_{i+1}, s)$, or $d(e_i, s) = d(e_{i+1}, s)$ and $e_i \prec e_{i+1}$. There is a unique sequence of recursive calls that generates this $G[U]$: the initial call with $E_s = \{e_1\}$ calls the procedure with $E_s = \{e_1, e_2\}$, and so on, until the call with $E_s = \{e_1, e_2, \cdots, e_{k-1}\}$ makes the last call with $E_s = \{e_1, e_2, \cdots, e_{k-1}, e_k\}$.
Each recursive call in this sequence will be executed because for all $2 \leq i \leq k$, the connected edgeset $E_s^* = \{e_1, \cdots, e_i\}$ is a valid child of $E_s = \{e_1, \cdots, e_{i-1}\}$. Note that the last edge added in each call satisfies the valid child rules. This proves the completeness of the algorithm.
Next we show that the enumeration approach does not generate redundant subgraphs, i.e., each connected edge-induced subgraph is generated once. For a connected edge-induced subgraph $G[U]$ with $k = |U|$ and $s = anchor(U)$, let $U_s$ be the set of sorted edges in $U$ with respect to $s$, $U_s = \{s, e_2, \cdots, e_k\}$. There is a unique path from the root $s$ to the subgraph $G[U]$. The subgraph is obtained by starting from the subgraph $G[\{s\}]$ and adding one edge at a time in the same order as in $U_s$.
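The enumeration can be sketched in Python as follows. This is an illustrative sketch rather than the authors' Algorithm 1: it assumes that the distance of a candidate edge to the anchor is measured inside the edgeset being grown (one more than the smallest distance among its neighbors already in the edgeset), it recomputes candidates in every call instead of maintaining the arrays described in the paper, and therefore it does not achieve the stated delay bound. The names `enumerate_ceis` and `grow` are hypothetical.

```python
def enumerate_ceis(edges):
    """Return every connected edgeset of the graph given by `edges`
    (pairs (i, j) with i < j), each generated exactly once."""
    edges = sorted(edges)
    # Two edges are neighbors when they share an endpoint.
    nbrs = {e: {g for g in edges if g != e and set(e) & set(g)} for e in edges}
    found = []

    def grow(U, dist, anchor, utmost):
        found.append(frozenset(U))
        candidates = {c for e in U for c in nbrs[e]} - U
        for c in sorted(candidates):
            if c < anchor:
                continue  # edges smaller than the anchor are never added
            # Assumption: distance to the anchor within the grown edgeset.
            d = 1 + min(dist[g] for g in nbrs[c] if g in U)
            # Valid child: c is farther than the utmost edge, or ties and is larger.
            if (d, c) > (dist[utmost], utmost):
                U.add(c)
                dist[c] = d
                grow(U, dist, anchor, c)
                U.remove(c)
                del dist[c]

    for s in edges:
        grow({s}, {s: 0}, s, s)  # a single edge acts as its own utmost edge
    return found

triangle = [(1, 2), (1, 3), (2, 3)]
result = enumerate_ceis(triangle)
print(len(result), len(set(result)))  # 7 7
```

On the triangle, the call rooted at edge (1, 2) reaches {(1, 2), (1, 3), (2, 3)} through {(1, 2), (1, 3)} only; the extension of {(1, 2), (2, 3)} with (1, 3) is rejected because (1, 3) ties on distance and is smaller than the utmost edge (2, 3).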

Complexity analysis
Since the number of reported subgraphs can be exponential in the number of edges of the graph, we analyze the time the algorithm takes to report the first subgraph and the time it takes to report a subgraph after it has generated the previous one [13]. This duration is defined as the delay. An enumeration algorithm is called a polynomial delay algorithm if its delay is polynomial in the input size [19]. The proposed connected edge-induced subgraph enumeration approach has linear delay, which improves on the best previous cubic delay.
Theorem 2 Algorithm 1 is a linear delay and a linear space algorithm with respect to the number of edges of the graph.
Proof We use an array-based implementation in which we maintain the set of edges of an edgeset, the candidate edges and the distance of the candidate edges to the anchor edge. Using this data structure, the anchor edge, the utmost edge, and the distance of an edge to the anchor edge can be accessed in constant time. The algorithm checks if an edge is a valid candidate of the edgeset in constant time O(1) (Algorithm 1 line 7). In the worst case, the candidate set contains all the edges and every candidate edge is invalid, so the algorithm takes O(|E|) time. To prove the linear delay, we employ the alternative output method proposed in [19] to reduce the delay of the algorithm. The algorithm is an internal output algorithm since it outputs a solution for each recursive call. Following the alternative output method, the algorithm outputs a subgraph before starting the recursive EnumerateCEIS calls if the depth of the recursive call is odd (Algorithm 1 line 5) and outputs the subgraph after the recursive calls if the depth is even (Algorithm 1 line 12). Therefore, each connected edgeset can be enumerated with linear delay O(|E|). For generating the first subgraph, the algorithm takes constant time since every initial recursive call (Algorithm 1 line 2) outputs a subgraph with a single edge. If the graph has multiple connected components, we can run the algorithm for each component and the delay will be $O(|E_c|)$, where $|E_c|$ is the number of edges in the largest connected component.

Algorithm 1 (excerpt: lines 16-23, the end of the isValidExtension function)
16: if $e_j < s$ then
17:     return False
18: end if
19: if distance($e_j$, $s$) > distance($x$, $s$) then
20:     return True
21: end if
22: return distance($e_j$, $s$) = distance($x$, $s$) and $e_j > x$
23: end function
An algorithm is output polynomial if it outputs all the elements to be enumerated in time polynomial in the number of elements. Since the proposed algorithm takes linear time for each connected edgeset, it is output (or total) polynomial in the number of connected edgesets; output polynomiality follows from the polynomial delay for each output. The algorithm explores the search tree in a depth-first manner, which ensures that the space used is bounded by the depth of the search tree, which is at most |E|. We use three arrays, each of size |E|, to keep track of which edges are in the connected edgeset, their neighbors, and their distances to the anchor edge. So, the depth-first search of the enumeration tree can be done in space linear in the depth of the enumeration tree, which is O(|E|).

Mining frequent subgraphs
In many applications, we have a dataset of graphs and the goal is to extract significant subgraphs. In the frequent subgraph mining problem, the goal is to mine subgraphs that appear in at least a user-defined minimum threshold of the graphs. In this work, we are only concerned with connected frequent subgraphs.
Graph Dataset Let $\mathcal{G} = \{G_1, G_2, \cdots, G_n\}$ denote a set of $n$ undirected graphs. For an undirected graph $G_i = (V, E_i)$, $i \in \{1, \cdots, n\}$, $V = \{v_1, v_2, \cdots, v_k\}$ denotes the set of vertices and $E_i \subseteq V^2$ denotes the set of edges of the corresponding graph. All the graphs are defined over the same set of vertices. In this work, we represent the dataset $\mathcal{G}$ of $n$ graphs as an edge-attributed graph, $G = (V, E, f)$, where $V$ is the set of vertices, $E$ is the set of all the edges in the graph dataset and $f$ is an edge attribute function. The edge attribute function $f$ maps each edge to the set of graphs in which it appears. The set of all edges is the union of the sets of edges in each graph. We label the edges in the edge-attributed graph with unique identifiers $\{1, 2, \cdots, |E|\}$. Figure 2 shows a toy graph dataset of four graphs in (a) and the corresponding edge-attributed graph in (b).
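Building the edge-attributed graph amounts to recording, for each edge, the graphs that contain it. The sketch below uses a hypothetical three-graph dataset (not the one in Fig. 2) and an assumed function name; graphs are numbered from 1 and edges are stored as pairs `(i, j)` with `i < j`.

```python
def build_edge_attributes(graphs):
    """Map every edge to the set of (1-based) indices of the graphs containing it."""
    f = {}
    for idx, edge_list in enumerate(graphs, start=1):
        for i, j in edge_list:
            edge = (min(i, j), max(i, j))  # normalize to (i, j) with i < j
            f.setdefault(edge, set()).add(idx)
    return f

# Hypothetical toy dataset: three graphs over the vertices 1..4.
graphs = [
    [(1, 2), (2, 3)],
    [(1, 2), (2, 3), (3, 4)],
    [(2, 3), (3, 4)],
]
f = build_edge_attributes(graphs)
# Unique edge identifiers 1..|E|, assigned in the total order on edges.
edge_ids = {edge: k for k, edge in enumerate(sorted(f), start=1)}
print(sorted(f[(2, 3)]))   # [1, 2, 3]
print(edge_ids[(3, 4)])    # 3
```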
Supporting graphs Given a set of graphs $\mathcal{G}$, the set of supporting graphs of an edge-induced subgraph $G_s$ is defined as follows: $sup(\mathcal{G}, G_s) = \{G_i \mid G_s \subseteq G_i \text{ and } G_i \in \mathcal{G}\}$. When the graph dataset is clear from the context, we refer to the supporting graphs as $sup(G_s)$. The cardinality of the supporting graphs is referred to as the support of the subgraph, i.e., $|sup(G_s)|$.
Frequent subgraph Given a graph dataset $\mathcal{G}$ and a user-specified support threshold $S_{min}$, a subgraph $G_s$ is called frequent if its support is equal to or greater than the support threshold, i.e., $G_s$ is a frequent subgraph if $|sup(\mathcal{G}, G_s)| \geq S_{min}$. Since an edge-induced subgraph is uniquely identified by its edgeset, we use frequent subgraphs and frequent edgesets interchangeably.
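Because all graphs share the same uniquely labeled vertices, a subgraph is contained in a graph exactly when each of its edges is, so the supporting graphs of an edgeset are the intersection of the edge attribute sets of its edges. A minimal sketch, with hypothetical function names and a hypothetical edge attribute map `f`:

```python
def supporting_graphs(edgeset, f):
    """Graphs containing every edge of a non-empty `edgeset`: the
    intersection of the attribute sets f[e] over its edges."""
    edge_iter = iter(edgeset)
    support = set(f[next(edge_iter)])
    for e in edge_iter:
        support &= f[e]
    return support

def is_frequent(edgeset, f, s_min):
    """True if the support of `edgeset` reaches the threshold s_min."""
    return len(supporting_graphs(edgeset, f)) >= s_min

# Hypothetical edge attribute map: edge -> indices of the graphs containing it.
f = {(1, 2): {1, 2}, (2, 3): {1, 2, 3}, (3, 4): {2, 3}}
print(sorted(supporting_graphs({(1, 2), (2, 3)}, f)))       # [1, 2]
print(is_frequent({(1, 2), (2, 3)}, f, s_min=2))            # True
print(is_frequent({(1, 2), (2, 3), (3, 4)}, f, s_min=2))    # False
```

The last call also illustrates the anti-monotone behavior of support: adding the edge (3, 4) can only shrink the set of supporting graphs.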
Problem definition Given a graph dataset $\mathcal{G}$ and a support threshold $S_{min}$, the problem of mining the set of frequent subgraphs is to enumerate the set $F = \{G_{s_1}, G_{s_2}, G_{s_3}, \cdots, G_{s_{|F|}}\}$ such that every $G_{s_i} \in F$ is a frequent connected subgraph, i.e., $|sup(\mathcal{G}, G_{s_i})| \geq S_{min}$. For the graph dataset in Fig. 2a, the set of frequent subgraphs for a minimum support of 3 is shown in Fig. 2c. Given a minimum support threshold $S_{min}$, the anti-monotone support constraint guarantees that if a subgraph $G_s$ is frequent, then each subgraph $G^*$ of $G_s$ is also frequent, i.e., $|sup(G_s)| \geq S_{min} \implies |sup(G^*)| \geq S_{min}$ for all $G^* \subset G_s$.
Our proposed algorithm for mining all frequent subgraphs employs the reverse search enumeration approach of Algorithm 1 to enumerate all connected subgraphs while enforcing the support constraint. The anti-monotone property of the support of a subgraph is used to prune search branches: when an infrequent subgraph is encountered, the recursive procedure EnumerateFCIS is not called and the search subtree rooted at this infrequent subgraph is not enumerated. The enumeration tree for the set of frequent subgraphs is shown in Fig. 3b.
The algorithm for mining frequent subgraphs is shown in Algorithm 2. In line 1, infrequent edges are pruned, and the recursive EnumerateFCIS procedure is called for each frequent edge (line 3). The recursive procedure follows the same steps as the enumeration approach in Algorithm 1, except for the if statement in line 9, which ensures that only search branches rooted at frequent subgraphs are explored. The recursive procedure is called only on frequent children (line 11). Therefore, only frequent subgraphs are added to the set of frequent subgraphs in line 6.
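The effect of pruning infrequent edges and infrequent children can be illustrated with a small sketch. Note that this is not the reverse search of Algorithm 1 or the EnumerateFCIS procedure: duplicate subgraphs are avoided here with a plain `seen` set, which is a simpler technique swapped in for illustration. The function name, the toy edges and the attribute map are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy data: edge identifier -> endpoints, and -> supporting graphs.
EDGES: Dict[int, Tuple[str, str]] = {1: ("a", "b"), 2: ("a", "d"), 3: ("b", "c"), 4: ("c", "d")}
F: Dict[int, Set[int]] = {1: {1, 2, 3}, 2: {2, 4}, 3: {1, 2, 3, 4}, 4: {1, 3, 4}}


def mine_frequent_connected(
    edges: Dict[int, Tuple[str, str]], f: Dict[int, Set[int]], s_min: int
) -> List[FrozenSet[int]]:
    """Enumerate all frequent connected edgesets by growing one edge at a time."""
    # Prune infrequent edges up front; they cannot occur in a frequent edgeset.
    frequent_edges = [e for e in edges if len(f[e]) >= s_min]
    seen: Set[FrozenSet[int]] = set()   # stand-in for the reverse-search validity test
    found: List[FrozenSet[int]] = []

    def grow(es: FrozenSet[int], sup: FrozenSet[int]) -> None:
        found.append(es)
        vertices = {v for e in es for v in edges[e]}
        for e in frequent_edges:
            if e in es or not (set(edges[e]) & vertices):
                continue                # keep the edgeset connected
            child = es | {e}
            if child in seen:
                continue                # already reached through another parent
            child_sup = sup & f[e]
            if len(child_sup) < s_min:
                continue                # anti-monotone pruning: skip the whole subtree
            seen.add(child)
            grow(child, child_sup)

    for e in frequent_edges:
        root = frozenset([e])
        seen.add(root)
        grow(root, frozenset(f[e]))
    return found


frequent = mine_frequent_connected(EDGES, F, s_min=3)
```

On this toy input, edge 2 is pruned in the first step, and the infrequent edgeset `{1, 3, 4}` is never expanded, so five frequent connected edgesets are reported.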

Algorithm 2 Mining All Connected Frequent Subgraphs
Input: a graph dataset $\mathcal{G}$ and a minimum support threshold $S_{min}$
Output: the set $F$ of all frequent subgraphs

Mining maximal frequent subgraphs
Because of the downward closure property of frequent subgraphs, whereby all the subgraphs of a frequent subgraph are frequent, there is high overlap among frequent subgraphs. A representative set of all frequent subgraphs is a concise summarization of the frequent subgraphs in the dataset. We thus propose an algorithm for mining maximal frequent subgraphs. Recall that a maximal frequent subgraph is a frequent subgraph that does not have any frequent supergraph, i.e., $G_s$ is maximal if $|sup(G_s)| \geq S_{min}$ and there is no $G' \supset G_s$ with $|sup(G')| \geq S_{min}$. Although doing so is not efficient, all frequent subgraphs can be extracted from the set of maximal frequent subgraphs, since all subgraphs of a maximal frequent subgraph are frequent. However, the exact support of each frequent subgraph cannot be obtained from the maximal frequent subgraphs. Due to the combinatorial nature of frequent subgraphs, the set of maximal frequent subgraphs is much smaller than the set of all frequent subgraphs.
Problem definition. Given a graph dataset $\mathcal{G}$ and a support threshold $S_{min}$, the problem of mining the set of maximal frequent subgraphs is to enumerate the set $M = \{G_{s_1}, G_{s_2}, \ldots, G_{s_{|M|}}\}$ such that every $G_{s_i} \in M$ is a maximal frequent connected subgraph. For the graph dataset of four graphs shown in Fig. 2a and minimum support $S_{min} = 3$, there are two maximal frequent subgraphs, drawn inside dotted circles in Fig. 2c. These are the same subgraphs that appear inside dotted circles in Fig. 3c.
In the enumeration tree for mining frequent subgraphs, every leaf search node is potentially a maximal frequent subgraph. A leaf is not always a maximal frequent subgraph because it may have a frequent extension that was not explored, since that extension is not a valid extension at this stage of the enumeration tree. A simple algorithm for mining all maximal frequent subgraphs is therefore to traverse the frequent subgraph enumeration tree and to report the subgraphs that do not have any frequent extension, valid or invalid. This algorithm is a straightforward modification of Algorithm 2. To decide locally whether a subgraph is a maximal frequent subgraph, we switch lines 8 and 9 in Algorithm 2, so that the frequency of an extension is checked before its validity. We also add a flag before line 7 that is set to true. If the extended subgraph is frequent, $|sup(E_s \cup \{e\})| \geq S_{min}$, the flag is set to false, indicating that the subgraph is not maximal. After the for loop, we add the subgraph to the output list if the flag is still true. Note that this approach does not need a global list of the already mined maximal frequent subgraphs such as the one employed in the MULE algorithm.
Following this mining approach, the enumeration tree for maximal frequent subgraphs would look like the tree in Fig. 3b: we would need to enumerate all 20 frequent subgraphs to obtain the two maximal subgraphs. Enumerating the search tree of frequent subgraphs is computationally expensive, especially for low minimum support thresholds, for which the search tree becomes very large. A more efficient approach is to mine the set of all maximal frequent subgraphs without enumerating the whole frequent subgraph enumeration tree. In the following subsections, we develop pruning strategies that eliminate the need to traverse some search branches without sacrificing the correctness of the results. In the experiments section, we demonstrate that the proposed pruning strategies result in a significant performance improvement.
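The local maximality test with a flag can be sketched as follows. This is an illustration of the unpruned approach only, not of RASMA with its pruning rules, and the reverse-search validity test is replaced by a plain `seen` set as a simplifying assumption. The function name and the toy data are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical toy data: edge identifier -> endpoints, and -> supporting graphs.
EDGES: Dict[int, Tuple[str, str]] = {1: ("a", "b"), 2: ("a", "d"), 3: ("b", "c"), 4: ("c", "d")}
F: Dict[int, Set[int]] = {1: {1, 2, 3}, 2: {2, 4}, 3: {1, 2, 3, 4}, 4: {1, 3, 4}}


def mine_maximal_frequent(
    edges: Dict[int, Tuple[str, str]], f: Dict[int, Set[int]], s_min: int
) -> List[FrozenSet[int]]:
    """Report frequent connected edgesets that have no frequent extension."""
    # An extension by an infrequent edge is infrequent, so such edges are skipped.
    candidates = [e for e in edges if len(f[e]) >= s_min]
    seen: Set[FrozenSet[int]] = set()   # stand-in for the reverse-search validity test
    maximal_sets: List[FrozenSet[int]] = []

    def visit(es: FrozenSet[int], sup: FrozenSet[int]) -> None:
        vertices = {v for e in es for v in edges[e]}
        maximal = True                  # flag set to true before the loop
        for e in candidates:
            if e in es or not (set(edges[e]) & vertices):
                continue
            child_sup = sup & f[e]
            if len(child_sup) >= s_min:  # frequency is checked first ...
                maximal = False          # ... for every extension, valid or not
                child = es | {e}
                if child not in seen:    # then decide whether to recurse
                    seen.add(child)
                    visit(child, child_sup)
        if maximal:
            maximal_sets.append(es)

    for e in candidates:
        root = frozenset([e])
        seen.add(root)
        visit(root, frozenset(f[e]))
    return maximal_sets


maximal = mine_maximal_frequent(EDGES, F, s_min=3)
```

The frequency check runs for every adjacent edge before the decision to recurse, which is why this variant performs more frequency checks than plain frequent subgraph mining while visiting the same search nodes.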

Consumed by a sibling
For a graph dataset $\mathcal{G}$ and a connected frequent edgeset $S \subseteq E$, let $e_i$ and $e_j$ be two valid candidate edges of $G[S]$ such that $e_i$ is closer to $anchor(S)$ than $e_j$, these two extensions generate two frequent subgraphs, $G[S \cup \{e_i\}]$ and $G[S \cup \{e_j\}]$, and the set of supporting [...]. Under this scenario, these two subgraphs are not maximal frequent subgraphs because any maximal frequent subgraph that is a supergraph of $G[S \cup \{e_i\}]$ will also be a supergraph of $G[S \cup \{e_j\}]$. This conclusion is reached by observing that any maximal subgraph that is a supergraph of $G[S \cup \{e_i\}]$ can be extended with $e_j$ without violating the minimum support threshold because we have the graphs that contain [...] We will show that $G[S \cup \{e_i, e_j\}]$ is also a frequent subgraph that can be extended from [...]

maximal ← true
10: for $e_i \in C$ do
11: if $|sup(E_s)| \geq S_{min}$ then
13: maximal ← false
14: if ISVALIDEXTENSION($E_s$, $e_i$, $e$) then
15: if $e_i$ covered by a smaller edge in $C$ or covered by the parent then

Algorithm
Algorithm 3 shows the proposed RASMA algorithm. The algorithm follows the enumeration approach for mining frequent subgraphs and employs the pruning strategies to avoid visiting subtree branches that will not result in maximal frequent subgraphs. In line 1, frequent edges are extracted, and in lines 3-7, a search subtree is traversed from each frequent edge. Frequent edges that are covered by a neighboring smaller edge are not explored, by virtue of Pruning Rule 3 (line 4). In the MineMaximalSubgraph procedure, for each edge in the candidate edges $C$, if the extension would generate a frequent subgraph, then we set the maximal flag to false, indicating that the current subgraph is not maximal [14][15][16][17][18][19]. We add the current subgraph to the set of maximal frequent subgraphs if the maximal flag is still true (line 24).

Results
We tested the performance of RASMA on mining frequent and maximal frequent subgraphs from gene coexpression networks. Moreover, to investigate the impact of the pruning rules, we compared the running time of the algorithm with and without the pruning rules. All experiments were performed on a machine with an Intel Xeon 2.40 GHz processor and 16 GB of main memory, running the Linux operating system. RASMA was implemented in C++, and the MULE implementation was in C.

Performance on tissue gene expression
We tested the proposed algorithm on 35 tissue gene coexpression networks constructed by the Gene Genetic Network Analysis Tool [20]. The coexpression networks were inferred from Genotype-Tissue Expression (GTEx) data. Each coexpression network is constructed from the gene expression of non-diseased tissue samples. On average, there are 145,114 coexpression links (edges) in each network among 9,998 genes. In total, there are 1,548,622 unique coexpression edges that appear in at least one coexpression network. Among these edges, 55,558 appear in at least 10 networks, 4,127 appear in at least 20 networks, and 554 appear in at least 30 networks. On average, each edge appears in 3.28 networks.
Table 1 shows how the numbers of frequent and maximal frequent subgraphs ($|F|$ and $|M|$) and the running times of MULE and RASMA for mining the frequent and maximal frequent subgraphs vary with the minimum support threshold. For mining the maximal subgraphs, the proposed algorithm is orders of magnitude faster than the MULE algorithm for low support thresholds. The MULE algorithm is much slower for mining maximal frequent subgraphs since it has to enumerate the same enumeration tree as for mining all frequent subgraphs. Moreover, for each potential maximal subgraph, the MULE algorithm checks whether it has a supergraph in a global list. For mining all the frequent subgraphs, both algorithms have similar running times, and for a support threshold of 15 neither finished the mining task within two days.
Table 2 shows the topological properties of the reported subgraphs and the running times of RASMA for lower support thresholds. For calculating the topological properties of the maximal frequent subgraphs, only subgraphs with at least three edges (denoted $|M^*|$) are considered, since a large percentage of the maximal frequent subgraphs have only one or two edges. The number of maximal frequent subgraphs with at least three edges ($|M^*|$) increases as the support threshold decreases, and so do the average number of edges ($|E|$), the average number of nodes ($|V|$), and the average density (Density).
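The reported average number of networks per edge is consistent with the other dataset statistics: it is the total number of edge occurrences across the 35 networks divided by the number of unique edges. The check below treats the reported per-network average of 145,114 edges as exact, which is an assumption, so the result is approximate.

```latex
\[
\frac{35 \times 145{,}114}{1{,}548{,}622}
  = \frac{5{,}078{,}990}{1{,}548{,}622}
  \approx 3.28
\]
```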

Effectiveness of pruning rules
The pruning rules aim at reducing the number of search nodes explored while mining the maximal frequent subgraphs. The closer to the root the pruning occurs, the more search nodes are eliminated. To investigate how the proposed pruning rules improve the performance of the algorithm, we compare it with a version in which the pruning rules are disabled. Table 3 shows the impact of the pruning rules on the running time and on the number of frequent subgraphs explored. Although the algorithm for frequent subgraphs and the algorithm for maximal subgraphs without pruning enumerate the same frequent subgraph search tree, it is important to notice that the latter is much slower. The reason is that, to mark a subgraph as maximal, all of its immediate potential children have to be checked for frequency (line 12 in Algorithm 3), regardless of whether the extension is a valid child. Therefore, the number of frequency checks is much larger than the number of frequent search nodes explored. In contrast, the algorithm for frequent subgraphs checks whether an extension is frequent only for the valid children (lines 8-9 in Algorithm 2). The algorithm with pruning strategies traverses only a very small fraction of the frequent subgraphs (0.000028 for a support threshold of 16) while maintaining the completeness of the set of maximal frequent subgraphs.

Analysis of maximal frequent subgraphs
We performed a biological enrichment analysis of the gene sets (nodes) of the maximal frequent subgraphs. A biological annotation is said to be enriched in a gene set if a significant subset of the genes in the gene set are annotated with the given annotation. We tested the overrepresentation of genes with a specific annotation in a gene set using the hypergeometric test (with a p-value threshold of 0.01). We used multiple annotation databases from the Molecular Signatures Database (MSigDB) [21,22] to assess the enrichment of the genes in the reported subgraphs. Table 4 shows the percentage of the maximal frequent subgraphs whose gene sets are biologically enriched for each of the three biological signatures. Some subgraphs are enriched with several signatures, and some signatures are enriched in the genes of multiple subgraphs. Moreover, the reported subgraphs are enriched with a large number of biological annotations for each of the databases. Only maximal frequent subgraphs with at least three edges were considered for the analysis (denoted as $|M^*|$ in the table). The enrichment analysis shows that frequent subgraphs are highly enriched with known biological annotations.
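A hypergeometric overrepresentation test of this kind can be sketched in a few lines. The sketch assumes the usual one-sided formulation, in which the p-value is the probability of observing at least as many annotated genes in the gene set as were found; the function names and the parameter names (`universe`, `annotated`, `drawn`, `overlap`) are hypothetical, and the paper does not state which implementation or multiple-testing treatment was used.

```python
from math import comb


def enrichment_pvalue(universe: int, annotated: int, drawn: int, overlap: int) -> float:
    """One-sided hypergeometric p-value P(X >= overlap).

    universe  -- number of genes in the background
    annotated -- number of background genes carrying the annotation
    drawn     -- number of genes in the tested gene set
    overlap   -- number of genes in the gene set carrying the annotation
    """
    upper = min(annotated, drawn)
    total = comb(universe, drawn)
    # comb returns 0 when more genes are requested than are available,
    # so impossible terms contribute nothing to the tail sum.
    tail = sum(
        comb(annotated, i) * comb(universe - annotated, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def is_enriched(universe: int, annotated: int, drawn: int, overlap: int,
                alpha: float = 0.01) -> bool:
    """Call an annotation enriched when the p-value is below the threshold."""
    return enrichment_pvalue(universe, annotated, drawn, overlap) < alpha
```

For example, with a background of 10 genes of which 4 are annotated, a gene set of 3 genes containing 2 annotated ones gives a p-value of 40/120, which is far above the 0.01 threshold and would not be called enriched.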
An annotation can be enriched in many reported gene sets. We sorted the annotations by the number of subgraphs in which they are enriched. Table 5 shows the biological signatures that are enriched in the largest number of gene sets of the maximal frequent subgraphs for $S_{min} = 20$.

Frequent coexpression subnetworks for breast cancer stages
We constructed gene coexpression networks from breast cancer gene expression samples in the TCGA portal. We downloaded 1,310 RNA-seq samples; 113 of these samples are Solid Tissue Normal (used as control) and the remaining 1,197 samples are Primary Solid Tumor. The cancer samples belong to four different stages. For each stage, we extracted the differentially expressed genes (DEGs) between the cancer samples of that stage and the control samples, using |log(FC)| > 2 and corrected p-value < 0.05 as the cutoffs. We then constructed the coexpression network among the DEGs for each cancer stage. A pair of differentially expressed genes is considered coexpressed if the absolute value of their Pearson correlation coefficient (PCC) is at least 0.7. Table 6 summarizes the number of samples, DEGs, and coexpression links for each cancer stage. There are 1,176 common DEGs that are dysregulated in all the stages and 2,394 unique DEGs across all the stages. Moreover, there are 9,677 common links that appear in all four coexpression networks, and 81,204 unique links across all the networks.
We mined the maximal frequent subnetworks for support thresholds of 2, 3 and 4. Table 7 shows the topological and biological enrichment analysis of the reported subnetworks. For the biological enrichment analysis, we used the following biological collections from the Molecular Signatures Database. Table 8 shows the top enriched oncogenic signatures in the gene sets of the maximal frequent subnetworks for varying support thresholds. Note that the number of enriched signatures can differ from the number of enriched subnetworks, as the gene set of a subnetwork can be enriched with several signatures. The table also shows the number of subnetworks in which each signature is enriched. For $S_{min} = 2$, the most enriched signature is RB_P130_DN.V1_UP. This oncogenic signature represents genes up-regulated in primary keratinocytes from RB1 and RBL2 skin-specific knockout mice [23].
The RB1 gene has a role in proliferation and apoptosis and the alteration of RB1 underlies both cancer development and resistance to therapy [24]. Mutational loss of RB1 has been linked to the development of breast cancer [25]. There are three oncogenic signatures that are highly enriched in the gene sets of the subnetworks for all the support thresholds.
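The coexpression criterion used to build each stage-specific network, an absolute Pearson correlation of at least 0.7 between two DEGs, can be sketched as follows. The function names and the toy expression profiles are hypothetical; a real analysis would start from the TCGA expression matrix restricted to the DEGs, and the sketch assumes that no gene has a constant profile.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long profiles.

    Assumption: neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def coexpression_edges(
    expr: Dict[str, Sequence[float]], threshold: float = 0.7
) -> List[Tuple[str, str]]:
    """Link every pair of genes whose |PCC| reaches the threshold."""
    genes = sorted(expr)
    return [
        (g, h)
        for g, h in combinations(genes, 2)
        if abs(pearson(expr[g], expr[h])) >= threshold
    ]


# Hypothetical expression profiles of four genes over four samples.
toy_expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, -1.0, -1.0, 1.0],
}
toy_edges = coexpression_edges(toy_expr)
```

Because the criterion uses the absolute value of the correlation, strongly anti-correlated pairs such as `g1` and `g3` are linked as well, while `g4`, which is uncorrelated with the others, remains isolated.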

Conclusion
Frequent coexpression subnetworks have been shown to be effective in functional annotation and subnetwork biomarker discovery. We proposed a reverse search algorithm for mining maximal frequent subgraphs. We first proposed a reverse search strategy for enumerating all connected edge-induced subgraphs of a single graph. The enumeration approach is then employed for mining frequent and maximal frequent subgraphs. To eliminate search branches that will not result in maximal frequent subgraphs, we proposed pruning strategies that exploit the order in which branches are enumerated. The pruning strategies are possible because the reverse search enforces a strict definition of the order in which search nodes are enumerated. Experiments on gene coexpression datasets demonstrate the effectiveness of the proposed approaches. For mining maximal frequent subgraphs at low support thresholds, the proposed approach is thousands of times faster than the existing MULE algorithm. Enrichment analysis of the gene sets in the maximal frequent subgraphs reveals that maximal frequent coexpression subnetworks are enriched with known biological annotations.