Figure 7.9 shows a graph that contains 6 vertices and 11 edges along with one of its possible subgraphs. The subgraph, which is shown in Figure 7.9(b), contains only 4 of the 6 vertices and 4 of the 11 edges in the original graph.
(a) Labeled graph. (b) Subgraph.
Figure 7.9, Example of a subgraph.
Definition 7.5 (Support). Given a collection of graphs Q, the support for a subgraph g is defined as the fraction of all graphs that contain g as its subgraph, i.e.:
l {G , lgesGe, G, i ,e8 } l t0 l
s(g) : (7.2)
444 Chapter 7 Association Analysis: Advanced Concepts
Subgraph 9., a O ‘ – { ‘ e
support = 80%
Subgraph g,
af,J___ad
/ t O e
support = 60%
Graph Data Set support = 40%
Figure 7.10. Computing the support of a subgraph from a set of graphs.
Example 7.2. Consider the five graphs, G1 through Gr, shown in Figure 7.10. The graph pr shown on the top right-hand diagram is a subgraph of G1, Gs, G+, and Gs. Therefore, s(gr) : 415 : 80%. Similarly, we can show that s(gz) :60% because 92 is a subgraph of G1, G2, and G3, while s(gs) : lO% because 93 is a subgraph of G1 and G3. I
7.5.2 FYequent Subgraph Mining
This section presents a formal definition of the frequent subgraph mining prob- Iem and illustrates the complexity of this task.
Definition 7.6 (Flequent Subgraph Mining). Given a set of graphs f and a support threshold, m’insup, the goal of frequent subgraph mining is to find all subgraphs g such that s(9) > rnxnsup.
While this formulation is generally applicable to any type of graph, the discussion presented in this chapter focuses primarily on undirected, con- nected graphs. The definitions of these graphs are given below:
1. A graph is connected ifthere exists a path between every pair ofvertices in the graph, in which a path is a sequence of vertic€s 1u1u2…ute )
7.5 Subgraph Patterns 445
such that there is an edge connecting between every pair of adjacent vertices (ro,r*r) in the sequence.
2. A graph is undirected if it contains only undirected edges. An edge (ut,ui) is undirected if it is indistinguishable from (ui,ut).
Methods for handling other types of subgraphs (directed or disconnected) are left as an exercise to the readers (see Exercise 15 on page 482).
Mining frequent subgraphs is a computationally expensive task because of the exponential scale of the search space. To illustrate the complexity of this task, consider a data set that contains d entities. In frequent itemset mining, each entity is an item and the size of the search space to be explored is 2d, which is the number of candidate itemsets that can be generated. In frequent subgraph mining, each entity is a vertex and can have up to d – 1 edges to other vertices. Assuming that the vertex labels are unique, the total number of subgraphs is
x 2i(i-r)/2 )
where (f) is tne number of ways to choose i vertices to form a subgraph and
2i(i-L)12 is the maximum number of edges between vertices. Table 7.8 compares the number of itemsets and subgraphs for different values of d.
Table 7,8. A comparison between number of itemsets and subgraphs for different dimensionality, d.
Number of entities, d I 2 J A ( f) 7 8
Number of itemsets 2 ,4 8 16 32 o4 t28 256 Number of subgraphs 2 18 113 1,450 40,069 2,350,602 28,619,25L3
The number of candidate subgraphs is actually much smaller because the numbers given in Table 7.8 include subgraphs that are disconnected. Discon- nected subgraphs are usually ignored because they are not as interesting as connected subgraphs.
A brute-force method for doing this is to generate all connected subgraphs as candidates and count their respective supports. For example, consider the graphs shown in Figure 7.11(a). Assuming that the vertex labels are chosen from the set {a, b} and the edge labels are chosen from the set {p, q}, the list of connected subgraphs with one vertex up to three vertices is shown in Figure 7.11(b). The number of candidate subgraphs is considerably larger than the
d , , u
I(;)
446 Chapter 7 Association Analysis: Advanced Concepts
(a) Example of a graph data set
G4G2G 1
@@k=1
D n €_fi9
D ^ €HD
o ^ QHD
Q ^
€H9 o ^
€H9 o ^(9HU
k=3
(b) List of connected subgraphs.
Figure 7.’11, Brute{orce method for mining frequent subgraphs.
number of candidate itemsets in traditional association rule minine for the following reasons:
1. An item can appear at most once in an itemset, whereas a vertex label can appear more than once in a graph.
2. The same pair of vertex labels can have multiple choices of edge labels.
Given the large number of candidate subgraphs, a brute-force method may break down even for moderately sized graphs.
c+.4
\I’ €J
7.5 Subgraph Patterns 447
(a,b,p) (a,b,q) (a,b,r) (b,c,p) (b,c,q) (b,c,r) (d,e,0 G 1 1 0 0 0 0 1 0 G2 1 0 0 0 0 0 0 G3 0 0 1 1 0 0 0 G4 0 0 0 0 0 0 0
Figure 7,12, Mapping a collection of graph structures into market basket transactions.
7.5.3 Apriori-Iike Method
This section examines how an Apri.ori,-like algorithm can be developed for finding frequent subgraphs.
Data Tlansformation
One possible approach is to transform each graph into a transaction-like for- mat so that existing algorithms such as Apri,ori, can be applied. Figure 7.12 illustrates how to transform a collection of graphs into its equivalent market basket representation. In this representation, each combination of edge Ia- bel l(e) with its corresponding vertex labels, (I(ut),l(ui)), is mapped into an
“item.” The width of the “transaction” is given by the number of edges in the graph. Despite its simplicity, this approach works only if every edge in a graph has a unique combination of vertex and edge labels. Otherwise, such graphs
cannot be accurately modeled using this representation.
General Structure of the fYequent Subgraph Mining Algorithm
An Apriori,-like algorithm for mining frequent subgraphs consists of the fol- lowing steps:
1. Candidate generation, which is the process of merging pairs of fre- quent (k – l)-subgraphs to obtain a candidate k-subgraph.
G3G2G 1 G4
448 Chapter 7 Association Analysis: Advanced Concepts
2. Candidate pruning, which is the process of discarding all candidate k-subgraphs that contain infrequent (k – 1)-subgraphs.
3. Support counting, which is the process of counting the number of graphs in Q that contain each candidate.
4. Candidate elimination, which discards all candidate subgraphs whose support counts are less than minsup.
The specific details of these steps are discussed in the remainder of this section.
7.5.4 Candidate Generation
During candidate generation, a pair of frequent (k – l)-subgraphs are merged to form a candidate k-subgraph. The first question is how to define k, the size of a subgraph. In the example shown in Figure 7.7I, k refers to the number of vertices in the graph. This approach of iteratively expanding a subgraph by adding an extra vertex is known as vertex growing. Alternatively, k may refer to the number of edges in the graph. This approach of adding an extra edge to the existing subgraphs is known as edge growing.
To avoid generating duplicate candidates, we may impose an additional condition for merging, that the two (k – 1)-subgraphs must share a common (k-2)-subgraph. The common (k-2)-subgraph is known as their core. Below, we briefly describe the candidate generation procedure for both vertex-growing and edge-growing strategies.
Candidate Generation via Vertex Growing
Vertex growing is the process of generating a new candidate by adding a new vertex into an existing frequent subgraph. Before describing this approach, let us first consider the adjacency matrix representation of a graph. Each entry M(i,j) in the matrix contains either the label of the edge connecting between the vertices ui and uji or zero, if there is no edge between them. The vertex-growing approach can be viewed as the process of generating a k x k adjacency matrix by combining a pair of (k – 1) x (k – 1) adjacency matrices, as illustrated in Figure 7.13. G1 and G2 are two graphs whose adjacency matrices are given by M(GI) and M(G2), respectively. The core for the graphs is indicated by dashed lines in the diagram. The procedure for generating candidate subgraphs via vertex growing is presented next.
t – – – – _ _ – – _ – – – – l
7.5 Subgraph Patterns 449
G3 = merge (G1, G2)