FirefoxMetzger
FirefoxMetzger

Reputation: 3260

Compute clusters (transitive closure) on a bidirectional cyclic graph in Snowflake SQL

I have a bi-directional graph containing multiple clusters and non-trivial cycles, i.e. the graph contains cycles involving 3 or more nodes. The graph is represented as an edge table and the list of nodes is the set of unique nodes listed in the edge table.

A toy example of such a graph can be generated using the following SQL:

with clusters as (
    select 
        seq8() as index,
        index % 5 as five_mod,
        round((index - five_mod) / 5) as multiplier,
        index as left_node,
        case 
            when five_mod = 0 then 5 * multiplier
            when five_mod = 1 then 5 * multiplier
            when five_mod = 2 then 5 * multiplier
            when five_mod = 3 then 5 * (multiplier + 1)
            when five_mod = 4 then 5 * (multiplier + 1)
        end as right_node
    from table(generator(rowcount => 5 * 2 + 3))
)
, cycles as (
    select 
        seq8() as index,
        index % 5 as five_mod,
        round((index - five_mod) / 5) as multiplier,
        index as left_node,
        case 
            when five_mod = 1 then index + 1
            when five_mod = 3 then index + 1
        end as right_node
    from table(generator(rowcount => 5 * 2 + 3))
    where right_node is not null
)
, edges as (
    select left_node, right_node from clusters
    union all
    select left_node, right_node from cycles
)
select * from edges;

(The above creates a graph containing "bowtie-shaped" clusters of 5 elements. E.g., for the nodes 3, 4, 5, 6, 7 the nodes 3, 4, 6, 7 are all connected to node 5. Additionally, nodes 3, 4 and nodes 6, 7 are connected.)

The problem

I would like to identify clusters on this graph based on reachability, i.e., two nodes belong to the same cluster if there exists a path (sequence of connected nodes) between them. In other words, I want to build a new table that contains two columns (node_idx, cluster_idx) and one row per node which links the node to its cluster.

For the toy example above, the result would look like this:

node_idx cluster_idx
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 3
12 3

I know that this can be solved with a recursive SQL query. What I don't know is how to deal with the cycles in a way that (1) avoids infinite recursion, (2) scales to a large (>500M) number of nodes, and (3) is valid Snowflake SQL.

Does somebody know how to solve this problem?

Upvotes: 1

Views: 208

Answers (2)

Dave Welden
Dave Welden

Reputation: 1918

As @ravenspoint noted, this is more of a graph problem than a SQL problem. My recommendation would be to build a Snowpark for Python stored procedure using a graph library. Two such libraries present in the Snowpark repo are python-igraph and NetworkX. Here is a start on determining cliques with igraph. NetworkX also has you covered on cliques

While you could build a SQL procedure following the above recipe, those loop constructs are going to have horrible performance, particularly with the volume of data you are working with.

Upvotes: 1

ravenspoint
ravenspoint

Reputation: 20462

In graph theory, what you call a 'cluster' is usually termed a component or a clique.

In any case, a subset of vertices that are all reachable from each other and from which vertices outside the clique are unreachable.

The algorithm for finding these is straightforward.

  • Create empty list of cliques
  • LOOP A
    • Create empty clique
    • SELECT arbitrary unmarked vertex in graph
    • COPY selected vertex into clique
    • MARK selected vertex
    • LOOP B over unmarked vertices in graph
      • SET found FALSE
      • LOOP C over vertices in clique
        • IF B connected to C
          • COPY selected vertex into clique
          • MARK selected vertex
          • SET found TRUE
          • BREAK out of LOOP B
        • IF found == FALSE
          • BREAK out of LOOP C
      • IF found == FALSE
        • ADD clique to list of cliques
      • IF all vertices marked
        • OUTPUT list of cliques
        • STOP
  • ENDLOOP A

I have no idea how you would implement that in pure SQL. I recommend loading the adjacency lists from your database into your favorite graph theory library and either using a library method to find components, if available, or coding the algorithm using library methods. This will be a lot easier and faster!

Upvotes: 1

Related Questions