Compute clusters (transitive closure) on a bidirectional cyclic graph in Snowflake SQL

I have a bi-directional graph containing multiple clusters and non-trivial cycles, i.e. the graph contains cycles involving 3 or more nodes. The graph is represented as an edge table and the list of nodes is the set of unique nodes listed in the edge table.

A toy example of such a graph can be generated using the following SQL:

with clusters as (
    select 
        seq8() as index,
        index % 5 as five_mod,
        round((index - five_mod) / 5) as multiplier,
        index as left_node,
        case 
            when five_mod = 0 then 5 * multiplier
            when five_mod = 1 then 5 * multiplier
            when five_mod = 2 then 5 * multiplier
            when five_mod = 3 then 5 * (multiplier + 1)
            when five_mod = 4 then 5 * (multiplier + 1)
        end as right_node
    from table(generator(rowcount => 5 * 2 + 3))
)
, cycles as (
    select 
        seq8() as index,
        index % 5 as five_mod,
        round((index - five_mod) / 5) as multiplier,
        index as left_node,
        case 
            when five_mod = 1 then index + 1
            when five_mod = 3 then index + 1
        end as right_node
    from table(generator(rowcount => 5 * 2 + 3))
    where right_node is not null
)
, edges as (
    select left_node, right_node from clusters
    union all
    select left_node, right_node from cycles
)
select * from edges;

(The above creates a graph containing "bowtie-shaped" clusters of 5 elements. E.g., for the nodes 3, 4, 5, 6, 7 the nodes 3, 4, 6, 7 are all connected to node 5. Additionally, nodes 3, 4 and nodes 6, 7 are connected.)

The problem

I would like to identify clusters on this graph based on reachability, i.e., two nodes belong to the same cluster if there exists a path (sequence of connected nodes) between them. In other words, I want to build a new table that contains two columns (node_idx, cluster_idx) and one row per node which links the node to its cluster.

For the toy example above, the result would look like this:

node_idx	cluster_idx
0	1
1	1
2	1
3	2
4	2
5	2
6	2
7	2
8	3
9	3
10	3
11	3
12	3

I know that this can be solved with a recursive SQL query. What I don't know is how to deal with the cycles in a way that (1) avoids infinite recursion, (2) scales to a large (>500M) number of nodes, and (3) is valid Snowflake SQL.

Does somebody know how to solve this problem?

Upvotes: 1

Answers (2)

Dave Welden

Reputation: 1918

As @ravenspoint noted, this is more of a graph problem than a SQL problem. My recommendation would be to build a Snowpark for Python stored procedure using a graph library. Two such libraries present in the Snowpark repo are python-igraph and NetworkX. Here is a start on determining cliques with igraph. NetworkX also has you covered on cliques

While you could build a SQL procedure following the above recipe, those loop constructs are going to have horrible performance, particularly with the volume of data you are working with.

Upvotes: 1

ravenspoint

Reputation: 20462

In graph theory, what you call a 'cluster' is usually termed a component or a clique.

In any case, a subset of vertices that are all reachable from each other and from which vertices outside the clique are unreachable.

The algorithm for finding these is straightforward.

Create empty list of cliques
LOOP A
- Create empty clique
- SELECT arbitrary unmarked vertex in graph
- COPY selected vertex into clique
- MARK selected vertex
- LOOP B over unmarked vertices in graph
  - SET found FALSE
  - LOOP C over vertices in clique
    - IF B connected to C
      - COPY selected vertex into clique
      - MARK selected vertex
      - SET found TRUE
      - BREAK out of LOOP B
    - IF found == FALSE
      - BREAK out of LOOP C
  - IF found == FALSE
    - ADD clique to list of cliques
  - IF all vertices marked
    - OUTPUT list of cliques
    - STOP
ENDLOOP A

I have no idea how you would implement that in pure SQL. I recommend loading the adjacency lists from your database into your favorite graph theory library and either using a library method to find components, if available, or coding the algorithm using library methods. This will be a lot easier and faster!

Upvotes: 1

Compute clusters (transitive closure) on a bidirectional cyclic graph in Snowflake SQL

Answers (2)

Related Questions