Reputation: 3260
I have a bi-directional graph containing multiple clusters and non-trivial cycles, i.e. the graph contains cycles involving 3 or more nodes. The graph is represented as an edge table and the list of nodes is the set of unique nodes listed in the edge table.
A toy example of such a graph can be generated using the following SQL:
with clusters as (
select
seq8() as index,
index % 5 as five_mod,
round((index - five_mod) / 5) as multiplier,
index as left_node,
case
when five_mod = 0 then 5 * multiplier
when five_mod = 1 then 5 * multiplier
when five_mod = 2 then 5 * multiplier
when five_mod = 3 then 5 * (multiplier + 1)
when five_mod = 4 then 5 * (multiplier + 1)
end as right_node
from table(generator(rowcount => 5 * 2 + 3))
)
, cycles as (
select
seq8() as index,
index % 5 as five_mod,
round((index - five_mod) / 5) as multiplier,
index as left_node,
case
when five_mod = 1 then index + 1
when five_mod = 3 then index + 1
end as right_node
from table(generator(rowcount => 5 * 2 + 3))
where right_node is not null
)
, edges as (
select left_node, right_node from clusters
union all
select left_node, right_node from cycles
)
select * from edges;
(The above creates a graph containing "bowtie-shaped" clusters of 5 elements. E.g., for the nodes 3, 4, 5, 6, 7 the nodes 3, 4, 6, 7 are all connected to node 5. Additionally, nodes 3, 4 and nodes 6, 7 are connected.)
The problem
I would like to identify clusters on this graph based on reachability, i.e., two nodes belong to the same cluster if there exists a path (sequence of connected nodes) between them. In other words, I want to build a new table that contains two columns (node_idx, cluster_idx) and one row per node which links the node to its cluster.
For the toy example above, the result would look like this:
node_idx | cluster_idx |
---|---|
0 | 1 |
1 | 1 |
2 | 1 |
3 | 2 |
4 | 2 |
5 | 2 |
6 | 2 |
7 | 2 |
8 | 3 |
9 | 3 |
10 | 3 |
11 | 3 |
12 | 3 |
I know that this can be solved with a recursive SQL query. What I don't know is how to deal with the cycles in a way that (1) avoids infinite recursion, (2) scales to a large (>500M) number of nodes, and (3) is valid Snowflake SQL.
Does somebody know how to solve this problem?
Upvotes: 1
Views: 208
Reputation: 1918
As @ravenspoint noted, this is more of a graph problem than a SQL problem. My recommendation would be to build a Snowpark for Python stored procedure using a graph library. Two such libraries present in the Snowpark repo are python-igraph and NetworkX. Here is a start on determining cliques with igraph. NetworkX also has you covered on cliques
While you could build a SQL procedure following the above recipe, those loop constructs are going to have horrible performance, particularly with the volume of data you are working with.
Upvotes: 1
Reputation: 20462
In graph theory, what you call a 'cluster' is usually termed a component or a clique.
In any case, a subset of vertices that are all reachable from each other and from which vertices outside the clique are unreachable.
The algorithm for finding these is straightforward.
I have no idea how you would implement that in pure SQL. I recommend loading the adjacency lists from your database into your favorite graph theory library and either using a library method to find components, if available, or coding the algorithm using library methods. This will be a lot easier and faster!
Upvotes: 1