Reputation: 1431
How to partition a set of links in a link_tbl defined by the from_node
and to_node
. There are 10M nodes and 25M links (no more than 20 links per node).
The graph consists of three disjoint subgraphs.
create table link_tbl as (
select 'AY' as linkid, 'A' as from_node, 'Y' as to_node from dual union all
select 'AB', 'A', 'B' from dual union all
select 'CA', 'C', 'A' from dual union all
select 'GE', 'G', 'E' from dual union all
select 'CB', 'C', 'B' from dual union all
select 'EF', 'E', 'F' from dual union all
select 'NM', 'N', 'M' from dual
);
--compute subnetid
select * from link_tbl order by subnetid;
To get a result set with subnetid
values?
I can use a variant of flood fill in Java to assign subgraph id to each node of a graph. But can this be done in SQL?
Pseudocode:
- Rank nodes with consecutive integers.
- Represent links as (node1*100M + node2) as number(19,0)
- Union links node1,node2 with node2, node1 and sort
- Get first node as an anchor node
and add to subgraph_nodes'
- Iterate over link table
-
add node2to subgraph_nodes if node1 is in (subgraph_nodes) AND node2 not in subgraph_nodes
This should add all connected nodes to subgraph_nodes
This will be sufficient because now I can add subgraph_id' to node table and select all nodes without subgraph id
and repeat subgraph analysis.
There are 10M nodes represented as consecutive integers and 25M links (no more than 20 links per node), represented as (from_node*100M + to_node) as ID, from_node, to_node
.
Upvotes: 2
Views: 534
Reputation: 32695
A while ago I wrote an answer to the question How to find all connected subgraphs of an undirected graph. It was written for SQL Server, but Oracle supports standard recursive queries, so it was easy to convert it to Oracle. It may be written more efficiently using constructs specific to Oracle.
Sample data
create table link_tbl as (
select 'AY' as linkid, 'A' as from_node, 'Y' as to_node from dual union all
select 'AB', 'A', 'B' from dual union all
select 'CA', 'C', 'A' from dual union all
select 'GE', 'G', 'E' from dual union all
select 'CB', 'C', 'B' from dual union all
select 'EF', 'E', 'F' from dual union all
select 'NM', 'N', 'M' from dual
);
Query
WITH
CTE_Nodes
AS
(
SELECT from_node AS Node
FROM link_tbl
UNION
SELECT to_node AS Node
FROM link_tbl
)
,CTE_Pairs
AS
(
SELECT from_node AS Node1, to_node AS Node2
FROM link_tbl
WHERE from_node <> to_node
UNION
SELECT to_node AS Node1, from_node AS Node2
FROM link_tbl
WHERE from_node <> to_node
)
,CTE_Recursive (AnchorNode, Node1, Node2, NodePath, Lvl)
AS
(
SELECT
CAST(CTE_Nodes.Node AS varchar(2000)) AS AnchorNode
, Node1
, Node2
, CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000)) AS NodePath
, 1 AS Lvl
FROM
CTE_Pairs
INNER JOIN CTE_Nodes ON CTE_Nodes.Node = CTE_Pairs.Node1
UNION ALL
SELECT
CTE_Recursive.AnchorNode
, CTE_Pairs.Node1
, CTE_Pairs.Node2
, CAST(CTE_Recursive.NodePath || CTE_Pairs.Node2 || ',' AS varchar(2000)) AS NodePath
, CTE_Recursive.Lvl + 1 AS Lvl
FROM
CTE_Pairs
INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = CTE_Pairs.Node1
WHERE
CTE_Recursive.NodePath NOT LIKE CAST('%,' || CTE_Pairs.Node2 || ',%' AS varchar(2000))
)
,CTE_RecursionResult
AS
(
SELECT AnchorNode, Node1, Node2
FROM CTE_Recursive
)
,CTE_CleanResult
AS
(
SELECT AnchorNode, Node1 AS Node
FROM CTE_RecursionResult
UNION
SELECT AnchorNode, Node2 AS Node
FROM CTE_RecursionResult
)
SELECT
CTE_Nodes.Node
,LISTAGG(CTE_CleanResult.Node, ',') WITHIN GROUP (ORDER BY CTE_CleanResult.Node) AS GroupMembers
,DENSE_RANK() OVER (ORDER BY LISTAGG(CTE_CleanResult.Node, ',') WITHIN GROUP (ORDER BY CTE_CleanResult.Node)) AS GroupID
FROM
CTE_Nodes
INNER JOIN CTE_CleanResult ON CTE_CleanResult.AnchorNode = CTE_Nodes.Node
GROUP BY
CTE_Nodes.Node
ORDER BY
GroupID
,CTE_Nodes.Node
;
Result
+------+--------------+---------+
| NODE | GROUPMEMBERS | GROUPID |
+------+--------------+---------+
| A | A,B,C,Y | 1 |
| B | A,B,C,Y | 1 |
| C | A,B,C,Y | 1 |
| Y | A,B,C,Y | 1 |
| E | E,F,G | 2 |
| F | E,F,G | 2 |
| G | E,F,G | 2 |
| M | M,N | 3 |
| N | M,N | 3 |
+------+--------------+---------+
https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=e61cf73824e7718a4686430ccd7398e7
CTE_Nodes
CTE_Nodes
gives the list of all Nodes that appear in both from_node
and to_node
columns.
Since they can appear in any order we UNION
both columns together. UNION
also removes any duplicates.
+------+
| NODE |
+------+
| A |
| B |
| C |
| E |
| F |
| G |
| M |
| N |
| Y |
+------+
CTE_Pairs
CTE_Pairs
gives the list of all edges of the graph in both directions. Again, UNION
is used to remove any duplicates.
+-------+-------+
| NODE1 | NODE2 |
+-------+-------+
| A | B |
| A | C |
| A | Y |
| B | A |
| B | C |
| C | A |
| C | B |
| E | F |
| E | G |
| F | E |
| G | E |
| M | N |
| N | M |
| Y | A |
+-------+-------+
CTE_Recursive
CTE_Recursive
is the main part of the query that recursively traverses the graph starting from each unique Node.
These starting rows are produced by the first part of UNION ALL
.
The second part of UNION ALL
recursively joins to itself linking Node2
to Node1
.
Since we pre-made CTE_Pairs
with all edges written in both directions, we can always link only Node2
to Node1
and we'll get all paths in the graph.
At the same time the query builds NodePath
- a string of comma-delimited Nodes that have been traversed so far.
It is used in the WHERE
filter:
CTE_Recursive.NodePath NOT LIKE CAST('%,' || CTE_Pairs.Node2 || ',%' AS varchar(2000))
As soon as we come across the Node that had been included in the Path before, the recursion stops as the list of connected nodes is exhausted.
AnchorNode
is the starting Node for the recursion, it will be used later to group results.
Lvl
is not really used, I included it for better understanding of what is going on.
+------------+-------+-------+-----------+-----+
| ANCHORNODE | NODE1 | NODE2 | NODEPATH | LVL |
+------------+-------+-------+-----------+-----+
| A | A | Y | ,A,Y, | 1 |
| A | A | C | ,A,C, | 1 |
| A | A | B | ,A,B, | 1 |
| B | B | C | ,B,C, | 1 |
| B | B | A | ,B,A, | 1 |
| C | C | B | ,C,B, | 1 |
| C | C | A | ,C,A, | 1 |
| E | E | G | ,E,G, | 1 |
| E | E | F | ,E,F, | 1 |
| F | F | E | ,F,E, | 1 |
| G | G | E | ,G,E, | 1 |
| M | M | N | ,M,N, | 1 |
| N | N | M | ,N,M, | 1 |
| Y | Y | A | ,Y,A, | 1 |
| Y | A | B | ,Y,A,B, | 2 |
| C | A | B | ,C,A,B, | 2 |
| Y | A | C | ,Y,A,C, | 2 |
| B | A | C | ,B,A,C, | 2 |
| C | A | Y | ,C,A,Y, | 2 |
| B | A | Y | ,B,A,Y, | 2 |
| C | B | A | ,C,B,A, | 2 |
| A | B | C | ,A,B,C, | 2 |
| B | C | A | ,B,C,A, | 2 |
| A | C | B | ,A,C,B, | 2 |
| G | E | F | ,G,E,F, | 2 |
| F | E | G | ,F,E,G, | 2 |
| B | A | Y | ,B,C,A,Y, | 3 |
| C | A | Y | ,C,B,A,Y, | 3 |
| Y | B | C | ,Y,A,B,C, | 3 |
| Y | C | B | ,Y,A,C,B, | 3 |
+------------+-------+-------+-----------+-----+
CTE_CleanResult
CTE_CleanResult
leaves only relevant parts from CTE_Recursive
and again merges both Node1
and Node2
using UNION
.
+------------+------+
| ANCHORNODE | NODE |
+------------+------+
| A | A |
| A | B |
| A | C |
| A | Y |
| B | A |
| B | B |
| B | C |
| B | Y |
| C | A |
| C | B |
| C | C |
| C | Y |
| E | E |
| E | F |
| E | G |
| F | E |
| F | F |
| F | G |
| G | E |
| G | F |
| G | G |
| M | M |
| M | N |
| N | M |
| N | N |
| Y | A |
| Y | B |
| Y | C |
| Y | Y |
+------------+------+
Final SELECT
Now we need to build a string of comma-separated Node
values for each AnchorNode
.
LISTAGG
does it.
DENSE_RANK()
calculates the GroupID
numbers for each AnchorNode
.
You have rather large tables, so trying to find all groups at once in a single query above may be quite inefficient.
One way to make it more efficient is not to use a single query for the whole dataset. Don't try to find all subnets/groups at the same time. Limit the starting point to one node. Add WHERE CTE_Nodes.Node = 'some node'
to the first part of CTE_Recursive
. The query will find all nodes of one subnet. Delete these found nodes from the big table, pick another starting node, repeat in a loop until the big table is empty. If you materialise CTE_Nodes
and CTE_Pairs
as temp tables with indexes, it may help as well.
I have never worked with Oracle and don't know its syntax for procedural code, so I'll write in a pseudocode below.
Prepare temp tables
CREATE TABLE Nodes AS
(
SELECT from_node AS Node
FROM link_tbl
UNION
SELECT to_node AS Node
FROM link_tbl
);
CREATE INDEX IX_Node ON Nodes (Node);
CREATE TABLE Pairs AS
(
SELECT from_node AS Node1, to_node AS Node2
FROM link_tbl
WHERE from_node <> to_node
UNION
SELECT to_node AS Node1, from_node AS Node2
FROM link_tbl
WHERE from_node <> to_node
);
CREATE INDEX IX_Node1 ON Pairs (Node1);
CREATE INDEX IX_Node2 ON Pairs (Node2);
CREATE TABLE Subgraph AS
(
SELECT Node FROM Nodes WHERE 1=0
);
CREATE TABLE Result
(
GroupID int NOT NULL,
Node varchar(10) NOT NULL
);
SET :GroupID = 0;
The main loop starts
Pick one Node
from Nodes
. Any row will do. Save the Node
into variable. Again, I don't know proper Oracle syntax for it.
SELECT :N = Node FROM Nodes WHERE rownum=1;
If Nodes
is empty, stop the loop.
SET :GroupID = :GroupID + 1;
Run the recursive query starting the recursion from one specific Node
chosen above.
INSERT INTO Subgraph (Node)
WITH
CTE_Recursive (AnchorNode, Node1, Node2, NodePath, Lvl)
AS
(
SELECT
CAST(Nodes.Node AS varchar(2000)) AS AnchorNode
, Node1
, Node2
, CAST(',' || Node1 || ',' || Node2 || ',' AS varchar(2000)) AS NodePath
, 1 AS Lvl
FROM
Pairs
INNER JOIN Nodes ON Nodes.Node = Pairs.Node1
WHERE
Nodes.Node = :N -- 'A'
-- use variable here, don't know what the syntax is
UNION ALL
SELECT
CTE_Recursive.AnchorNode
, Pairs.Node1
, Pairs.Node2
, CAST(CTE_Recursive.NodePath || Pairs.Node2 || ',' AS varchar(2000)) AS NodePath
, CTE_Recursive.Lvl + 1 AS Lvl
FROM
Pairs
INNER JOIN CTE_Recursive ON CTE_Recursive.Node2 = Pairs.Node1
WHERE
CTE_Recursive.NodePath NOT LIKE CAST('%,' || Pairs.Node2 || ',%' AS varchar(2000))
)
,CTE_Result
AS
(
SELECT Node1 AS Node
FROM CTE_Recursive
UNION
SELECT Node2 AS Node
FROM CTE_Recursive
)
SELECT Node FROM CTE_Result
;
This query will return all nodes that are connected to the given starting node, i.e. those that form a subgraph.
Save its result set into a Subgraph
temp table.
Append result into the final Result
table and allocate an ID to the found subgraph.
INSERT INTO Result (GroupID, Node)
SELECT :GroupID, Node
FROM Subgraph;
Delete processed nodes from Nodes
and Pairs
.
DELETE FROM Nodes
WHERE Node IN (SELECT Node FROM Subgraph)
;
DELETE FROM Pairs
WHERE
Node1 IN (SELECT Node FROM Subgraph)
OR
Node2 IN (SELECT Node FROM Subgraph)
;
Clean the Subgraph
table
DELETE FROM Subgraph;
Return to the start of the loop.
In the end Result
table will have all nodes with the corresponding ID of the subgraph.
Actually, you can simplify it further. You don't need the Nodes
table, Pairs
is enough.
Upvotes: 5