Reputation: 8762
I'm working with a dataset in SQL Server that represents paths between nodes, where each node has a visibility flag (IsVisible). My goal is to compress these paths by skipping over nodes marked as not visible (IsVisible = 0) and directly connecting the visible nodes. For example, given a path like A -> B(false) -> C(false) -> D, I want to transform it into A -> D.
Each path is represented as a pair of start and end nodes in my table, along with visibility flags for both the start and end nodes. Here's the structure of my table and some example data:
CREATE TABLE Paths (
StartNode CHAR(1),
EndNode CHAR(1),
StartNodeIsVisible BIT,
EndNodeIsVisible BIT
);
INSERT INTO Paths (StartNode, EndNode, StartNodeIsVisible, EndNodeIsVisible) VALUES
('A', 'B', 1, 0),
('B', 'C', 0, 0),
('C', 'D', 0, 1);
('D', 'E', 1, 1);
Desired result:
| StartNode | EndNode | Depth(optional) |
|-----------|---------|-----------------|
| A | D | 3 |
| D | E | 1? |
I attempted to use a recursive Common Table Expression (CTE) to achieve this path compression, but I'm encountering difficulties in correctly skipping over the non-visible nodes and directly connecting the visible ones. Here's what I've tried:
WITH RecursivePaths AS (
SELECT StartNode, EndNode, StartNodeIsVisible, EndNodeIsVisible, 1 AS Depth
FROM Paths
WHERE StartNodeIsVisible = 1
UNION ALL
SELECT rp.StartNode, p.EndNode, rp.StartNodeIsVisible, p.EndNodeIsVisible, Depth + 1
FROM RecursivePaths rp
JOIN Paths p ON rp.EndNode = p.StartNode
WHERE p.EndNodeIsVisible = 1 OR (p.EndNodeIsVisible = 0 AND Depth < 50)
)
SELECT *
FROM RecursivePaths
WHERE EndNodeIsVisible = 1
OPTION (MAXRECURSION 200);
I've added the OPTION (MAXRECURSION 200)
because I was getting:
The statement terminated. The maximum recursion 100 has been exhausted before statement completion.
While this code works for small dataset, in scale of ~15,000,000 rows in paths table, it making a lot of reads: ~600,000,000 (I've needed to stopped the it in the middle).
Questions:
Upvotes: 0
Views: 131
Reputation: 9272
Here's an alternative suggestion, which makes use of Graph tables, which is a feature available in SQL Server 2017 i think.
Graph tables allows inferring relationships between rows, path traversals etc etc, you can read about them here https://learn.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-overview?view=sql-server-ver16
Graph tables have nodes, which is what you already have and edges which is connection between nodes, you have an implied connection inside the table, but we need a separate table which is very easy to create.
I cannot test the performance of this, but it might be interesting to try out on your real data.
Table population script:
SELECT *
INTO #t
FROM (
VALUES
('A', 'B', 1, 0),
('B', 'C', 0, 0),
('C', 'D', 0, 1),
('D', 'E', 1, 1),
('B', 'Q', 1, 1),
('L', 'M', 1, 0),
('M', 'Q', 0, 1)
) x (Node, Node2, StartVisible, EndVisible)
CREATE TABLE Paths (
ID INTEGER IDENTITY PRIMARY KEY, name VARCHAR(10), name_orig varchar(10), start bit)
AS NODE;
CREATE TABLE connects (start bit, finish bit) AS EDGE;
INSERT INTO Paths
SELECT concat(node, NULLIF(row_number() OVER(partition BY node ORDER BY startvisible DESC), 1)), node, startvisible
FROM (
SELECT node, startvisible
FROM #t
UNION
SELECT node2, 0
FROM #t
) x
INSERT INTO connects
SELECT pf.$node_id, pT.$node_id, t.startVisible, EndVisible
FROM #t t
INNER JOIN Paths pF
ON pF.name_orig = t.Node
AND pf.start = t.startVisible
INNER JOIN Paths pT
ON pT.name_orig= t.Node2
AND pT.start = 0
Node | Node2 | StartVisible | EndVisible |
---|---|---|---|
A | B | 1 | 0 |
B | C | 0 | 0 |
C | D | 0 | 1 |
D | E | 1 | 1 |
B | Q | 1 | 1 |
L | M | 1 | 0 |
M | Q | 0 | 1 |
I have added some more Nodes just to make things more interesting. Then i have populated Paths with all unique nodes.
One little wrinkle is that every start node must have an own unique row, so if you have multiple D-nodes, we need to have two rows, one for D-start, and one for D-nonstart, this is to disambiguate it later.
Your structure cannot have multiple D start nodes, because then you won't know which way to go. If you do have it, then it's probably possible to implement as well, but this code might not handle it correctly.
After nodes, we need to populate connections, for this one have to use a magic column called table.$node_id
which contains the path ID of each Node. Here, i also need to save info about the connection, because you're only interested in the last jump, for this reason we save the finish
/EndVisible
column.
Finally, the actual graph query:
SELECT *
FROM (
SELECT Paths1.name AS NodeStart
, paths1.start AS started
, STRING_AGG(Paths2.name_orig, '->') WITHIN GROUP (GRAPH PATH) AS NodePath
, LAST_VALUE(Paths2.name_orig) WITHIN GROUP (GRAPH PATH) AS NodeLast
, COUNT(Paths2.name_orig) WITHIN GROUP (GRAPH PATH) AS levels
, LAST_VALUE(fo.finish) WITHIN GROUP (GRAPH PATH) AS Finished
FROM Paths AS Paths1,
connects FOR PATH AS fo,
Paths FOR PATH AS Paths2
WHERE MATCH(SHORTEST_PATH(Paths1(-(fo)->Paths2)+))
) AS Q
WHERE started = 1
AND finished = 1
Graph table uses this weird syntax of arrows to define the "flow" of the data.
MATCH(SHORTEST_PATH(Paths1(-(fo)->Paths2)+))
means we want to have all relations between any Path1 and other Paths2 connected by connects(fo)
table. The magic of graph tables is that any number of jumps is followed here due to the +
-sign in the arrow query.
We also gather some useful data like LAST_VALUE(fo.finish) WITHIN GROUP (GRAPH PATH) AS Finished
which tell us if this long path ends with a EndNodeIsVisible = 1, which is what we want, as well as paths1.start AS started
which tells is if this path is from the beginning of the chain or not.
So finally, we make selection on both start and finish = 1 and this is what the query returns:
NodeStart | started | NodePath | NodeLast | levels | Finished |
---|---|---|---|---|---|
B | 1 | Q | Q | 1 | 1 |
D | 1 | E | E | 1 | 1 |
L | 1 | M->Q | Q | 2 | 1 |
A | 1 | B->C->D | D | 3 | 1 |
So, this returned us what we wanted, which is the path for each Start to End node.
Upvotes: 3
Reputation: 5157
Optimization suggestion:
If a substantial portion (~40%) of your nodes are invisible then you can first filter them out by copying only records with visible nodes into a new table.
SELECT *
INTO Paths2
FROM Paths
WHERE StartNodeIsVisible = 1 OR EndNodeIsVisible = 1
Then run your query against a new table.
Upvotes: 1
Reputation: 22811
You can check the end node visibility to stop the recursion
WITH RecursivePaths AS (
SELECT StartNode, EndNode, StartNodeIsVisible, EndNodeIsVisible, 1 AS Depth
FROM Paths
WHERE StartNodeIsVisible = 1
UNION ALL
SELECT rp.StartNode, p.EndNode, rp.StartNodeIsVisible, p.EndNodeIsVisible, Depth + 1
FROM RecursivePaths rp
JOIN Paths p ON rp.EndNode = p.StartNode AND rp.EndNodeIsVisible = 0
WHERE p.EndNodeIsVisible = 1 OR (p.EndNodeIsVisible = 0 AND Depth < 50)
)
SELECT *
FROM RecursivePaths
WHERE EndNodeIsVisible = 1
Upvotes: 0