Shahar Shokrani
Shahar Shokrani

Reputation: 8762

How to Compress Paths in SQL Server Using Recursive CTE with Node Visibility Conditions?

I'm working with a dataset in SQL Server that represents paths between nodes, where each node has a visibility flag (IsVisible). My goal is to compress these paths by skipping over nodes marked as not visible (IsVisible = 0) and directly connecting the visible nodes. For example, given a path like A -> B(false) -> C(false) -> D, I want to transform it into A -> D.

Each path is represented as a pair of start and end nodes in my table, along with visibility flags for both the start and end nodes. Here's the structure of my table and some example data:

CREATE TABLE Paths (
    StartNode CHAR(1),
    EndNode CHAR(1),
    StartNodeIsVisible BIT,
    EndNodeIsVisible BIT
);

INSERT INTO Paths (StartNode, EndNode, StartNodeIsVisible, EndNodeIsVisible) VALUES
('A', 'B', 1, 0),
('B', 'C', 0, 0),
('C', 'D', 0, 1);
('D', 'E', 1, 1);

Desired result:

| StartNode | EndNode | Depth(optional) |
|-----------|---------|-----------------|
| A         | D       |   3             |
| D         | E       |   1?            |

I attempted to use a recursive Common Table Expression (CTE) to achieve this path compression, but I'm encountering difficulties in correctly skipping over the non-visible nodes and directly connecting the visible ones. Here's what I've tried:

WITH RecursivePaths AS (
    SELECT StartNode, EndNode, StartNodeIsVisible, EndNodeIsVisible, 1 AS Depth
    FROM Paths
    WHERE StartNodeIsVisible = 1

    UNION ALL

    SELECT rp.StartNode, p.EndNode, rp.StartNodeIsVisible, p.EndNodeIsVisible, Depth + 1
    FROM RecursivePaths rp
    JOIN Paths p ON rp.EndNode = p.StartNode
    WHERE p.EndNodeIsVisible = 1 OR (p.EndNodeIsVisible = 0 AND Depth < 50)
)
SELECT *
FROM RecursivePaths
WHERE EndNodeIsVisible = 1
OPTION (MAXRECURSION 200);

I've added the OPTION (MAXRECURSION 200) because I was getting:

The statement terminated. The maximum recursion 100 has been exhausted before statement completion.

While this code works for small dataset, in scale of ~15,000,000 rows in paths table, it making a lot of reads: ~600,000,000 (I've needed to stopped the it in the middle).

Questions:

  1. How can I adjust my recursive CTE to correctly compress the paths by only including paths between visible nodes, effectively skipping any intermediate non-visible nodes?
  2. Is there a more efficient way to write this query to handle large datasets with many paths? (I'm open to every possible solution)

Upvotes: 0

Views: 131

Answers (3)

siggemannen
siggemannen

Reputation: 9272

Here's an alternative suggestion, which makes use of Graph tables, which is a feature available in SQL Server 2017 i think.

Graph tables allows inferring relationships between rows, path traversals etc etc, you can read about them here https://learn.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-overview?view=sql-server-ver16

Graph tables have nodes, which is what you already have and edges which is connection between nodes, you have an implied connection inside the table, but we need a separate table which is very easy to create.

I cannot test the performance of this, but it might be interesting to try out on your real data.

Table population script:

SELECT  *
INTO    #t
FROM    (
    VALUES
    ('A', 'B', 1, 0),
    ('B', 'C', 0, 0),
    ('C', 'D', 0, 1),
    ('D', 'E', 1, 1),
    ('B', 'Q', 1, 1),
    ('L', 'M', 1, 0),
    ('M', 'Q', 0, 1)
) x (Node, Node2, StartVisible, EndVisible)

CREATE TABLE Paths (
    ID INTEGER IDENTITY PRIMARY KEY, name VARCHAR(10), name_orig varchar(10), start bit)
AS NODE;
 

CREATE TABLE connects (start bit, finish bit) AS EDGE;

INSERT INTO Paths
SELECT  concat(node, NULLIF(row_number() OVER(partition BY node ORDER BY startvisible DESC), 1)), node, startvisible
FROM    (
    SELECT  node, startvisible
    FROM    #t
    UNION 
    SELECT  node2, 0
    FROM    #t
    ) x
    
INSERT INTO connects
SELECT  pf.$node_id, pT.$node_id, t.startVisible,  EndVisible
FROM    #t t
INNER JOIN Paths pF
    ON  pF.name_orig = t.Node
    AND pf.start = t.startVisible
INNER JOIN Paths pT
    ON  pT.name_orig= t.Node2
    AND pT.start = 0
Node Node2 StartVisible EndVisible
A B 1 0
B C 0 0
C D 0 1
D E 1 1
B Q 1 1
L M 1 0
M Q 0 1

I have added some more Nodes just to make things more interesting. Then i have populated Paths with all unique nodes.

One little wrinkle is that every start node must have an own unique row, so if you have multiple D-nodes, we need to have two rows, one for D-start, and one for D-nonstart, this is to disambiguate it later.

Your structure cannot have multiple D start nodes, because then you won't know which way to go. If you do have it, then it's probably possible to implement as well, but this code might not handle it correctly.

After nodes, we need to populate connections, for this one have to use a magic column called table.$node_id which contains the path ID of each Node. Here, i also need to save info about the connection, because you're only interested in the last jump, for this reason we save the finish/EndVisible column.

Finally, the actual graph query:

SELECT  *
FROM    (
    SELECT  Paths1.name AS NodeStart
        ,   paths1.start AS started
        ,   STRING_AGG(Paths2.name_orig, '->') WITHIN GROUP (GRAPH PATH) AS NodePath
        ,   LAST_VALUE(Paths2.name_orig) WITHIN GROUP (GRAPH PATH) AS NodeLast
        ,   COUNT(Paths2.name_orig) WITHIN GROUP (GRAPH PATH) AS levels
        ,   LAST_VALUE(fo.finish) WITHIN GROUP (GRAPH PATH) AS Finished
    FROM    Paths AS Paths1,
        connects FOR PATH AS fo,
        Paths FOR PATH AS Paths2
    WHERE   MATCH(SHORTEST_PATH(Paths1(-(fo)->Paths2)+))
    ) AS Q
WHERE   started = 1
AND finished = 1

Graph table uses this weird syntax of arrows to define the "flow" of the data. MATCH(SHORTEST_PATH(Paths1(-(fo)->Paths2)+)) means we want to have all relations between any Path1 and other Paths2 connected by connects(fo) table. The magic of graph tables is that any number of jumps is followed here due to the +-sign in the arrow query.

We also gather some useful data like LAST_VALUE(fo.finish) WITHIN GROUP (GRAPH PATH) AS Finished which tell us if this long path ends with a EndNodeIsVisible = 1, which is what we want, as well as paths1.start AS started which tells is if this path is from the beginning of the chain or not.

So finally, we make selection on both start and finish = 1 and this is what the query returns:

NodeStart started NodePath NodeLast levels Finished
B 1 Q Q 1 1
D 1 E E 1 1
L 1 M->Q Q 2 1
A 1 B->C->D D 3 1

So, this returned us what we wanted, which is the path for each Start to End node.

Upvotes: 3

Alex
Alex

Reputation: 5157

Optimization suggestion:

If a substantial portion (~40%) of your nodes are invisible then you can first filter them out by copying only records with visible nodes into a new table.

SELECT *
INTO Paths2
FROM Paths
WHERE StartNodeIsVisible = 1 OR EndNodeIsVisible = 1

Then run your query against a new table.

Upvotes: 1

Serg
Serg

Reputation: 22811

You can check the end node visibility to stop the recursion

WITH RecursivePaths AS (
    SELECT StartNode, EndNode, StartNodeIsVisible, EndNodeIsVisible, 1 AS Depth
    FROM Paths
    WHERE StartNodeIsVisible = 1

    UNION ALL

    SELECT rp.StartNode, p.EndNode, rp.StartNodeIsVisible, p.EndNodeIsVisible, Depth + 1
    FROM RecursivePaths rp
    JOIN Paths p ON rp.EndNode = p.StartNode AND rp.EndNodeIsVisible = 0
    WHERE p.EndNodeIsVisible = 1 OR (p.EndNodeIsVisible = 0 AND Depth < 50)
)
SELECT *
FROM RecursivePaths
WHERE EndNodeIsVisible = 1

db<>fiddle

Upvotes: 0

Related Questions