Reputation: 21

Recursive CTE performing slower than hard-coding the recursive actions

I have a list of notifications (notes) that get pinged to users for review. These notes can be linked to each other in a parent/child manner, where one notification is the child of an earlier notification. Not every note has a parent. In all, there are about 22,000 records.

Note  Parent Note   
1     NULL   
2     NULL
3     1
4     NULL  
5     NULL
6     3 
7     4

I would like to build a tree of these notifications, tracing back to the start to show just how deep the "child" notifications are, and how many other notifications are applied to this particular one. The desired output of the above table would look something like this.

Note  Parent Note  Level  Full List
1     NULL         1      1
2     NULL         1      2
3     1            2      1/3
4     NULL         1      4
5     NULL         1      5
6     3            3      1/3/6
7     4            2      4/7

Each time a new child notification is opened, the level increases by one, and it is concatenated to the previous list of notifications.

I tried to use a recursive CTE to accomplish this. Here's what the query looks like.

WITH CTE AS (
SELECT Note,
       Parent_Note,
       1 as Level,
       --Cast as nvarchar to keep data types the same
       CAST(QNote as nvarchar(MAX)) as Full_List
FROM Notifications
WHERE Parent_Note IS NULL
UNION ALL
SELECT Notifications.Note,
       CTE.Note as Parent_Note,
       Level = CTE.Level + 1,
       CAST(Full_List + '/' + Notifications.Note as nvarchar(MAX)) as Full_List
FROM CTE INNER JOIN Notifications ON CTE.Note = Notifications.Parent_Note)

SELECT * FROM CTE

Unfortunately, this query takes about 15 minutes to get through 5 records at the second "level" of notifications.

If I hard-code each recursion, however, I can get the full data set to load in less than 30 seconds.

WITH CTE1 AS (
SELECT Note,
       Parent_Note,
       1 as Level,
       --Cast as nvarchar to keep data types the same
       CAST(QNote as nvarchar(MAX)) as Full_List
FROM Notifications
WHERE Parent_Note IS NULL
),

CTE2 AS (
SELECT Notifications.Note,
       CTE1.Note as Parent_Note,
       Level = CTE1.Level + 1,
       CAST(Full_List + '/' + Notifications.Note as nvarchar(MAX)) as Full_List
FROM CTE1 INNER JOIN Notifications ON CTE1.Note = Notifications.Parent_Note
),

CTE3 AS (
SELECT Notifications.Note,
       CTE2.Note as Parent_Note,
       Level = CTE2.Level + 1,
       CAST(Full_List + '/' + Notifications.Note as nvarchar(MAX)) as Full_List
FROM CTE2 INNER JOIN Notifications ON CTE2.Note = Notifications.Parent_Note
)

SELECT * FROM CTE1 UNION SELECT * FROM CTE2 UNION SELECT * FROM CTE3

I don't quite understand why the recursive query is so slow to the point where it doesn't load, while the hard-coded query loads the data in less than 30 seconds. I also don't want to use the hard-coded query, because I am unsure of how many "levels" will end up existing, though the current maximum is just six.

Any information anyone could share would be appreciated, and though I may not be able to fully provide information to help answer the question (asking this question for queries at work, so can't share data/query plans), I will definitely provide what I can.

Upvotes: 2

Answers (2)

Roger Wolf

Reputation: 7712

If you receive this kind of performance on just 22K rows, I would say there is something very wrong with your table schema. Is your table a heap, by any chance?

Here is a test setup I performed, with 100K rows of mock data:

-- Table with clustered PK and parent-child FK
create table dbo.Notes (
  Id int identity(1,1) primary key,
  ParentId int null references dbo.Notes(Id)
);
go
-- Generate data
insert into dbo.Notes (ParentId)
select top (100000) null
from sys.all_objects a, sys.all_objects b;
go
-- Introduce random parent-child hierarchy between rows
update sq set ParentId = sq.NewParent
from (
  select n.*,
    case when left(cast(newid() as binary(16)), 1) < 0xC0 then 1 else 0 end as [HasParent],
    abs(nullif(checksum(newid()) % (n.Id - 1), 0)) as [NewParent]
  from dbo.Notes n
  where n.Id > 1
    and n.ParentId is null
) sq
where sq.NewParent > 0
  and sq.HasParent = 1;
go
-- Create an index on ParentId
create index IX_Notes_ParentId on dbo.Notes (ParentId);
go

Here is the CTE query I was testing:

set statistics time on;
go
set statistics io on;
go

with cte as (
  select n.*, cast(n.Id as varchar(max)) as [NotePath]
  from dbo.Notes n where n.ParentId is null
  union all
  select n.*, c.NotePath + '/' + cast(n.Id as varchar(max))
  from dbo.Notes n
    inner join cte c on c.Id = n.ParentId
)
select c.*
from cte c
order by c.Id
option (recompile);
go

set statistics time off;
go
set statistics io off;
go

Here is time and CPU without the index on the ParentId column:

(100000 rows affected)

Table 'Worktable'. Scan count 100002, logical reads 1295309, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Notes'. Scan count 2, logical reads 426, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 1032 ms, elapsed time = 1522 ms.

And here is with the index:

(100000 rows affected)

Table 'Worktable'. Scan count 2, logical reads 736852, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Notes'. Scan count 100001, logical reads 200338, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 781 ms, elapsed time = 1313 ms.

As you can see, the difference is not all that significant between the two, which leads to the conclusion that it's the clustered index on the Id column that is important.

P.S. I have also tried the (ParentId, Id) index suggested by others, and it didn't show any statistically significant difference with the one that covers ParentId alone. Which is the expected behaviour because all nonclustered indices always include the entries from the clustered one as references; there is no need to add clustered index' columns into them, except for some edge cases.

Upvotes: 1

The Impaler

Reputation: 48850

I would make sure the table has the following index:

create index ix2 on notifications (parent_note, note);

Upvotes: 1

Recursive CTE performing slower than hard-coding the recursive actions

Answers (2)

Related Questions