Beau D'Amore

Reputation: 3392

SQL Server: Cursor to CTE, almost there

I have a cursor that I want to turn into a CTE.

The thing I want to avoid is the looping logic. The cursor gets back all the relevant 'folders' and then, as it loops, checks whether the new folderid equals the old folderid: if so, it sets DupCheck = 1, otherwise it sets DupCheck = 0.

As it loops, DupCheck stays 0 on the first row each time the folderid changes, and is set to 1 on each subsequent row until the folderid changes again. That way, only the first record in each set of folderids gets DupCheck = 0; all the subsequent ones get 1 (because they are duplicates)... make sense?
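To make the rule concrete, here is the same first-row-gets-0, rest-get-1 marking logic as a standalone Python sketch (the rows here are made-up sample data, not my real table):

```python
from itertools import groupby

# Hypothetical (folder_id, item_id) rows, already sorted by folder_id,
# mimicking the cursor's ORDER BY FolderID
rows = [(1, 10), (1, 11), (1, 12), (2, 20), (2, 21)]

dup_check = {}
for folder_id, group in groupby(rows, key=lambda r: r[0]):
    for i, (_, item_id) in enumerate(group):
        # first row of each folder keeps 0; subsequent rows are duplicates
        dup_check[item_id] = 0 if i == 0 else 1

print(dup_check)  # {10: 0, 11: 1, 12: 1, 20: 0, 21: 1}
```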

I can't quite see the exact solution, so I need a little help.

Here is the old cursor and what I have for the new CTE so far.

Old code (cursor):

-- old cursor
declare @DFolderId int, @DItemID int
declare @Dname varchar(100)     
declare @DFolderIdNew int
set @DFolderIdNew = 0

declare CurDup cursor for
    select FolderID, EntityName, ItemID  
    from @TempTable 
    where FolderID in (select FolderID 
                       from @TempTable 
                       group by FolderID 
                       having count(*) > 1) 
    order by FolderID, EntityType

open CurDup

fetch next from CurDup into @DFolderId, @Dname, @DItemID

while @@FETCH_STATUS = 0
begin
    if @DFolderIdNew = @DFolderId
    begin
        -- same FolderID as the previous row: mark it as a duplicate
        update @TempTable 
        set DupCheck = 1 
        where FolderID = @DFolderId and ItemID = @DItemID                            
    end
    else 
    begin
        -- first row of a new FolderID: not a duplicate
        update @TempTable 
        set DupCheck = 0 
        where FolderID = @DFolderId and ItemID = @DItemID                          
    end

    set @DFolderIdNew = @DFolderId

    fetch next from CurDup into @DFolderId, @Dname, @DItemID
end

close CurDup
deallocate CurDup

CTE replacement

-- CTE replacement for cursor
;WITH cteCurDup (FolderID, EntityName, EntityType, ItemID, RowNum) AS
(
    SELECT 
        FolderID, EntityName, EntityType, ItemID, 
        ROW_NUMBER() OVER (ORDER BY FolderID, EntityType) AS RowNum
    FROM 
        @TempTable 
    WHERE 
        FolderID IN (SELECT FolderID 
                     FROM @TempTable 
                     GROUP BY FolderID 
                     HAVING COUNT(*) > 1)
        -- AND RowNum > 1
)
UPDATE @TempTable 
SET DupCheck = 1 
WHERE ItemID IN (SELECT ItemID FROM cteCurDup WHERE RowNum > 1)

Upvotes: 4

Views: 1279

Answers (4)

Steve

Reputation: 960

@BeauD'Amore, just some more background following on from the comments on your answer.

The reason why set-based code is usually more performant over time is because the query engine can choose how to execute the query and is able to optimise its choices (including keeping it optimised as database conditions change, for example if the balance of rows in different tables changes over time, or if an index is added).

I would say that a programmer primarily understands the concepts of what he is trying to achieve, and prefers to write the code in a way that conforms to those concepts (or to a familiar pattern known to achieve what he wants), rather than concerning himself too much with what is computationally efficient. Not least, this is so he can maintain the code later without bending his mind around an optimisation that distorts the concepts involved, and can modify it without completely redoing the code to unwind an optimisation that was only narrowly applicable and cannot be reconciled with the desired modification. This is the same reason we no longer write in assembler.

Given a query, the query engine may make execution choices that would seem absurd or unfathomably complex, but that can nevertheless be shown by theory to be logically equivalent to what the programmer has written, and shown by the statistics of previous executions to be a more efficient approach. It can try a number of plausible approaches, and learn from past experience, before settling on a particular plan for a particular query.

The functional style of set-based code also tends to be terser, and its permutations more easily understood (once the programmer is familiar with the language). It is often closer to how humans optimise familiar tasks in the real world: you do not go between the supermarket and home once per item, you buy all the items in one trip.

You note that you haven't seen any improvement yet. None is guaranteed, but one possible explanation is that converting individual cursors to set-based code, while still maintaining a series of separate statements with assignments to intermediate temporary tables, continues to frustrate the query optimiser. The query engine will silently generate temporary tables itself if it deems them necessary, but if you impose them in the code as written, it has no choice in the matter, even if it would otherwise do without them. Crucially, it also does not try to analyse the separate statements on either side of an explicit temporary table to see whether they can be boiled down together into something more efficient.

Thus any improvement in performance, if there is any to be had, tends to come in a big bang once all code has been converted into a single statement (using WITH clauses if necessary, to allow the programmer to organise the code conceptually into steps, but which the query engine is not obliged to obey as part of its actual execution).

Another advantage once code is set-based, and amenable to analysis by the query engine, is that you get a comprehensive query plan and associated statistics of the entire procedure, which shows you which logical operations are actually causing all the pain and why. The fact that the query engine knows that these pain points exist, and hasn't been able to do anything to improve them, can then give the programmer insights to how the database or the query can be actually redefined in order to ease them (adding indexes, creating summary tables, archiving old data, adding extra constraints into the query that allow a smaller set of data to be considered, etc.).

I also wrote an answer recently to help another user with much the same subject. It's probably a bit long-winded (like my thoughts in this answer, I suppose!), but the upshot was that the user reported the execution time of a fairly complex query ultimately falling from 17 seconds to about 2, by doing nothing more than eliminating imperative code (which also made the code patently clearer and shorter).

https://stackoverflow.com/a/47985111/9129668

Upvotes: 1

Beau D'Amore

Reputation: 3392

Thank you both.

@SerkanArslan I tried your way and another way I came up with in the interim between posting and receiving.

;WITH cteCurDup (FolderID, EntityName, EntityType, ItemID, RowNum) AS
(
    SELECT 
        FolderID, EntityName, EntityType, ItemID, 
        ROW_NUMBER() OVER (PARTITION BY FolderID ORDER BY FolderID, EntityType) AS RowNum
    FROM 
        @TempTable 
    WHERE 
        FolderID IN (SELECT FolderID 
                     FROM @TempTable 
                     GROUP BY FolderID 
                     HAVING COUNT(*) > 1)
)
UPDATE @TempTable 
SET DupCheck = 1 
WHERE ItemID IN (SELECT ItemID FROM cteCurDup WHERE RowNum > 1)

However, the absurd thing about BOTH is that NEITHER is faster than the original cursor... (our indexes are another issue)

[Screenshot from SQL Profiler: top is the new query, bottom is the old cursor]

Upvotes: 0

Steve

Reputation: 960

From what I can tell, you're marking the second and subsequent rows for each FolderID as duplicates.

The following is my take on it, although I see @SerkanArslan has already posted while I was looking at your code.

WITH folders_and_items AS
(
    SELECT 
        FolderID
        ,EntityName
        ,ItemID
        ,IIF(ROW_NUMBER() OVER (PARTITION BY FolderID ORDER BY EntityType) > 1, 1, 0) AS new_dup_check

    FROM 
        @TempTable 
)

UPDATE 
    folders_and_items
SET 
    DupCheck = new_dup_check

As an aside, I can never understand why anyone thinks writing with cursors is the easy way - they're an absolute dog's dinner compared to set-based solutions!

EDIT: I've also left out the logic that updates only the duplicate records - but realistically, it may be easier and more reliable to update the DupCheck mark on the entire table and leave the unique-filtering logic out, since it's a temp table anyway.
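The whole-table behaviour can be smoke-tested outside SQL Server. Here's a minimal sketch in Python against an in-memory SQLite database (window functions need SQLite 3.25+; CASE stands in for T-SQL's IIF, which is just shorthand for it, and the sample rows are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TempTable (FolderID INT, EntityType INT, ItemID INT)")
con.executemany(
    "INSERT INTO TempTable VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 11),  # FolderID 1 has a duplicate
     (2, 1, 20)],             # FolderID 2 is unique
)

# Every row gets a mark: first row per folder -> 0, later rows -> 1.
# Unique folders simply come out as 0, which is why no duplicate filter is needed.
rows = con.execute("""
    SELECT ItemID,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY FolderID ORDER BY EntityType) > 1
                THEN 1 ELSE 0 END AS new_dup_check
    FROM TempTable
    ORDER BY ItemID
""").fetchall()

print(rows)  # [(10, 0), (11, 1), (20, 0)]
```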

Upvotes: 2

Serkan Arslan

Reputation: 13393

You can use this.

;With cteCurDup (FolderID, EntityName, EntityType, ItemID, DupCheck, RowNum, RowCnt )
as 
(
    SELECT FolderID, EntityName, Entitytype, ItemID, DupCheck,
        ROW_NUMBER() OVER (PARTITION BY FolderID ORDER BY EntityType) AS RowNum,
        COUNT(*) OVER (PARTITION BY FolderID ) AS RowCnt
    FROM @TempTable 
)
Update cteCurDup set DupCheck= (CASE WHEN RowNum = 1 THEN 0 ELSE 1 END)
where RowCnt > 1
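For comparison, the windowed logic of this CTE (first row per folder -> 0, later rows -> 1, folders with only one row skipped by RowCnt > 1) can be verified with a quick Python/SQLite sketch (SQLite 3.25+ for window functions; sample data is invented, and a plain SELECT stands in for the updatable CTE):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TempTable (FolderID INT, EntityType INT, ItemID INT)")
con.executemany(
    "INSERT INTO TempTable VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 11), (1, 3, 12),  # FolderID 1 has duplicates
     (2, 1, 20)],                          # FolderID 2 is unique
)

# Same shape as the answer: RowNum orders rows within a folder,
# RowCnt filters out folders with no duplicates.
rows = con.execute("""
    WITH cteCurDup AS (
        SELECT ItemID,
               ROW_NUMBER() OVER (PARTITION BY FolderID ORDER BY EntityType) AS RowNum,
               COUNT(*)     OVER (PARTITION BY FolderID)                     AS RowCnt
        FROM TempTable
    )
    SELECT ItemID, CASE WHEN RowNum = 1 THEN 0 ELSE 1 END AS DupCheck
    FROM cteCurDup
    WHERE RowCnt > 1
    ORDER BY ItemID
""").fetchall()

print(rows)  # [(10, 0), (11, 1), (12, 1)] -- ItemID 20 is left untouched
```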

Upvotes: 3
