ask_Overflow

Reputation: 275

Greatest N per Group in SQL Server

I want to solve the classical greatest-n-per-group problem. I have two methods for it:

  1. Dense_rank() Over() method
  2. Min() Over() method

Both work flawlessly. Now I want to find out which one is better, and why.

Sample data:

CREATE TABLE #test
    (
        id    INT,
        NAME  VARCHAR(50),
        dates DATETIME
    )

;WITH cte
     AS (SELECT TOP (100000) n = ( ( ROW_NUMBER() OVER (ORDER BY a.number) - 1 ) / 3 ) + 1
         FROM   [master]..spt_values a
                CROSS JOIN [master]..spt_values b)
INSERT INTO #test (id, NAME, dates)
SELECT n,
       n,              -- NAME: reuse n so every three rows share a group (the original SELECT was missing this column)
       GETDATE() + n
FROM   cte

Dense_rank Over() Method:

;WITH cte
     AS (SELECT DENSE_RANK() OVER (PARTITION BY NAME ORDER BY dates) AS rn,
                *
         FROM   #test)
SELECT id,
       NAME,
       dates
FROM   cte
WHERE  rn = 1;

Min Over() Method:

;WITH cte
     AS (SELECT MIN(dates) OVER (PARTITION BY NAME) AS max_date,
                *
         FROM   #test)
SELECT id,
       NAME,
       dates
FROM   cte
WHERE  max_date = dates;

To compare performance I checked the execution plan, which showed both queries at 50% of the batch cost. But the Min Over() method's execution plan looks a little more complex, so any guidance here would be helpful. I don't have a very good understanding of execution plans.

Execution Plan (6 records):


Execution Plan (100,000 records):

For 100,000 records the execution plan shows:

Dense_rank Over() Query cost : 46% 
Min Over() Query cost        : 54%


Upvotes: 1

Views: 1216

Answers (2)

jkdba

Reputation: 2519

When you are doing this kind of analysis, be sure to enable statistics; this data goes hand in hand with the execution plan and will bring more meaningful numbers to light.

Use SET STATISTICS IO ON and SET STATISTICS TIME ON (just don't forget to turn them off).
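As a minimal sketch of that workflow (reusing the Dense_rank query from the question; the read/CPU figures appear on the Messages tab in SSMS):

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Candidate query under measurement: greatest-n-per-group via DENSE_RANK
;WITH cte
     AS (SELECT DENSE_RANK() OVER (PARTITION BY NAME ORDER BY dates) AS rn,
                *
         FROM   #test)
SELECT id, NAME, dates
FROM   cte
WHERE  rn = 1;

-- Turn the switches back off so later queries aren't cluttered with stats output
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```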

You will want to look at logical reads and scans. Typically, the fewer the reads, the better the query performs; however, a query with fewer reads at one size won't necessarily stay ahead if it doesn't scale.

I ran both queries with the statistics I mentioned, and Dense_Rank is clearly doing far less work and is over 100 ms faster.

Dense_Rank:

 CPU time = 156 ms, elapsed time = 211 ms. 
 Table 'test'. Scan count 13, logical reads 359 

Min:

 CPU time = 389 ms, elapsed time = 317 ms 
 Table 'test'. Scan count 13, logical reads 359 
 Table 'Worktable'. Scan count 25, logical reads 203028

Upvotes: 4

Gordon Linoff

Reputation: 1271231

Here are some notes, which are too long for comments:

  • First, a bit of cognitive dissonance, because you are calculating min() but aliasing the result max_date. (The two functions should be equivalent in terms of performance.)
  • You really need to test on a larger amount of data. Evaluating execution plans on small amounts is misleading.
  • The two methods are not exactly the same. If you have NULL values in dates, the results could even differ: the min() over approach could return no rows for a group whose dates are all NULL, whereas dense_rank() will always return at least one row per group.
  • Hence, the execution plans should be different.
  • I would expect the two to be essentially the same in terms of performance.
  • Both methods should take advantage of an index on #test(name, dates), if the optimizer thinks this is desirable.

So, if you want a real answer to this question, generate a bunch of data (say, at least 100,000 rows) and look at the execution plans on that data.
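As a sketch of the indexing suggestion above (the index name is my own choice, not from the question), you could create the supporting index before re-running both queries:

```sql
-- Index on (NAME, dates) as suggested; ix_test_name_dates is a hypothetical name
CREATE INDEX ix_test_name_dates ON #test (NAME, dates);

-- Re-run both queries afterwards: with this index available, the optimizer can
-- read rows pre-sorted per NAME instead of sorting inside the plan
```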

Upvotes: 4
