CJ Dennis
CJ Dennis

Reputation: 4346

How can I prevent SQL Server from squaring the number of rows scanned?

I'm running a query over a table variable that holds 22 227 rows. The query used to take 2-3 seconds to complete (which I still think is too slow) but since I added another field to the ORDER BY clause in DENSE_RANK() it now completes in 4.5 minutes!

If I include [t2].[aisdt] with or without [t2].[aiID], the execution plan shows that it's scanning 494 039 529 rows, which is 22 227 squared. The following query generates the correct results, just much too slowly to be useful.

SELECT MAX([t].[SetNum]) OVER (PARTITION BY NULL) AS [MaxSet]
      ,*
FROM (
    SELECT DENSE_RANK() OVER (ORDER BY [t2].[aisdt], [t2].[aiID]) AS [SetNum]
          ,[t2].*
    FROM (
        SELECT [aiID]
              ,COUNT(DISTINCT [acID]) AS [noac]
        FROM @Temp
        GROUP BY [aiID]
    ) [t1]
    JOIN @Temp [t2]
      ON [t2].[aiID] = [t1].[aiID]
    WHERE [t1].[noac] < [t2].[asm]
) [t]

Just to be clear, the culprit is the bold section in "DENSE_RANK() OVER (ORDER BY [t2].[aisdt], [t2].[aiID])". Removing this field (which needs to remain) drops the execution time back down to 2-3 seconds. I think it might have something to do with JOINing the table to itself on [aiID] but not [aisdt].

How can I speed this query up to complete in the same time as before, or less?


EDIT

Table definition:

DECLARE @Temp TABLE (
    [aiID] INT NOT NULL INDEX [IX_Temp_aiID] -- not unique
    ,[aisdt] DATETIME NOT NULL INDEX [IX_Temp_aisdt] -- not unique
    ,[asm] INT NOT NULL
    ,[cpcID] INT NULL
    ,[cpce] VARCHAR(10) NULL
    ,[acID] INT NULL
    ,[ctvID] INT NULL
    ,[ct] VARCHAR(100) NULL
    ,[_36_other_non_matched_fields_] VARCHAR(MAX)

    ,UNIQUE ([aiID], [cpcID], [cpce], [acID], [ctvID], [ct])
)

[aisdt] is unique per [aiID], but there can be multiple [aiID]s with the same [aisdt].

INSERT INTO @TEMP
VALUES (64, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah')
      ,(64, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 32, '', 32, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 32, '', 47, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 47, '', 32, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 47, '', 47, NULL, NULL, 'blah')
      ,(54, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah')
      ,(54, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 32, '', 32, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 32, '', 47, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 47, '', 32, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 47, '', 47, NULL, NULL, 'blah')

It must be sorted by [aisdt] (datetime) first, then [aiID], then numbered into sets based on [aiID].

I want to see:

5, 1, 54, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah'
5, 1, 54, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah'
5, 2, 64, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah'
5, 2, 64, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 32, '', 32, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 32, '', 47, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 47, '', 32, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 47, '', 47, NULL, NULL, 'blah'

Upvotes: 2

Views: 133

Answers (2)

Vladimir Baranov
Vladimir Baranov

Reputation: 32693

The main idea is taken from Partition Function COUNT() OVER possible using DISTINCT that @Jayvee pointed out with a small addition that would make it work when acID has NULL values.

Most likely you can remove all indexes from your @Temp table, the server will have to sort it in several different ways for different window functions anyway, but there is no self-join, so it should be faster.

The plan will have many sorts and they also can be slow, especially when engine underestimates the number of rows in a table. And table variable is exactly this case. Optimiser thinks that table variable has only 1 row. So, I'd recommend to use a classic #Temp table here, even without indexes.

An index on (aiID, acID) should help, but there will be other sorts any way.

WITH
CTE_Counts
AS
(
    SELECT
        *
        -- use DENSE_RANK() to calculate COUNT(DISTINCT)
        , DENSE_RANK() OVER (PARTITION BY [aiID] ORDER BY [acID])
        + DENSE_RANK() OVER (PARTITION BY [aiID] ORDER BY [acID] DESC)
        -- subtract extra 1 if acID has NULL values within the partition
        - MAX(CASE WHEN [acID] IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY [aiID])
        - 1 AS [noac]
    FROM @Temp
)
,CTE_SetNum
AS
(
    SELECT
        *
        , DENSE_RANK() OVER (ORDER BY [aisdt], [aiID]) AS [SetNum]
    FROM CTE_Counts
    WHERE [noac] < [asm]
)
SELECT
    *
    , MAX([SetNum]) OVER () AS [MaxSet]
FROM CTE_SetNum
ORDER BY
    [aisdt]
    ,[aiID]
    ,[SetNum]
;

Upvotes: 2

Jayvee
Jayvee

Reputation: 10875

Index as suggested in the comments would definitely play a major part but also I think you can re-write the query without self join in this way:

   SELECT MAX([t].[SetNum]) OVER (PARTITION BY NULL) AS [MaxSet]
      ,*
FROM (
    SELECT DENSE_RANK() OVER (ORDER BY [t1].[aisdt], [t1].[aiID]) AS [SetNum]
          ,[t1].*
    FROM (
        SELECT * ,dense_rank() over(partition by aiID order by [acID]) - 
        dense_rank() over(partition by aiID order by [acID]) - 1 AS [noac]
        FROM @Temp
    ) [t1]
    WHERE [t1].[noac] < [t1].[asm]
) [t]

Upvotes: 1

Related Questions