user1330974
user1330974

Reputation: 2616

Get DISTINCT COUNT in one pass in SQL Server

I have a table like below:

Region    Country    Manufacturer    Brand    Period    Spend
R1        C1         M1              B1       2016      5
R1        C1         M1              B1       2017      10
R1        C1         M1              B1       2017      20
R1        C1         M1              B2       2016      15
R1        C1         M1              B3       2017      20
R1        C2         M1              B1       2017      5
R1        C2         M2              B4       2017      25
R1        C2         M2              B5       2017      30
R2        C3         M1              B1       2017      35
R2        C3         M2              B4       2017      40
R2        C3         M2              B5       2017      45
...

I wrote the query below to aggregate them:

SELECT [Region]
    ,[Country]
    ,[Manufacturer]
    ,[Brand]
    ,Period
    ,SUM([Spend]) AS [Spend]
FROM myTable
GROUP BY [Region]
    ,[Country]
    ,[Manufacturer]
    ,[Brand]
    ,[Period]
ORDER BY 1,2,3,4

which yields something like below:

Region    Country    Manufacturer    Brand    Period    Spend
R1        C1         M1              B1       2016      5
R1        C1         M1              B1       2017      30 -- this row is an aggregate from raw table above
R1        C1         M1              B2       2016      15
R1        C1         M1              B3       2017      20
R1        C2         M1              B1       2017      4  -- aggregated result
R1        C2         M2              B4       2017      25
R1        C2         M2              B5       2017      30
R2        C3         M2              B4       2017      40
R2        C3         M2              B5       2017      45

I'd like to add another column to the above table that shows the DISTINCT COUNT of Brand grouped by Region,Country,Manufacturer and Period. So the final table would become as follow:

Region    Country    Manufacturer    Brand    Period    Spend    UniqBrandCount
R1        C1         M1              B1       2016      5        2 -- two brands by R1, C1, M1 in 2016
R1        C1         M1              B1       2017      30       1
R1        C1         M1              B2       2016      15       2 -- same as first row's result
R1        C1         M1              B3       2017      20       1
R1        C2         M1              B1       2017      4        1
R1        C2         M2              B4       2017      25       2
R1        C2         M2              B5       2017      30       2
R2        C3         M2              B4       2017      40       2
R2        C3         M2              B5       2017      45       2

I know how to get to final result in three steps.

  1. Run this query (Query #1):

    SELECT [Region] ,[Country] ,[Manufacturer] ,[Period] ,COUNT(DISTINCT [Brand]) AS [BrandCount] INTO Temp1 FROM myTable GROUP BY [Region] ,[Country] ,[Manufacturer] ,[Period]

  2. Run this query (Query #2)

    SELECT [Region] ,[Country] ,[Manufacturer] ,[Brand] ,YEAR([Period]) AS Period ,SUM([Spend]) AS [Spend] INTO Temp2 FROM myTable GROUP BY [Region] ,[Country] ,[Manufacturer] ,[Brand] ,[Period]

  3. Then LEFT JOIN Temp2 and Temp1 to bring in [BrandCount] from the latter like below:

    SELECT a.* ,b.* FROM Temp2 AS a LEFT JOIN Temp1 AS b ON a.[Region] = b.[Region] AND a.[Country] = b.[Country] AND a.[Advertiser] = b.[Advertiser] AND a.[Period] = b.[Period]

I'm pretty sure there is a more efficient way to do this, is there? Thank you in advance for your suggestions/answers!

Upvotes: 2

Views: 455

Answers (3)

mjsqu
mjsqu

Reputation: 5452

Borrowing heavily from this question: https://dba.stackexchange.com/questions/89031/using-distinct-in-window-function-with-over

Count Distinct doesn't work, so dense_rank is required. Ranking the brands in forward and then reverse order, and then subtracting 1 gives the distinct count.

Your sum function can also be rewritten using PARTITION BY logic. This way you can use different grouping levels for each aggregation:

SELECT 
[Region]
,[Country]
,[Manufacturer]
,[Brand]
,[Period]
,dense_rank() OVER 
    (PARTITION BY 
     [Region] 
    ,[Country]
    ,[Manufacturer]
    ,[Period] Order by Brand) 
+ dense_rank() OVER 
    (PARTITION BY 
     [Region] 
    ,[Country]
    ,[Manufacturer]
    ,[Period] Order by Brand Desc) 
- 1  
AS [BrandCount]
,SUM([Spend]) OVER
    (PARTITION BY
     [Region] 
    ,[Country]
    ,[Manufacturer]
    ,[Brand]
    ,[Period]) as [Spend]
from
myTable
ORDER BY 1,2,3,4

You may then need to reduce the number of rows in your output, as this syntax gives the same number of rows as myTable, but with the aggregation totals appearing on each row they apply to:

R1  C1  M1  B1  2016    2   5
R1  C1  M1  B1  2017    2   30 --dup1
R1  C1  M1  B1  2017    2   30 --dup1
R1  C1  M1  B2  2016    2   15
R1  C1  M1  B3  2017    2   20
R1  C2  M1  B1  2017    1   5
R1  C2  M2  B4  2017    2   25
R1  C2  M2  B5  2017    2   30
R2  C3  M1  B1  2017    1   35
R2  C3  M2  B4  2017    2   40
R2  C3  M2  B5  2017    2   45

Selecting distinct rows from this output gives you what you need.

How the dense_rank trick works

Consider this data:

Col1    Col2
B       1
B       1
B       3
B       5
B       7
B       9

dense_rank() ranks data according to the number of distinct items before the current one, plus 1. So:

1->1, 3->2, 5->3, 7->4, 9->5.

In reverse order (using desc) this yields the reverse pattern:

1->5, 3->4, 5->3, 7->2, 9->1:

Adding these ranks together gives the same value:

1+5 = 2+4 = 3+3 = 4+2 = 5+1 = 6

The wording is helpful here,

(number of distinct items before + 1) + (number of distinct items after + 1) 
= number of distinct OTHER items before AND after + 2 
= Total number of distinct items + 1

So to get the total number of distinct items, add the ascending and descending dense_ranks together and subtract 1.

Upvotes: 2

Martin Smith
Martin Smith

Reputation: 453736

The double dense_rank idea means that you need two sorts (assuming no index exists that provides sort order). Assuming no NULL brands (as that idea does) you can use a single dense_rank and a windowed MAX as below (demo)

WITH T1
     AS (SELECT *,
                DENSE_RANK() OVER (PARTITION BY [Region], [Country], [Manufacturer], [Period] ORDER BY Brand) AS [dr]
         FROM   myTable),
     T2
     AS (SELECT *,
                MAX([dr]) OVER (PARTITION BY [Region], [Country], [Manufacturer], [Period]) AS UniqBrandCount
         FROM   T1)
SELECT [Region],
       [Country],
       [Manufacturer],
       [Brand],
       Period,
       SUM([Spend])        AS [Spend],
       MAX(UniqBrandCount) AS UniqBrandCount
FROM   T2
GROUP  BY [Region],
          [Country],
          [Manufacturer],
          [Brand],
          [Period]
ORDER  BY [Region],
          [Country],
          [Manufacturer],
          [Period],
          Brand 

The above has some inevitable spooling (it isn't possible to do this in a 100% streaming manner) but a single sort.

Strangely the final order by clause is needed to keep the number of sorts down to one (or zero if a suitable index exists).

enter image description here

Upvotes: 1

Tolu
Tolu

Reputation: 175

The tag to your question;

window-functions

suggests you have a pretty good idea.

For DISTINCT COUNT of Brand grouped by Region,Country,Manufacturer and Period: you may write:

Select   Region 
        ,Country
        ,Manufacturer
        ,Brand
        ,Period
        ,Spend
        ,DENSE_RANK() Over (Partition By Region, Country, Manufacturer, Period Order By Brand asc) 
         + DENSE_RANK() Over (Partition By Region, Country, Manufacturer, Period Order By Brand desc) 
         -1 UniqBrandCount
From myTable T1
Order By 1,2,3,4

Upvotes: 2

Related Questions