mBrice1024
mBrice1024

Reputation: 838

SQL Server - Group by day for the top N of the range

What I need to do is get a Cost breakout for each grouping, aggregated by day. Also, only taking the top N per the whole date range. I'm probably not explaining this well so let me give examples. Say my table schema and data looks like this:

SoldDate                  Product               State     Cost  
-----------------------  --------------------- --------- ------
2017-07-11 01:00:00.000  Apple                 NY         6
2017-07-11 07:00:00.000  Banana                NY         1
2017-07-11 07:00:00.000  Banana                NY         1
2017-07-12 01:00:00.000  Pear                  NY         2
2017-07-12 03:00:00.000  Olive                 TX         1
2017-07-12 16:00:00.000  Banana                NY         1
2017-07-13 22:00:00.000  Apple                 NY         6
2017-07-13 22:00:00.000  Apple                 NY         6
2017-07-13 23:00:00.000  Banana                NY         1

Call this table SoldProduce. Now what I'm looking for is to group by Day, Product and State but for each day, only take the top two of the group NOT the top of that particular day. Anything else gets lumped under 'other'. So in this case, our top two groups with the greatest Cost are Apple-NY and Banana-NY. So those are the two that should show up in the output only. Anything else is under 'Other'

So in the end this is the desired output:

SoldDay                  Product               State     Total Cost  
-----------------------  --------------------- --------- ------
2017-07-11 00:00:00.000  Apple                 NY         6
2017-07-11 00:00:00.000  Banana                NY         2
2017-07-11 00:00:00.000  OTHER                 OTHER      0
2017-07-12 00:00:00.000  OTHER                 OTHER      3
2017-07-12 00:00:00.000  Banana                NY         1
2017-07-13 00:00:00.000  Apple                 NY         12
2017-07-13 00:00:00.000  Banana                NY         1
2017-07-13 00:00:00.000  OTHER                 OTHER      0

Note how on the 12th Pear and Olive were lumped under other. Even though it outsold Banana on that day. This is because I want the Top N selling groups for the whole range, not just on a day by day basis.

I did a lot of googleing a way to make a query to get this data but I'm not sure if it's the best way:

 WITH TopX AS
    (
        SELECT 
            b.Product, 
            b.State, 
            b.SoldDate, 
            b.Cost,
            DENSE_RANK() OVER (ORDER BY GroupedCost DESC) as [Rank]
        FROM
        (
            SELECT 
                b.Product, 
                b.State, 
                b.SoldDate, 
                b.Cost, 
                SUM(b.Cost) OVER (PARTITION BY b.Product, b.State) as GroupedCost
            FROM 
            SoldProduce b WITH (NOLOCK)  
        ) as b
    )

    SELECT 
        DATEADD(d,DATEDIFF(d,0,SoldDate),0), 
        b.Product, 
        b.State, 
        SUM(b.Cost)
    FROM 
        TopX b
    WHERE 
        [Rank] <= 2
    GROUP BY 
        DATEADD(d,DATEDIFF(d,0,SoldDate),0), 
        b.Product, 
        b.State
    UNION ALL
    SELECT 
        DATEADD(d,DATEDIFF(d,0,SoldDate),0), 
        null, 
        null, 
        SUM(b.Cost)
    from 
        TopX b
    WHERE 
        [Rank] > 2
    GROUP BY 
        DATEADD(d,DATEDIFF(d,0,SoldDate),0)

Step 1) Create a common query that first projects the cost that the row would be has we just grouped by Product and State. Then it does a second projection to rank that cost 1-N where 1 has the greatest grouped cost.

Step 2) Call upon the common query, grouping by day and restricting to rows <= 2. This is the Top elements. Then union the other category to this, or anything ranked > 2. 

What do you guys think? Is this an efficient solution? Could I do this better?

Edit:

FuzzyTrees suggestion benchmarks better than mine.

Final query used:

    WITH TopX AS
    (
        SELECT 
            TOP(2)
            b.Product, 
            b.State
        FROM
            SoldProduce b
        GROUP BY 
            b.Product, 
            b.State
        ORDER BY
            SUM(b.Cost)
    )

    SELECT 
        DATEADD(d,DATEDIFF(d,0,SoldDate),0), 
        coalesce(b.Product, 'Other') Product, 
        coalesce(b.State, 'Other') State, 
        SUM(b.Cost)
    FROM 
        SoldProduce a
        LEFT JOIN TopX b ON
            (a.Product = b.Product OR (a.Product IS NULL AND b.Product IS NULL)) AND
            (a.State = b.State OR (a.State IS NULL AND b.State IS NULL))
    GROUP BY 
        DATEADD(d,DATEDIFF(d,0,SoldDate),0), 
        coalesce(b.Product, 'Other') Product, 
        coalesce(b.State, 'Other') State, 
    ORDER BY DATEADD(d,DATEDIFF(d,0,SoldDate),0) 
    -- Order by optional. Just for display purposes. 
    --More effienct to order in code for the final product. 
    --Don't use I/O if you don't have to :)

Upvotes: 2

Views: 218

Answers (1)

FuzzyTree
FuzzyTree

Reputation: 32392

I suggest using a plain group by without window functions for your TopX view:

With TopX AS
(
    select top 2 Product, State
    from SoldProduce
    group by Product, State
    order by sum(cost) desc
)

Then you can left join to your TopX view and use coalesce to determine which products fall into the Other group

select
    coalesce(TopX.Product, 'Other') Product,
    coalesce(TopX.State, 'Other') State,
    sum(Cost),
    sp.SoldDate
from SoldProduce sp
left join TopX on TopX.Product = sp.Product
    and TopX.State = sp.State
group by 
    coalesce(TopX.Product, 'Other'),
    coalesce(TopX.State, 'Other'),
    SoldDate
order by SoldDate

Note: This query will not return 0 counts

Upvotes: 1

Related Questions