Tasos
Tasos

Reputation: 7577

Get the groups items after a GROUP BY in SQL Server

I have a table that has these columns:

UserID1, UserID2, ProductID, PurchaseDate

The following query run in a purchases table and return couple of users that have more than a number of interactions between them, regardless the order in the last 31 days:

DECLARE @threshold AS INT
DECLARE @days AS INT

SET @threshold = 10
SET @days = 31

SELECT 
    UserID1, UserID2, COUNT(*) AS Counter
FROM 
    (SELECT
        --do this to revert columns and count as one case both Col1,Col2 and Col2,Col1
        CASE 
           WHEN UserID1 < UserID2 
              THEN UserID1 
              ELSE UserID2 
        END AS UserID1,
        CASE 
           WHEN UserID1 < UserID2 
              THEN UserID2 
              ELSE UserID1 
        END AS UserID2
    FROM
        Purchases WITH(NOLOCK)
    WHERE 
        Deadline BETWEEN DATEADD(day, -@days, GETDATE()) AND GETDATE()) t
GROUP BY 
    UserID1, UserID2
HAVING 
    COUNT(*) > @threshold

Yields to:

UserID1  UserID2  Counter
1        2        10
3        2        5
4        1        8

However, what I want is to return a table with the ProductID and the PurchaseDate in separate rows like this

UserID1  UserID2  ProductID  PurchaseDate
1        2        12345      2017-01-18 00:13:52
1        2        5425       2017-01-12 15:10:02
1        2        64362      2017-01-05 10:10:02
..... for the 10 interactions
3        2        25235      2017-01-18 00:13:52
3        2        436346     2017-01-14 00:13:52
..... for the 5 interactions
4        1        23523      2017-01-14 00:13:52
4        1        135135     2017-01-09 00:13:52
..... for the 8 interactions

Is there any way without putting the results of the first query in a temp table and then join it again with the Purchases table to find all the purchases?

Upvotes: 1

Views: 118

Answers (2)

Vladimir Baranov
Vladimir Baranov

Reputation: 32695

If I understood you correctly, then simple windowed COUNT would help here.

The optimiser should be smart enough to do it in one scan of the table.

DECLARE @threshold AS INT;
DECLARE @days AS INT;

SET @threshold = 10;
SET @days = 31;

WITH
CTE_Purchases
AS
(
    SELECT
        --do this to revert columns and count as one case both Col1,Col2 and Col2,Col1
        CASE 
            WHEN UserID1 < UserID2 
            THEN UserID1 
            ELSE UserID2 
        END AS UserID1
        ,CASE 
            WHEN UserID1 < UserID2 
            THEN UserID2 
            ELSE UserID1 
        END AS UserID2
        ,ProductID
        ,PurchaseDate
    FROM
        Purchases
    WHERE 
        Deadline BETWEEN DATEADD(day, -@days, GETDATE()) AND GETDATE()
)
,CTE_Counts
AS
(
    SELECT
        UserID1
        ,UserID2
        ,ProductID
        ,PurchaseDate
        ,COUNT(*) OVER (PARTITION BY UserID1, UserID2) AS Counter
        -- calc COUNT for groups without explicit GROUP BY
    FROM CTE_Purchases
)
SELECT
    UserID1
    ,UserID2
    ,ProductID
    ,PurchaseDate
    ,Counter
FROM CTE_Counts
WHERE Counter > @threshold
-- this filter is instead of your HAVING
;

Upvotes: 2

Philip P.
Philip P.

Reputation: 1212

Disclaimer: I have not tested the code, it was written outside of the T-SQL IDE. Code below is based on the assumption that the following is always true: UserID1 != UserID2.

1) I would suggest using a MAX/MIN values solution to treat [Col1,Col2] the same way as [Col2,Col1]. It's probably going to perform better and handles NULLs properly too. You need SQL Server 2008 (or higher) for this to work.

SELECT
    (SELECT MAX(usr) FROM (VALUES (UserID1), (UserID2) as User(usr)) as UserID1,
    (SELECT MIN(usr) FROM (VALUES (UserID1), (UserID2) as User(usr)) as UserID2
FROM
    Purchases

2) Now we need to actually count the interactions between them, which should be easy enough. To keep the code clean we can use a CTE for our previous statement, I'm adding the deadline filter there:

;WITH CTE_UserInteractions AS (
    SELECT
        (SELECT MAX(usr) FROM (VALUES (UserID1), (UserID2) as User(usr)) as FirstUser,
        (SELECT MIN(usr) FROM (VALUES (UserID1), (UserID2) as User(usr)) as SecondUser
    FROM
        Purchases
    WHERE
        Deadline BETWEEN DATEADD(day,-@days,GETDATE()) AND GETDATE()
)

SELECT
    FirstUser,
    SecondUser
FROM
    CTE_UserInteractions
GROUP BY
    FirstUser, SecondUser
HAVING
    COUNT(*) > @Threshold

Quick note here: one might find that calculating the left deadline boundary ahead of time positively impacts performance. For example, before running the batch we can do:

DECLARE @StartDate DATETIME = DATEADD(DAY,-@days,GETDATE())

And then we can use the @StartDate in the WHERE clause.

3) Finally, we can use CROSS APPLY to get the list of products and purchases for user "pairs" that are left as a result. We can either use a sub-select (my solution) or pre-populate a temporary table with the step #2 results if performance is affected.

;WITH CTE_UserInteractions AS (
    SELECT
        (SELECT MAX(usr) FROM (VALUES (UserID1), (UserID2) as User(usr)) as FirstUser,
        (SELECT MIN(usr) FROM (VALUES (UserID1), (UserID2) as User(usr)) as SecondUser
    FROM
        Purchases AS p1
    WHERE
        Deadline BETWEEN DATEADD(day,-@days,GETDATE()) AND GETDATE()
)

SELECT
    groupedUsers.FirstUser as UserID1,
    groupedUsers.SecondUser as UserID2,
    products.ProductID,
    products.PurchaseDate
FROM (
    SELECT
        FirstUser,
        SecondUser
    FROM
        CTE_UserInteractions
    GROUP BY
        FirstUser, SecondUser
    HAVING
        COUNT(*) > @Threshold
) groupedUsers
CROSS APPLY (
    SELECT
        ProductID, PurchaseDate
    FROM
        Purchases AS p1
    WHERE
        p1.UserID1 = FirstUser AND p1.UserID2 = SecondUser
    UNION ALL
    SELECT
        ProductID, PurchaseDate
    FROM
        Purchases AS p2
    WHERE
        p2.UserID2 = FirstUser AND p2.UserID1 = SecondUser
) products

Upvotes: 0

Related Questions