Mark McGown
Mark McGown

Reputation: 1095

How to perform SQL join on lag date in Snowflake?

I have a table with charges and user id's and a separate table with plan changes. I need to join these tables in such a way that I know what the user's plan id was at the time of the charge in Snowflake.

Charge Table:

Charge_ID,User_ID,Inserted_At
AAAA     ,1234   ,2022-01-01 15:00:00
AAAA     ,1234   ,2022-02-01 15:00:00
BBBB     ,5678   ,2022-01-05 18:00:00
BBBB     ,5678   ,2022-02-07 18:00:00

Plan Table:

User_ID,Plan_ID,Inserted_At
1234   ,100    ,2022-01-01 13:00:00
1234   ,099    ,2022-01-01 14:00:00
1234   ,101    ,2022-01-18 13:00:00
5678   ,050    ,2022-01-04 13:00:00
5678   ,051    ,2022-02-08 13:00:00

Result:

Charge_ID,User_ID,Charge_Inserted_At  ,Plan_ID
AAAA     ,1234   ,2022-01-01 15:00:00 ,099
AAAA     ,1234   ,2022-02-01 15:00:00 ,101
BBBB     ,5678   ,2022-01-05 18:00:00 ,050
BBBB     ,5678   ,2022-02-07 18:00:00 ,050

Do I need to cross join and lag, if so how can I accomplish that? Is there a way to accomplish this that's optimally efficient beyond some type of cross join?

Upvotes: 1

Views: 628

Answers (2)

Lukasz Szozda
Lukasz Szozda

Reputation: 175586

One way to solve it is to "transpose" single "InsetedAt" to pair date_from and date_to using LAG/LEAD and perform range join:

Pseudocode:

WITH ChargeTransposed AS (
   SELECT Charge_id, User_ID, Inserted_At AS start_date, 
          LEAD(Inserted_at,1,'2999-12-31') OVER(PARTITION BY ChargeId, UserId 
                                                ORDER BY InsertedAt) AS end_date
   FROM ChargeTable 
)
SELECT *
FROM ChargeTransposed c
LEFT JOIN PlanTable p
  ON c.User_ID = p.User_id
 AND p.InsertedAt >= c.start_date AND p.InsetedAt < c.end_date
-- QUALIFTY ROW_NUMBER() OVER(PARTITION BY c.Charge_Id, c.User_Id,  c.Start_date 
            ORDER BY p.InsertAt DESC) = 1

Upvotes: 1

Greg Pavlik
Greg Pavlik

Reputation: 11046

This is an alternate approach to using a window function. It's a subquery to find the closest date with a self-join. This approach can perform better than a window function, especially if there's a known limitation in date ranges. For example, this is looking for the closest change backward in time, but if it's known that the change would never be more than N days in the past, adding that to the subquery can help performance a lot.

create table CHARGE(Charge_ID string, User_ID int, Inserted_At timestamp_ntz);
insert into charge (Charge_ID,User_ID,Inserted_At) values
('AAAA'     ,1234   ,'2022-01-01 15:00:00'),
('AAAA'     ,1234   ,'2022-02-01 15:00:00'),
('BBBB'     ,5678   ,'2022-01-05 18:00:00'),
('BBBB'     ,5678   ,'2022-02-07 18:00:00')
;

create table PLAN(User_ID int,Plan_ID int ,Inserted_At timestamp_ntz);
insert into PLAN(User_ID,Plan_ID ,Inserted_At)  values
(1234   ,100    ,'2022-01-01 13:00:00'),
(1234   ,099    ,'2022-01-01 14:00:00'),
(1234   ,101    ,'2022-01-18 13:00:00'),
(5678   ,050    ,'2022-01-04 13:00:00'),
(5678   ,051    ,'2022-02-08 13:00:00');

with X as
(
select   CHARGE_ID
        ,USER_ID
        ,INSERTED_AT CHARGE_INSERTED_AT
        ,(select max(P.INSERTED_AT) from PLAN P 
          where C.USER_ID = P.USER_ID
          and C.INSERTED_AT >= P.INSERTED_AT) LAST_INSERTED_AT
from CHARGE C
)
select X.CHARGE_ID, X.USER_ID, X.CHARGE_INSERTED_AT, P.PLAN_ID
from X left join PLAN P 
  on X.USER_ID = P.USER_ID and X.LAST_INSERTED_AT = P.INSERTED_AT

Upvotes: 1

Related Questions