Fredrik Erlandsson
Fredrik Erlandsson

Reputation: 1527

Use condition in partition by window in SnowFlake

Some what related: Filtering a Query based on a Date and Window function in Snowflake

I need to create a query that count the number of occurance of an id in a -+ 90 days window, similar to this but as a window function, is that possible?

WITH fake_data(id, DATE) as (
    SELECT * FROM VALUES
    -- this id has visted once
    (1, '2022-04-14'::date),
    -- this id has visited 3 times
    (3, '2022-01-13'::date),
    (3, '2022-03-13'::date),
    (3, '2022-05-13'::date),
    -- this id is a huge vistor
    (5, '2022-01-01'::date),
    (5, '2022-02-01'::date),
    (5, '2022-05-01'::date),
    (5, '2022-06-01'::date),
    (5, '2022-08-01'::date)
)
select * from (
    select 
    count_if("change" between -90 and 90) over (partition by ID, t1.DATE) "c",
    *
    from fake_data as t1
    left outer join lateral (
        select t1.DATE - t.DATE "change", t.DATE "t_DATE" 
        from fake_data AS t
        where t1.id = t.id and t1.DATE - t.DATE between -90 and 9
     ) as t2
    order by ID, t1.DATE, "change"
)
where "change" = 0;

Result (change and t_DATE are just included for reference):

c ID DATE change t_DATE
1 1 2022-04-14 0 2022-04-14
2 3 2022-01-13 0 2022-01-13
3 3 2022-03-13 0 2022-03-13
2 3 2022-05-13 0 2022-04-13
2 5 2022-01-01 0 2022-01-01
3 5 2022-02-01 0 2022-02-01
3 5 2022-05-01 0 2022-05-01
3 5 2022-06-01 0 2022-06-01
2 5 2022-08-01 0 2022-08-01

This is what I like to do but it doesn't seem like the current row's DATE is available (or I can use alias):

select
  count_if(DATE - d between -90 and 90) over (partition by id, DATE  as d) as "c",
  id,
  date
from fake_data;

Upvotes: 1

Views: 2410

Answers (2)

Saqib Ali
Saqib Ali

Reputation: 4428

Snowflake now supports RANGE clause in the Window Function. Example with your sample data:

Input Table: Visitor Data (visitor_data)

VISITOR_ID DATE_VISITED
1 2022-04-14
3 2022-01-13
3 2022-03-13
3 2022-05-13
5 2022-01-01
5 2022-02-01
5 2022-05-01
5 2022-06-01
5 2022-08-01

SQL Query

select 
  *
  , count(*) over (partition by visitor_id
      order by date_visited
      range between interval '90 day' preceding and current row
    ) as count_90_days
from visitor_data;

Query Output:

VISITOR_ID DATE_VISITED COUNT_90_DAYS
3 2022-01-13 1
3 2022-03-13 2
3 2022-05-13 2
1 2022-04-14 1
5 2022-01-01 1
5 2022-02-01 2
5 2022-05-01 2
5 2022-06-01 2
5 2022-08-01 2

More info: https://qosf.com/RANGE-clause-in-window-functions-practical-examples.html

Upvotes: 1

Simeon Pilgrim
Simeon Pilgrim

Reputation: 25938

hmm,

even though it's not what you want, your SQL can be written:

select 
    t1.DATE - t.DATE as change
    ,count_if(abs(t1.DATE - t.DATE) <= 90) over (partition by t1.ID, t1.DATE) as c
    ,t1.*
    ,t.date as t_date
from fake_data as t1
left join fake_data as t
    on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
qualify change = 0
order by t1.ID, t1.DATE, change

But given the join is the same as your count_if, that can also be written:

select 
    t1.DATE - t.DATE as change
    ,count(*) over (partition by t1.ID, t1.DATE) as c
    ,t1.*
    ,t.date as t_date
from fake_data as t1
left join fake_data as t
    on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
qualify change = 0
order by t1.ID, t1.DATE, change

But given Window functions do not have a "this row" as you wish for the time range, you could work around that by using a Javascript UDTF and for each row build a in memory set, and do the count via that, and then emit that in the final, and then join to that.

At which point you might as well explode the data and do equi-joins in raw SQL, which for +90,-90 days, is likely to still be rather fast for massive data, as compared to self joins

Thus for massive data, this should perform much better:

WITH fake_data(id, DATE) as (
    SELECT * FROM VALUES
    -- this id has visted once
    (1, '2022-04-14'::date),
    -- this id has visited 3 times
    (3, '2022-01-13'::date),
    (3, '2022-03-13'::date),
    (3, '2022-05-13'::date),
    -- this id is a huge vistor
    (5, '2022-01-01'::date),
    (5, '2022-02-01'::date),
    (5, '2022-05-01'::date),
    (5, '2022-06-01'::date),
    (5, '2022-08-01'::date)
), range as (
    select row_number() over (order by null)-91 as rn
    from table(generator(ROWCOUNT => 181))
), exploded as (
    select
        id, 
        dateadd('day', e.rn, d.date) as t_date
    from fake_data as d
    cross join range as e
)
select
    f.*
    ,count(t_date) as c
from fake_data as f
join exploded as e
    on f.id = e.id and f.date = t_date
group by f.id, f.date
order by f.id, f.DATE
;
ID DATE C
1 2022-04-14 1
3 2022-01-13 2
3 2022-03-13 3
3 2022-05-13 2
5 2022-01-01 2
5 2022-02-01 3
5 2022-05-01 3
5 2022-06-01 3
5 2022-08-01 2

Upvotes: 2

Related Questions