Reputation: 1527
Some what related: Filtering a Query based on a Date and Window function in Snowflake
I need to create a query that count the number of occurance of an id
in a -+ 90 days window, similar to this but as a window function, is that possible?
WITH fake_data(id, DATE) as (
SELECT * FROM VALUES
-- this id has visted once
(1, '2022-04-14'::date),
-- this id has visited 3 times
(3, '2022-01-13'::date),
(3, '2022-03-13'::date),
(3, '2022-05-13'::date),
-- this id is a huge vistor
(5, '2022-01-01'::date),
(5, '2022-02-01'::date),
(5, '2022-05-01'::date),
(5, '2022-06-01'::date),
(5, '2022-08-01'::date)
)
select * from (
select
count_if("change" between -90 and 90) over (partition by ID, t1.DATE) "c",
*
from fake_data as t1
left outer join lateral (
select t1.DATE - t.DATE "change", t.DATE "t_DATE"
from fake_data AS t
where t1.id = t.id and t1.DATE - t.DATE between -90 and 9
) as t2
order by ID, t1.DATE, "change"
)
where "change" = 0;
Result (change
and t_DATE
are just included for reference):
c | ID | DATE | change | t_DATE |
---|---|---|---|---|
1 | 1 | 2022-04-14 | 0 | 2022-04-14 |
2 | 3 | 2022-01-13 | 0 | 2022-01-13 |
3 | 3 | 2022-03-13 | 0 | 2022-03-13 |
2 | 3 | 2022-05-13 | 0 | 2022-04-13 |
2 | 5 | 2022-01-01 | 0 | 2022-01-01 |
3 | 5 | 2022-02-01 | 0 | 2022-02-01 |
3 | 5 | 2022-05-01 | 0 | 2022-05-01 |
3 | 5 | 2022-06-01 | 0 | 2022-06-01 |
2 | 5 | 2022-08-01 | 0 | 2022-08-01 |
This is what I like to do but it doesn't seem like the current row's DATE is available (or I can use alias):
select
count_if(DATE - d between -90 and 90) over (partition by id, DATE as d) as "c",
id,
date
from fake_data;
Upvotes: 1
Views: 2410
Reputation: 4428
Snowflake now supports RANGE clause in the Window Function. Example with your sample data:
visitor_data
)VISITOR_ID | DATE_VISITED |
---|---|
1 | 2022-04-14 |
3 | 2022-01-13 |
3 | 2022-03-13 |
3 | 2022-05-13 |
5 | 2022-01-01 |
5 | 2022-02-01 |
5 | 2022-05-01 |
5 | 2022-06-01 |
5 | 2022-08-01 |
select
*
, count(*) over (partition by visitor_id
order by date_visited
range between interval '90 day' preceding and current row
) as count_90_days
from visitor_data;
VISITOR_ID | DATE_VISITED | COUNT_90_DAYS |
---|---|---|
3 | 2022-01-13 | 1 |
3 | 2022-03-13 | 2 |
3 | 2022-05-13 | 2 |
1 | 2022-04-14 | 1 |
5 | 2022-01-01 | 1 |
5 | 2022-02-01 | 2 |
5 | 2022-05-01 | 2 |
5 | 2022-06-01 | 2 |
5 | 2022-08-01 | 2 |
More info: https://qosf.com/RANGE-clause-in-window-functions-practical-examples.html
Upvotes: 1
Reputation: 25938
hmm,
even though it's not what you want, your SQL can be written:
select
t1.DATE - t.DATE as change
,count_if(abs(t1.DATE - t.DATE) <= 90) over (partition by t1.ID, t1.DATE) as c
,t1.*
,t.date as t_date
from fake_data as t1
left join fake_data as t
on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
qualify change = 0
order by t1.ID, t1.DATE, change
But given the join is the same as your count_if, that can also be written:
select
t1.DATE - t.DATE as change
,count(*) over (partition by t1.ID, t1.DATE) as c
,t1.*
,t.date as t_date
from fake_data as t1
left join fake_data as t
on t1.id = t.id and abs(t1.DATE - t.DATE) <= 90
qualify change = 0
order by t1.ID, t1.DATE, change
But given Window functions do not have a "this row" as you wish for the time range, you could work around that by using a Javascript UDTF and for each row build a in memory set, and do the count via that, and then emit that in the final, and then join to that.
At which point you might as well explode the data and do equi-joins in raw SQL, which for +90,-90 days, is likely to still be rather fast for massive data, as compared to self joins
Thus for massive data, this should perform much better:
WITH fake_data(id, DATE) as (
SELECT * FROM VALUES
-- this id has visted once
(1, '2022-04-14'::date),
-- this id has visited 3 times
(3, '2022-01-13'::date),
(3, '2022-03-13'::date),
(3, '2022-05-13'::date),
-- this id is a huge vistor
(5, '2022-01-01'::date),
(5, '2022-02-01'::date),
(5, '2022-05-01'::date),
(5, '2022-06-01'::date),
(5, '2022-08-01'::date)
), range as (
select row_number() over (order by null)-91 as rn
from table(generator(ROWCOUNT => 181))
), exploded as (
select
id,
dateadd('day', e.rn, d.date) as t_date
from fake_data as d
cross join range as e
)
select
f.*
,count(t_date) as c
from fake_data as f
join exploded as e
on f.id = e.id and f.date = t_date
group by f.id, f.date
order by f.id, f.DATE
;
ID | DATE | C |
---|---|---|
1 | 2022-04-14 | 1 |
3 | 2022-01-13 | 2 |
3 | 2022-03-13 | 3 |
3 | 2022-05-13 | 2 |
5 | 2022-01-01 | 2 |
5 | 2022-02-01 | 3 |
5 | 2022-05-01 | 3 |
5 | 2022-06-01 | 3 |
5 | 2022-08-01 | 2 |
Upvotes: 2