Reputation: 25
I am trying to find the daily count of frequent visitors from a very large data-set. Frequent visitors in this case are visitor IDs used on 2 distinct days in a rolling 3 day period.
My data set looks like the below:
ID | Date | Location | State | Brand |
1 | 2020-01-02 | A | CA | XYZ |
1 | 2020-01-03 | A | CA | BCA |
1 | 2020-01-04 | A | CA | XYZ |
1 | 2020-01-06 | A | CA | YQR |
1 | 2020-01-06 | A | WA | XYZ |
2 | 2020-01-02 | A | CA | XYZ |
2 | 2020-01-05 | A | CA | XYZ |
This is the result I am going for. The count in the visits column is equal to the count of distinct days from the date column, -2 days for each ID. So for ID 1 on 2020-01-05, there was a visit on the 3rd and 4th, so the count is 2.
Date | ID | Visits | Frequent Prior 3 Days
2020-01-01 |Null| Null | Null
2020-01-02 | 1 | 1 | No
2020-01-02 | 2 | 1 | No
2020-01-03 | 1 | 2 | Yes
2020-01-03 | 2 | 1 | No
2020-01-04 | 1 | 3 | Yes
2020-01-04 | 2 | 1 | No
2020-01-05 | 1 | 2 | Yes
2020-01-05 | 2 | 1 | No
2020-01-06 | 1 | 2 | Yes
2020-01-06 | 2 | 1 | No
2020-01-07 | 1 | 1 | No
2020-01-07 | 2 | 1 | No
2020-01-08 | 1 | 1 | No
2020-01-09 | 1 | null | Null
I originally tried to use the following line to get the result for the visits column, but end up with 3 in every successive row at whichever date it first got to 3 for that ID.
,
count(ID) over (Partition by ID order by Date ASC rows between 3 preceding and current row) as visits
I've scoured the forum, but every somewhat similar question seems to involve counting the values rather than the dates and haven't been able to figure out how to tweak to get what I need. Any help is much appreciated.
Upvotes: 2
Views: 671
Reputation: 1270653
I would be inclined to approach this by expanding out the days and visitors using a cross join
and then just window functions. Assuming you have all dates in the data:
select i.id, d.date,
count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) as cnt_visits,
(case when count(t.id) over (partition by i.id
order by d.date
rows between 2 preceding and current row
) >= 2
then 'Yes' else 'No'
end) as is_frequent_visitor
from (select distinct id from t) i cross join
(select distinct date from t) d left join
(select distinct id, date from t) t
on t.date = d.date and
t.id = i.id;
Upvotes: 1
Reputation: 222622
You can aggregate the dataset by user and date, then use window functions with a range frame to look at the three preceding rows.
You did not tell which database you are running - and not all databases support the window ranges, nor have the same syntax for literal intervals. In standard SQL, you would go:
select
id,
date,
count(*) cnt_visits
case
when sum(count(*)) over(
partition by id
order by date
range between interval '3' day preceding and current row
) >= 2
then 'Yes'
else 'No'
end is_frequent_visitor
from mytable
group by id, date
On the other hand, if you want a record for every user and every day (event when there is no visit), then it is a bit different. You can generate the dataset first, then bring the table with a left join
:
select
i.id,
d.date,
count(t.id) cnt_visits,
case
when sum(count(t.id)) over(
partition by i.id
order by d.date
rows between '3' day preceding and current row
) >= 2
then 'Yes'
else 'No'
end is_frequent_visitor
from (select distinct id from mytable) i
cross join (select distinct date from mytable) d
left join mytable t
on t.date = d.date
and t.id = i.id
group by i.id, d.date
Upvotes: 1