Reputation: 227
I need to turn this event tracker dataset from AWS Athena
timestamp id event
1577863551 1 home
1577863555 1 profile
1577863555 2 home
into Monthly Active User or unique user that active from 30-days ago till now. for example
date MAU
2/1/2020 2000
2/2/2020 2500
2000 MAU means that from 1/3/2020 to 2/1/2020 there is 2000 unique user that active.
2500 MAU means that from 1/4/2020 to 2/2/2020 there is 2500 unique user that active
Upvotes: 0
Views: 688
Reputation: 1269823
This is rather complicated. count(distinct) over
would be so much simpler! But here is the idea.
You want to get periods when a user is counted. The idea is to generate a user_inc
which is 1
when the user starts being counts and -1
when it stops.
But this is not easy to count. The idea is to add rows for when a user starts being counted and stopped being counted (by adding 31 days) -- with flags of 1
and -1` for these periods. Then a cumulative sum determines if a user had activity on that date. Filtering gets the first and last rows of periods with activity.
So, this looks like:
with t as (
select id, dte, sum(sum(inc)) over (partition by id order by dte) as running_ins
from ((select id, date(from_unixtime(timestamp)) as dte, 1 as inc
from event_tracker
) union all
(select id, date(from_unixtime(timestamp)) + interval '31' day as dte, -1 as inc
from event_tracker
)
) id
group by id, dte
),
first_last as (
select id, dte, (case when running_ins > 0 then 1 else -1 end) as user_inc
from (select t.*,
lag(running_ins) over (partition by id order by dte) as prev_running_ins
from t
) t
where prev_running_ins is null or
prev_running_ins = 0 and running_ins > 0 or
prev_running_ins > 0 and running_ins = 0
)
select fl.dte,
sum(sum(user_inc)) over (order by fl.dte) as distinct_30_days
from first_last fl
group by fl.dte;
Here is a db<>fiddle using Postgres.
Upvotes: 2