OctavianWR
OctavianWR

Reputation: 227

Rolling distinct count 30 day

I need to turn this event tracker dataset from AWS Athena

timestamp      id     event
1577863551     1      home
1577863555     1      profile
1577863555     2      home

into Monthly Active User or unique user that active from 30-days ago till now. for example

date        MAU
2/1/2020    2000
2/2/2020    2500

2000 MAU means that from 1/3/2020 to 2/1/2020 there is 2000 unique user that active.
2500 MAU means that from 1/4/2020 to 2/2/2020 there is 2500 unique user that active

Upvotes: 0

Views: 688

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269823

This is rather complicated. count(distinct) over would be so much simpler! But here is the idea.

You want to get periods when a user is counted. The idea is to generate a user_inc which is 1 when the user starts being counts and -1 when it stops.

But this is not easy to count. The idea is to add rows for when a user starts being counted and stopped being counted (by adding 31 days) -- with flags of 1 and -1` for these periods. Then a cumulative sum determines if a user had activity on that date. Filtering gets the first and last rows of periods with activity.

So, this looks like:

with t as (
      select id, dte, sum(sum(inc)) over (partition by id order by dte) as running_ins
      from ((select id, date(from_unixtime(timestamp)) as dte, 1 as inc
             from event_tracker
            ) union all
            (select id, date(from_unixtime(timestamp)) + interval '31' day as dte, -1 as inc
             from event_tracker
            )
           ) id
      group by id, dte
     ),
     first_last as (
      select id, dte, (case when running_ins > 0 then 1 else -1 end) as user_inc
      from (select t.*,
                   lag(running_ins) over (partition by id order by dte) as prev_running_ins
            from t
           ) t
      where prev_running_ins is null or
            prev_running_ins = 0 and running_ins > 0 or
            prev_running_ins > 0 and running_ins = 0
     )
select fl.dte,
       sum(sum(user_inc)) over (order by fl.dte) as distinct_30_days
from first_last fl
group by fl.dte;

Here is a db<>fiddle using Postgres.

Upvotes: 2

Related Questions