Stuck
Stuck

Reputation: 12292

multiple windowed counts with distinct user_id

Given an event table, I want to find the count of distinct user_ids for a specific event type in four different windows:

I came up with this idea of a query (non-working), which probably shows best what I try to accomplish:

select
  d as "Date",
  count(distinct user_id) over (
    order by d
    rows between 365 preceding and current row
  ) "Yearly Active",
  count(distinct user_id) over (
    order by d
    rows between 30 preceding and current row
  ) "Monthly Active",
  count(distinct user_id) over (
    order by d
    rows between 7 preceding and current row
  ) "Weekly Active",
  count(distinct user_id) as "Daily Active"
from (
  select
    date(to_timestamp(event_time/1000) at time zone 'Europe/Berlin') d,
    user_id
  from event_entity
  where type = 'REFRESH_TOKEN'
  group by 1, user_id
  order by d
) daily_users

But we cannot use distinct in window functions. What is a better approach to the query?

Upvotes: 0

Views: 114

Answers (2)

cogitoergosum
cogitoergosum

Reputation: 661

The best approach to this query is not using window functions, you can get the data you need just by using non-windowed count and case when statements. First get all distinct date and user_id values in a CTE and all distinct dates in another CTE

WITH daily_users AS(
    select distinct
    date(to_timestamp(event_time/1000) at time zone 'Europe/Berlin') d,
    user_id
    from event_entity
    where type = 'REFRESH_TOKEN'
),
all_dates AS(
   select distinct
   date(to_timestamp(event_time/1000) at time zone 'Europe/Berlin') d
   from event_entity
   where type = 'REFRESH_TOKEN'
)

Join the dates with the daily_users cte, then use the a CASE statement inside the COUNT, replacing user_id with NULL for all dates outside the relevant range:

select all_dates.d as "date",
    --daily
    count(distinct case when all_dates.d = daily_users.d then daily_users.user_id else null end) as daily_active,
    --weekly
    count(distinct case when all_dates.d - daily_users.d < 7 then daily_users.user_id else null end) as weekly_active,
    --30 days
    count(distinct case when all_dates.d - daily_users.d < 30 then daily_users.user_id else null end) as monthly_active,
    --yearly, just count distinct user_id because the 365 day condition is already provided by the join
    count(distinct daily_users.user_id) as yearly_active
from all_dates
inner join daily_users on all_dates.d >= daily_users.d and (all_dates.d - daily_users.d < 365) --filter only rows between the date in all_dates and one year before
group by all_dates.d

Some notes:

  • count ignores null values, that's why the case expressions set user_id to null in all rows outside the relevant period
  • in postgres, subtracting two dates returns an integer equal to the number of days between the two dates

Upvotes: 1

Pooya
Pooya

Reputation: 3183

Edited

If you want to use the windows function, you can use a trick. You can use the bellow formula to remove duplicate user_id:

abs(
  dense_rank() over (order by d, user_id)
  - dense_rank() over (order by d, user_id DESC)
  + 1
)

But I don't know the result of the query it is the same you expected.

Please check this structure. Is this query has resulted in what you expect?

select distinct on (d)
  d,
  ----
  abs(
    dense_rank() over asc_total_preceding
    - dense_rank() over desc_total_preceding
  ) + 1 as "Daily Active",
  ----
  abs(
    dense_rank() over asc_7_preceding
    - dense_rank() over desc_7_preceding
  ) + 1 as "Weekly Active",
  ----
  abs(
    dense_rank() over asc_30_preceding
    - dense_rank() over desc_30_preceding
  ) + 1 as "Monthly Active",
 ----
  abs(
    dense_rank() over asc_365_preceding
    - dense_rank() over desc_365_preceding
  ) + 1 as "Yearly Active"
from (
  select
    date(to_timestamp(event_time/1000) at time zone 'Europe/Berlin') d,
    user_id
  from event_entity
  where type = 'REFRESH_TOKEN'
  group by 1, user_id
  order by d
) daily_users
window
  asc_total_preceding as (order by d, user_id),
  desc_total_preceding as (order by d, user_id),
  asc_7_preceding as (order by d, user_id rows between 7 preceding and current row),
  desc_7_preceding as (order by d, user_id desc rows between 7 preceding and current row),
  asc_30_preceding as (order by d, user_id rows between 30 preceding and current row),
  desc_30_preceding as (order by d, user_id desc rows between 30 preceding and current row),
  asc_365_preceding as (order by d, user_id rows between 365 preceding and current row),
  desc_365_preceding as (order by d, user_id desc rows between 365 preceding and current row)

P.S (Old query)

In my opinion, better to change the query structure and use group by and filter to calculate the count of users on each date

select
  d as "Date",
  count(distinct user_id) filter (
    where d between current_date - 365 and current_date
  ) as "Yearly Active",
  count(distinct user_id) filter (
    where d between current_date - 30 and current_date
  ) as "Monthly Active",
  count(distinct user_id) filter (
    where d between current_date - 7 and current_date
  ) as "Weekly Active",
  count(distinct user_id) filter (
    where d = current_date
  ) as "Daily Active"
from (
  select
    date(to_timestamp(event_time/1000) at time zone 'Europe/Berlin') d,
    user_id
  from event_entity
  where type = 'REFRESH_TOKEN'
  group by 1, user_id
  order by d
) daily_users
group by d

Upvotes: 0

Related Questions