Koba
Koba

Reputation: 1544

Running count distinct

I am trying to see how the cumulative number of subscribers changed over time based on unique email addresses and date they were created. Below is an example of a table I am working with.

enter image description here

I am trying to turn it into the table below. Email [email protected] was created twice and I would like to count it once. I cannot figure out how to generate the Running count distinct column.

enter image description here

Thanks for the help.

Upvotes: 2

Views: 5136

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1269503

I would usually do this using row_number():

select date, count(*),
       sum(count(*)) over (order by date),
       sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by date)
from (select t.*,
             row_number() over (partition by email order by date) as seqnum
      from t
     ) t
group by date
order by date;

This is similar to the version using lag(). However, I get nervous using lag if the same email appears multiple times on the same date.

Upvotes: 3

GMB
GMB

Reputation: 222402

Here is a solution that does not uses sum over, neither lag... And does produces the correct results.

Hence it could appear as simpler to read and to maintain.

select 
    t1.date_created,
    (select count(*) from my_table where date_created = t1.date_created) emails_created,
    (select count(*) from my_table where date_created <= t1.date_created) cumulative_sum,
    (select count( distinct email)  from my_table where date_created <= t1.date_created) running_count_distinct
from 
    (select distinct date_created from my_table) t1    
order by 1

Upvotes: 0

Vamsi Prabhala
Vamsi Prabhala

Reputation: 49260

Getting the total count and cumulative count is straight forward. To get the cumulative distinct count, use lag to check if the email had a row with a previous date, and set the flag to 0 so it would be ignored during a running sum.

select distinct dt
      ,count(*) over(partition by dt) as day_total
      ,count(*) over(order by dt) as cumsum
      ,sum(flag) over(order by dt) as cumdist
from (select t.*
            ,case when lag(dt) over(partition by email order by dt) is not null then 0 else 1 end as flag   
      from tbl t
     ) t    

DEMO HERE

Upvotes: 0

Related Questions