Thomas Decaux
Thomas Decaux

Reputation: 22651

Cumulative distinct count with Spark SQL

Using Spark 1.6.2.

Here the data:

day | visitorID
-------------
1   | A
1   | B
2   | A
2   | C
3   | A
4   | A

I want to count how many distinct visitors by day + cumul with the day before (I dont know the exact term for that, sorry).

This should give:

day | visitors
--------------
 1  | 2 (A+B)
 2  | 3 (A+B+C)
 3  | 3 
 4  | 3

Upvotes: 6

Views: 2242

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1269443

You should be able to do:

select day, max(visitors) as visitors
from (select day,
             count(distinct visitorId) over (order by day) as visitors
      from t
     ) d
group by day;

Actually, I think a better approach is to record a visitor only on the first day s/he appears:

select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
      from t
      group by visitorId
     ) t
group by startday
order by startday;

Upvotes: 4

Vamsi Prabhala
Vamsi Prabhala

Reputation: 49260

In SQL, you could do this.

select t1.day,sum(max(t.cnt)) over(order by t1.day) as visitors
from tbl t1
left join (select minday,count(*) as cnt 
           from (select visitorID,min(day) as minday 
                 from tbl 
                 group by visitorID
                ) t 
           group by minday
          ) t 
on t1.day=t.minday
group by t1.day
  • Get the first day a visitorID appears using min.
  • Count the rows per such minday found above.
  • Left join this to your original table and get the cumulative sum.

Another approach would be

select t1.day,sum(count(t.visitorid)) over(order by t1.day) as cnt 
from tbl t1
left join (select visitorID,min(day) as minday 
           from tbl 
           group by visitorID
          ) t 
on t1.day=t.minday and t.visitorid=t1.visitorid
group by t1.day

Upvotes: 2

Fabiano Carvalho
Fabiano Carvalho

Reputation: 512

Try it's

select
    day,
    count(*),
    (
    select count(*) from your_table b
    where a.day >= b.day
    ) cumulative
from your_table as a
group by a.day
order by 1 

Upvotes: 0

Related Questions