Reputation: 22701

Cumulative distinct count with Spark SQL

Using Spark 1.6.2.

Here the data:

day | visitorID
-------------
1   | A
1   | B
2   | A
2   | C
3   | A
4   | A

I want to count how many distinct visitors by day + cumul with the day before (I dont know the exact term for that, sorry).

This should give:

day | visitors
--------------
 1  | 2 (A+B)
 2  | 3 (A+B+C)
 3  | 3 
 4  | 3

Tried self-join but really too slow
I am sure windowed function is what I am looking for but didnt manage to find it :/

Upvotes: 6

Answers (3)

Gordon Linoff

Reputation: 1270993

You should be able to do:

select day, max(visitors) as visitors
from (select day,
             count(distinct visitorId) over (order by day) as visitors
      from t
     ) d
group by day;

Actually, I think a better approach is to record a visitor only on the first day s/he appears:

select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
      from t
      group by visitorId
     ) t
group by startday
order by startday;

Upvotes: 4

Vamsi Prabhala

Reputation: 49270

In SQL, you could do this.

select t1.day,sum(max(t.cnt)) over(order by t1.day) as visitors
from tbl t1
left join (select minday,count(*) as cnt 
           from (select visitorID,min(day) as minday 
                 from tbl 
                 group by visitorID
                ) t 
           group by minday
          ) t 
on t1.day=t.minday
group by t1.day

Get the first day a visitorID appears using min.
Count the rows per such minday found above.
Left join this to your original table and get the cumulative sum.

Another approach would be

select t1.day,sum(count(t.visitorid)) over(order by t1.day) as cnt 
from tbl t1
left join (select visitorID,min(day) as minday 
           from tbl 
           group by visitorID
          ) t 
on t1.day=t.minday and t.visitorid=t1.visitorid
group by t1.day

Upvotes: 2

Fabiano Carvalho

Reputation: 512

Try it's

select
    day,
    count(*),
    (
    select count(*) from your_table b
    where a.day >= b.day
    ) cumulative
from your_table as a
group by a.day
order by 1

Upvotes: 0

Cumulative distinct count with Spark SQL

Answers (3)

Related Questions