Reputation: 22651
Using Spark 1.6.2.
Here the data:
day | visitorID
-------------
1 | A
1 | B
2 | A
2 | C
3 | A
4 | A
I want to count how many distinct visitors by day + cumul with the day before (I dont know the exact term for that, sorry).
This should give:
day | visitors
--------------
1 | 2 (A+B)
2 | 3 (A+B+C)
3 | 3
4 | 3
Upvotes: 6
Views: 2242
Reputation: 1269443
You should be able to do:
select day, max(visitors) as visitors
from (select day,
count(distinct visitorId) over (order by day) as visitors
from t
) d
group by day;
Actually, I think a better approach is to record a visitor only on the first day s/he appears:
select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
from t
group by visitorId
) t
group by startday
order by startday;
Upvotes: 4
Reputation: 49260
In SQL, you could do this.
select t1.day,sum(max(t.cnt)) over(order by t1.day) as visitors
from tbl t1
left join (select minday,count(*) as cnt
from (select visitorID,min(day) as minday
from tbl
group by visitorID
) t
group by minday
) t
on t1.day=t.minday
group by t1.day
min
. Another approach would be
select t1.day,sum(count(t.visitorid)) over(order by t1.day) as cnt
from tbl t1
left join (select visitorID,min(day) as minday
from tbl
group by visitorID
) t
on t1.day=t.minday and t.visitorid=t1.visitorid
group by t1.day
Upvotes: 2
Reputation: 512
Try it's
select
day,
count(*),
(
select count(*) from your_table b
where a.day >= b.day
) cumulative
from your_table as a
group by a.day
order by 1
Upvotes: 0