Reputation: 116
I have the following events table and would like to group its rows into smaller time buckets, as specified below.
The table has to be divided into smaller sets, where the start and end rows of each set are determined by the geohash: as long as the geohash stays the same, the set keeps including rows; a set ends when the next row has a different geohash.
key  time_stamp  geohash
k1   1           abcdfg
k1   5           abcdfg
k1   7           abcdf1
k1   9           abcdfg
k1   10          abcdf2
k1   12          abcdf2
k1   21          abcdf2
How can I produce the following output using Apache Spark SQL syntax?
key  geohash  first_time  last_time  duration  num_events
k1   abcdfg   1           5          4         2
k1   abcdf1   7           7          0         1
k1   abcdfg   9           9          0         1
k1   abcdf2   10          21         11        3
Can someone help me achieve this?
Upvotes: 0
Views: 371
Reputation: 222672
This is a kind of gaps-and-islands problem. Here is one way to solve it using row_number() and aggregation: the difference between a row number computed over the whole partition and a row number computed per (key, geohash) is constant within each run of identical geohashes, so it can serve as a grouping key.
select
    key,
    geohash,
    min(time_stamp) first_time,
    max(time_stamp) last_time,
    max(time_stamp) - min(time_stamp) duration,
    count(*) num_events
from (
    select
        t.*,
        row_number() over(partition by key order by time_stamp) rn1,
        row_number() over(partition by key, geohash order by time_stamp) rn2
    from mytable t
) t
group by
    key,
    geohash,
    rn1 - rn2
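If you want to sanity-check the row-number-difference trick before running it on a cluster, here is a small sketch that executes essentially the same query against SQLite, whose window functions behave the same way here; the table and column names follow the question, and everything else (the in-memory database, the sample rows) is just scaffolding for the check.

```python
import sqlite3

# Build an in-memory table matching the question's sample data.
con = sqlite3.connect(":memory:")
con.execute("create table mytable (key text, time_stamp int, geohash text)")
rows = [("k1", 1, "abcdfg"), ("k1", 5, "abcdfg"), ("k1", 7, "abcdf1"),
        ("k1", 9, "abcdfg"), ("k1", 10, "abcdf2"), ("k1", 12, "abcdf2"),
        ("k1", 21, "abcdf2")]
con.executemany("insert into mytable values (?, ?, ?)", rows)

# rn1 numbers all rows of a key by time; rn2 restarts per (key, geohash).
# Their difference is constant within each consecutive run of one geohash.
sql = """
select key, geohash,
       min(time_stamp) first_time,
       max(time_stamp) last_time,
       max(time_stamp) - min(time_stamp) duration,
       count(*) num_events
from (
    select t.*,
           row_number() over(partition by key order by time_stamp) rn1,
           row_number() over(partition by key, geohash order by time_stamp) rn2
    from mytable t
) t
group by key, geohash, rn1 - rn2
order by first_time
"""
result = con.execute(sql).fetchall()
# result matches the four islands asked for in the question
```

Walking the data by hand: rn1 - rn2 evaluates to 0, 0, 2, 1, 4, 4, 4 for the seven rows in time order, which partitions them into exactly the four groups shown in the expected output.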
And, just for the fun of it: you could also do this with a conditional window sum:
select
    key,
    geohash,
    min(time_stamp) first_time,
    max(time_stamp) last_time,
    max(time_stamp) - min(time_stamp) duration,
    count(*) num_events
from (
    select
        t.*,
        sum(case when lag_geohash = geohash then 0 else 1 end)
            over(partition by key order by time_stamp) grp
    from (
        select
            t.*,
            lag(geohash) over(partition by key order by time_stamp) lag_geohash
        from mytable t
    ) t
) t
group by
    key,
    geohash,
    grp
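The same SQLite-based sketch can verify this variant as well; again the table and column names come from the question, and note that the first row of each key works because lag() returns null there, and null = geohash is not true, so the case expression starts a new group.

```python
import sqlite3

# Same sample data as in the question.
con = sqlite3.connect(":memory:")
con.execute("create table mytable (key text, time_stamp int, geohash text)")
rows = [("k1", 1, "abcdfg"), ("k1", 5, "abcdfg"), ("k1", 7, "abcdf1"),
        ("k1", 9, "abcdfg"), ("k1", 10, "abcdf2"), ("k1", 12, "abcdf2"),
        ("k1", 21, "abcdf2")]
con.executemany("insert into mytable values (?, ?, ?)", rows)

# lag() fetches the previous geohash; the running sum increments by 1
# each time the geohash changes, yielding one group id per island.
sql = """
select key, geohash,
       min(time_stamp) first_time,
       max(time_stamp) last_time,
       max(time_stamp) - min(time_stamp) duration,
       count(*) num_events
from (
    select t.*,
           sum(case when lag_geohash = geohash then 0 else 1 end)
               over(partition by key order by time_stamp) grp
    from (
        select t.*,
               lag(geohash) over(partition by key order by time_stamp) lag_geohash
        from mytable t
    ) t
) t
group by key, geohash, grp
order by first_time
"""
result = con.execute(sql).fetchall()
# result agrees with the row_number() version: the same four islands
```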
Upvotes: 5