Sub Grouping the results using Apache Spark SQL

Question

I have the following events table, I would like to group them smaller time buckets as specified below.

The table have to be divided into smaller sets where start and end row of the set is determined by geohash if the geohash is same then set keep including the rows until it found next geohash as different.

key time_stamp  geohash
k1  1           abcdfg
k1  5           abcdfg
k1  7           abcdf1
k1  9           abcdfg
k1  10          abcdf2
k1  12          abcdf2
k1  21          abcdf2

How can I produce the following output using Apache Spark SQL syntax

key geohash first_time  last_time   duration    num_events
k1  abcdfg  1           5           4           2
k1  abcdf1  7           7           0           1
k1  abcdfg  9           9           0           1
k1  abcdf2  10          21          11          3

Can someone help me in achieving this.

GMB · Accepted Answer

This is a kind of gaps-and-island problem. Here is one way to solve it using row_number() and aggregation:

select
    key, 
    geohash, 
    min(timestamp) first_time,
    max(timestamp) last_time,
    max(timestamp) - min(timestamp) duration,
    count(*) num_events
from (
    select
        t.*,
        row_number() over(partition by key order by timestamp) rn1,
        row_number() over(partition by key, geohash order by timestamp) rn2
    from mytable t
) t
group by 
    key,
    geohash,
    rn1 - rn2

And, just for the fun of it: you could also do this with a conditional window sum:

select
    key, 
    geohash, 
    min(timestamp) first_time,
    max(timestamp) last_time,
    max(timestamp) - min(timestamp) duration,
    count(*) num_events
from (
    select
        t.*,
        sum(case when lag_geohash = geohash then 0 else 1 end) 
            over(partition by key order by timestamp) grp
    from (
        select
            t.*,
            lag(geohash) over(partition by key order by timestamp) lag_geohash
        from mytable t
    ) t 
) t
group by 
    key,
    geohash,
    grp

Sub Grouping the results using Apache Spark SQL

Answers (1)

Related Questions