Aggregate data from intervals based on a condition in Hive

Question

I have time-ordered data and want to aggregate it (total duration per interval) by a column whose value changes over time. For each interval I want to keep the first date_and_time and the total duration spent in that mode. A mode can appear several times, and each occurrence should get its own total, ordered in time. How can I do this in Hive? Thank you.

date_and_time       mode    duration       
2017-01-28 00:52:20 mode1   0
2017-01-28 00:52:20 mode1   2000
2017-01-28 00:52:22 mode2   0
2017-01-28 00:52:22 mode2   59000
2017-01-28 00:53:21 mode2   19000
2017-01-28 00:53:21 mode2   0

Desired result:

date_and_time       mode    duration
2017-01-28 00:52:20 mode1   2000
2017-01-28 00:52:22 mode2   59000+19000

Gordon Linoff · Accepted Answer

Are you just looking for aggregation?

select min(date_and_time), mode, sum(duration)
from t
group by mode;

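If a mode can recur and you want to aggregate only runs of adjacent rows, then this is an example of a gaps-and-islands problem. Within a run of the same mode, the overall row number and the per-mode row number both increase by one per row, so their difference stays constant; it changes once another mode has intervened, which makes (mode, difference) a key for each run. This is a little tricky, because you have duplicate date/time values, but you can use two levels of aggregation and the difference of two row_number() values: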

select mode, min(date_and_time), sum(duration)
from (select date_and_time, mode, sum(duration) as duration,
             row_number() over (order by date_and_time) as seqnum,
             row_number() over (partition by mode order by date_and_time) as seqnum_2
      from t
      group by date_and_time, mode
     ) t
group by mode, seqnum - seqnum_2;
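
To see why the row-number difference separates the runs, here is a rough Python trace of the same logic on the sample rows from the question. It is only a sketch for illustration, not Hive code: the function name islands and the in-memory row layout are made up here, and ties on date_and_time are broken by mode name, whereas in SQL the order of tied rows is not guaranteed. It keeps the first timestamp of each run, as the question asks.

```python
from collections import defaultdict

# Sample rows from the question: (date_and_time, mode, duration)
rows = [
    ("2017-01-28 00:52:20", "mode1", 0),
    ("2017-01-28 00:52:20", "mode1", 2000),
    ("2017-01-28 00:52:22", "mode2", 0),
    ("2017-01-28 00:52:22", "mode2", 59000),
    ("2017-01-28 00:53:21", "mode2", 19000),
    ("2017-01-28 00:53:21", "mode2", 0),
]

def islands(rows):
    """Hypothetical helper mirroring the two-level aggregation."""
    # Level 1: collapse duplicate timestamps (group by date_and_time, mode).
    level1 = defaultdict(int)
    for ts, mode, dur in rows:
        level1[(ts, mode)] += dur
    ordered = sorted(level1.items())  # order by date_and_time (ties: by mode)

    per_mode = defaultdict(int)  # running row_number() per mode (seqnum_2)
    runs = {}                    # (mode, seqnum - seqnum_2) -> [first ts, total]
    for seqnum, ((ts, mode), dur) in enumerate(ordered, start=1):
        per_mode[mode] += 1
        key = (mode, seqnum - per_mode[mode])
        if key not in runs:
            runs[key] = [ts, 0]  # rows arrive in time order, so this is the min
        runs[key][1] += dur      # Level 2: sum(duration) per run

    return sorted((ts, mode, total) for (mode, _), (ts, total) in runs.items())

for row in islands(rows):
    print(row)
```

On the sample data the differences are 0 for the mode1 row and 1 for both mode2 rows, so the two mode2 timestamps fall into one run with a total of 78000. If mode1 showed up again later, its difference would no longer be 0, so it would form a separate run.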
