Reputation: 47
I have time-ordered data and want to compute the total duration for each consecutive run of rows that share the same value in the mode column. For each run, I want to keep the first date_and_time and the total duration spent in that mode. A mode can appear several times, and each occurrence should get its own total, with the results ordered in time. How can I do this in Hive? Thank you.
date_and_time mode duration
2017-01-28 00:52:20 mode1 0
2017-01-28 00:52:20 mode1 2000
2017-01-28 00:52:22 mode2 0
2017-01-28 00:52:22 mode2 59000
2017-01-28 00:53:21 mode2 19000
2017-01-28 00:53:21 mode2 0
----------->
date_and_time mode duration
2017-01-28 00:52:20 mode1 2000
2017-01-28 00:52:22 mode2 59000+19000
Upvotes: 0
Views: 158
Reputation: 1269873
Are you just looking for aggregation?
select min(date_and_time) as first_date_and_time, mode, sum(duration) as total_duration
from t
group by mode;
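If a mode can recur and you want to combine only adjacent rows, then this is an example of a gaps-and-islands problem. It is a little tricky because you have duplicate date/time values, but you can use two levels of aggregation and the difference of two row_number() values. The inner aggregation collapses the duplicate timestamps, and the difference between the two row numbers is then constant within each run of adjacent rows that share a mode: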
select mode, min(date_and_time) as first_date_and_time, sum(duration) as total_duration
from (select date_and_time, mode, sum(duration) as duration,
             row_number() over (order by date_and_time) as seqnum,
             row_number() over (partition by mode order by date_and_time) as seqnum_2
      from t
      group by date_and_time, mode
     ) s
group by mode, seqnum - seqnum_2
order by first_date_and_time;
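To see why the difference of the two row numbers identifies each run, it may help to look at the inner query's output on its own. The following is only a sketch: it inlines the sample rows from the question as a CTE named t and adds one hypothetical extra row (mode1 returning at 00:54:00, not in the original data) so that a repeated mode is visible. It assumes a Hive version that supports CTEs and SELECT without FROM.

```sql
-- Sample rows from the question, plus one hypothetical final row
-- (mode1 coming back at 00:54:00) added only to show a repeated mode.
with t as (
  select '2017-01-28 00:52:20' as date_and_time, 'mode1' as mode, 0 as duration
  union all select '2017-01-28 00:52:20', 'mode1', 2000
  union all select '2017-01-28 00:52:22', 'mode2', 0
  union all select '2017-01-28 00:52:22', 'mode2', 59000
  union all select '2017-01-28 00:53:21', 'mode2', 19000
  union all select '2017-01-28 00:53:21', 'mode2', 0
  union all select '2017-01-28 00:54:00', 'mode1', 5000  -- hypothetical row
)
-- First level of aggregation: one row per (date_and_time, mode),
-- numbered overall (seqnum) and within each mode (seqnum_2).
select date_and_time, mode, sum(duration) as duration,
       row_number() over (order by date_and_time) as seqnum,
       row_number() over (partition by mode order by date_and_time) as seqnum_2
from t
group by date_and_time, mode
order by date_and_time;
-- date_and_time        mode   duration  seqnum  seqnum_2  seqnum - seqnum_2
-- 2017-01-28 00:52:20  mode1  2000      1       1         0
-- 2017-01-28 00:52:22  mode2  59000     2       1         1
-- 2017-01-28 00:53:21  mode2  19000     3       2         1
-- 2017-01-28 00:54:00  mode1  5000      4       2         2
```

The two mode2 rows share the difference 1, so the outer group by puts them in one group, while the two mode1 runs get differences 0 and 2 and stay separate.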
Upvotes: 1