Hive Nested Group By

Question

I am trying to identify repeating groups of data and have been having hard time thought it sounds simple.

Following is my Dataset.

ID, TIME, Speed 
1,   10,   0
1,   11,   0
1,   12,   0
1,   12,   1
1,   13,   2
1,   14,   0
1,   15,   0
1,   16,   4

I am trying to accomplish the Min and Max of Repeating Zero speed. In this Case

ID, Min, Max, Value
1 , 10 , 12,    0
1,  14,  15,    0

I have tried Lead / Lag functions but am able to get each individual row. But not sets of zeros. I am trying to get this in Hive.

Any Help is greatly Appreciated.

Gordon Linoff · Accepted Answer

You can do this with a difference of row numbers:

select id, speed, min(time), max(time)
from (select t.*,
             (row_number() over (partition by id order by time) -
              row_number() over (partition by id, speed order by time)
             ) as grp
      from table t
     ) t
where speed = 0
group by id, speed, grp;

The difference of row numbers is constant for when adjacent values are the same.

Hive Nested Group By

Answers (1)

Related Questions