Remove successive rows in Hive

Question

What is an efficient way to remove successive rows that have duplicate values in specific fields in Hive? For example:

Input:

 ID field1  field2 date
 1   a       b     2015-01-01
 1   a       b     2015-01-02
 2   e       d     2015-01-03

Output:

 ID field1  field2 date
 1   a       b     2015-01-01
 2   e       d     2015-01-03

Thanks in advance

FuzzyTree · Accepted Answer

One way to remove successive duplicates is to use lag to check the previous id and only keep rows where the previous id is different:

select * from (
    select *,
        lag(id) over (order by date) previous_id
    from mytable
) t
where t.previous_id <> t.id
   or t.previous_id is null -- accounts for the 1st row
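
Because the outer query uses select *, the result also carries the helper column previous_id. A minimal sketch that returns only the original columns, assuming the table is called mytable and has exactly the four columns shown in the question. The backticks around date are an assumption about your setup: DATE may be treated as a reserved keyword depending on Hive version and configuration, so drop them if your installation accepts the bare name.

```sql
-- Sketch only: column list and table name are taken from the question's example.
select id, field1, field2, `date`
from (
    select id, field1, field2, `date`,
           lag(id) over (order by `date`) as previous_id  -- id of the preceding row
    from mytable
) t
where t.previous_id <> t.id
   or t.previous_id is null;  -- keeps the first row, which has no predecessor
```

On the sample input this should leave the 2015-01-01 and 2015-01-03 rows, matching the desired output.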

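If you also need to check field1 and field2, add a separate lag for each field and keep a row when any of the compared fields differs from the previous row:

select * from (
    select *,
        lag(id) over (order by date) previous_id,
        lag(field1) over (order by date) previous_field1,
        lag(field2) over (order by date) previous_field2
    from mytable
) t
where (t.previous_id <> t.id
    or t.previous_field1 <> t.field1
    or t.previous_field2 <> t.field2) -- a row is a duplicate only if all fields match
   or t.previous_id is null -- accounts for the 1st row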
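
One caveat with <> is that a comparison involving NULL evaluates to NULL, so rows whose compared fields contain NULLs can be dropped or kept unexpectedly. A row should count as a duplicate only when every compared field matches the previous row, so it is kept when any field differs. The following is a hedged sketch of a NULL-tolerant variant using Hive's null-safe equality operator <=>. It assumes the same mytable and columns as the question; the rn column is a helper introduced here for illustration, and the backticks around date are an assumption about keyword handling in your Hive version.

```sql
-- Sketch only: rn, previous_field1 and previous_field2 are illustrative helper names.
select id, field1, field2, `date`
from (
    select id, field1, field2, `date`,
           lag(id)      over (order by `date`) as previous_id,
           lag(field1)  over (order by `date`) as previous_field1,
           lag(field2)  over (order by `date`) as previous_field2,
           row_number() over (order by `date`) as rn  -- 1 for the earliest row
    from mytable
) t
where t.rn = 1                                   -- always keep the first row
   or not (t.previous_id     <=> t.id)           -- <=> treats NULL = NULL as true
   or not (t.previous_field1 <=> t.field1)
   or not (t.previous_field2 <=> t.field2);
```

Using rn = 1 instead of previous_id is null keeps the first row even when its id is NULL. If several rows share the same date, the window ordering between them is not deterministic, so adding a tie-breaker column to the order by may be needed.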
