Reputation: 1030
What is an efficient way to remove successive rows with duplicate values in specific fields in Hive? For example:
Input:
ID field1 field2 date
1 a b 2015-01-01
1 a b 2015-01-02
2 e d 2015-01-03
Output:
ID field1 field2 date
1 a b 2015-01-01
2 e d 2015-01-03
Thanks in advance
Upvotes: 1
Views: 94
Reputation: 32402
One way to remove successive duplicates is to use lag to check the previous id and only keep rows where the previous id is different:
select * from (
select * ,
lag(id) over (order by date) previous_id
from mytable
) t where t.previous_id <> t.id
or t.previous_id is null -- accounts for the 1st row
If you also need to check field1 and field2, then you can add a separate lag expression for each field:
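select * from (
select * ,
lag(id) over (order by date) previous_id,
lag(field1) over (order by date) previous_field1,
lag(field2) over (order by date) previous_field2
from mytable
) t where t.previous_id <> t.id
or t.previous_field1 <> t.field1
or t.previous_field2 <> t.field2 -- keep the row if any field changed
or t.previous_id is null -- accounts for the 1st row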
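A row is a successive duplicate only when every compared column matches the previous row, so the filter should keep a row as soon as at least one column differs. Plain <> comparisons evaluate to NULL when either side is NULL, so rows with NULL in field1 or field2 can be dropped by mistake. A minimal sketch of a NULL-tolerant variant, assuming the table from the question (mytable with id, field1, field2, date), Hive's null-safe equality operator <=>, and a helper column rn that is my own addition for spotting the first row. The backticks around date are a precaution in case your Hive version treats it as a reserved keyword.

```sql
-- Sketch: keep a row unless id, field1 and field2 all match the previous row.
-- Assumes mytable(id, field1, field2, `date`) as in the question; rn is a
-- hypothetical helper column, not part of the original answer.
select id, field1, field2, `date`
from (
  select t.*,
         lag(id)      over (order by `date`) as previous_id,
         lag(field1)  over (order by `date`) as previous_field1,
         lag(field2)  over (order by `date`) as previous_field2,
         row_number() over (order by `date`) as rn
  from mytable t
) s
where rn = 1                              -- first row has no predecessor
   or not (previous_id     <=> id)        -- <=> treats NULL = NULL as true
   or not (previous_field1 <=> field1)
   or not (previous_field2 <=> field2);
```

The first row is detected with rn = 1 rather than previous_id is null, because with null-safe comparison a NULL id in the first row would otherwise look identical to its (missing) predecessor.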
Upvotes: 2