Reputation: 2083
I am trying to learn about deleting duplicate records from a Hive table.
My Hive table: 'dynpart' with columns: Id, Name, Technology
Id Name Technology
1 Abcd Hadoop
2 Efgh Java
3 Ijkl MainFrames
2 Efgh Java
We have options like 'Distinct' to use in a select query, but a select query just retrieves data from the table. Could anyone tell how to use a delete query to remove the duplicate rows from a Hive table.
Sure that it is not recommended or not the standard to Delete/Update records in Hive. But I want to learn how do we do it.
Upvotes: 14
Views: 62703
Reputation: 4701
Just in case when your table has duplicate rows on few or selected columns. Suppose you have a table structure as shown down below:
id Name Technology
1 Abcd Hadoop
2 Efgh Java --> Duplicate
3 Ijkl Mainframe
2 Efgh Python --> Duplicate
Here id & Name columns having duplicate rows. You can use analytical function to get the duplicate row as:
select * from
(select Id,Name,Technology,
row_Number() over (partition By Id,Name order by id desc) as row_num
from yourtable)tab
where row_num > 1;
This will give you output as:
id Name Technology row_num
2 Efgh Python 2
When you need to get both the duplicate rows:
select * from
(select Id,Name,Technology,
count(*) over (partition By Id,Name order by id desc) as duplicate_count
from yourtable)tab
where duplicate_count> 1;
Output as:
id Name Technology duplicate_count
2 Efgh Java 2
2 Efgh Python 2
Upvotes: 11
Reputation: 679
You can use insert overwrite statement to update data
insert overwrite table dynpart select distinct * from dynpart;
Upvotes: 31
Reputation: 591
you can insert distinct records into some other table
create table temp as select distinct * from dynpart
Upvotes: 2