Reputation: 2083

How to delete duplicate records from Hive table?

I am trying to learn about deleting duplicate records from a Hive table.

My Hive table: 'dynpart' with columns: Id, Name, Technology

Id  Name  Technology
1   Abcd  Hadoop
2   Efgh  Java
3   Ijkl  MainFrames
2   Efgh  Java

We have options like 'Distinct' to use in a select query, but a select query just retrieves data from the table. Could anyone tell how to use a delete query to remove the duplicate rows from a Hive table.

Sure that it is not recommended or not the standard to Delete/Update records in Hive. But I want to learn how do we do it.

Upvotes: 14

Answers (3)

vikrant rana

Reputation: 4701

Just in case when your table has duplicate rows on few or selected columns. Suppose you have a table structure as shown down below:

id  Name    Technology
1   Abcd    Hadoop
2   Efgh    Java       --> Duplicate
3   Ijkl    Mainframe
2   Efgh    Python     --> Duplicate

Here id & Name columns having duplicate rows. You can use analytical function to get the duplicate row as:

select * from
(select Id,Name,Technology,
row_Number() over (partition By Id,Name order by id desc) as row_num
from yourtable)tab
where row_num > 1;

This will give you output as:

id  Name    Technology  row_num
2   Efgh    Python           2

When you need to get both the duplicate rows:

select * from
(select Id,Name,Technology,
count(*) over (partition By Id,Name order by id desc) as duplicate_count
from yourtable)tab
where duplicate_count> 1;

Output as:

id  Name    Technology  duplicate_count
2   Efgh    Java             2
2   Efgh    Python           2

Upvotes: 11

fi11er

Reputation: 679

You can use insert overwrite statement to update data

insert overwrite table dynpart select distinct * from dynpart;

Upvotes: 31

Shalaj

Reputation: 591

you can insert distinct records into some other table

create table temp as select distinct * from dynpart

Upvotes: 2

How to delete duplicate records from Hive table?

Answers (3)

Related Questions