Shekhar
Shekhar

Reputation: 11788

How to find duplicate rows in Hive?

I want to find duplicate rows from one of the Hive table for which I was given two approaches.

First approach is to use following two queries:

select count(*) from mytable; // this will give total row count 

second query is as below which will give count of distinct rows

select count(distinct primary_key1, primary_key2) from mytable;

With this approach, for one of my table total row count derived using first query is 3500 and second query gives row count 2700. So it tells us that 3500 - 2700 = 800 rows are duplicate. But this query doesn't tell which rows are duplicated.

My second approach to find duplicate is:

select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;

Above query should list of rows which are duplicated and how many times particular row is duplicated. but this query shows zero rows which means there are no duplicate rows in that table.

So I would like to know:

  1. If my first approach is correct - if yes then how do I find which rows are duplicated
  2. Why second approach is not providing list of rows which are duplicated?
  3. Is there any other way to find the duplicates?

Upvotes: 13

Views: 47809

Answers (3)

Alex
Alex

Reputation: 741

Hive does not validate primary and foreign key constraints.

Since these constraints are not validated, an upstream system needs to ensure data integrity before it is loaded into Hive.

That means that Hive allows duplicates in Primary Keys.

To solve your issue, you should do something like this:

select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;

This way you will get list of duplicated rows.

Upvotes: 23

Maneesh K Bishnoi
Maneesh K Bishnoi

Reputation: 139

Suppose your want get duplicate rows based on a particular column ID here. Below query will give you all the IDs which are duplicate in table in hive.

SELECT "ID"
FROM TABLE
GROUP BY "ID"
HAVING count(ID) > 1

Upvotes: 1

rabkaman
rabkaman

Reputation: 131

analytic window function row_number() is quite useful and can provide the duplicates based upon the elements specified in the partition by clause. A simply in-line view and exists clause will then pinpoint what corresponding sets of records contain these duplicates from the original table. In some databases (like TD, you can forgo the inline view using a QUALIFY pragma option)

SQL1 & SQL2 can be combined. SQL2: If you want to deal with NULLs and not simply dismiss, then a coalesce and concatenation might be better in the

SELECT count(1) , count(distinct coalesce(keypart1 ,'') + coalesce(keypart2 ,'') )  
  FROM srcTable s

3) Finds all records, not just the > 1 records. This provides all context data as well as the keys so it can be useful when analyzing why you have dups and not just the keys.

select * from  srcTable s
where exists 
    ( select 1 from (
                SELECT  
                      keypart1,
                      keypart2,
                      row_number() over( partition by keypart1, keypart2 )  seq  
                FROM srcTable t 
                WHERE 
                  -- (whatever additional filtering you want) 
                ) t 
                where seq > 1 
                AND t.keypart1 = s.keypart1
                AND t.keypart2 = s.keypart2
    ) 

Upvotes: 1

Related Questions