lydias
lydias

Reputation: 841

Hive SQL: How to identify duplicates across groups

I'm hoping to find duplicates across grouped IDs based on item, such as the following:

Hoping to derive the Dup variable based on duplicates of Item by ID groups.In other words, if the same item is found in another ID then Dup is 'yes'. If the Item only repeats within the same ID, then Dup is 'no'.

Table1:

+----+--------+-----+
| ID |  Item  | Dup |
+----+--------+-----+
| a  | apple  | yes |
| a  | apple  | yes |
| b  | apple  | yes |
| c  | orange | no  |
| c  | orange | no  |
| d  | pear   | yes |
| f  | pear   | yes |
| e  | mango  | no  |
+----+--------+-----+

I'm not sure if to start off by de-duplicating the dataset by ID and Item will make it easier.

Finalized dataset for summary (de-duped) :

Table2

+----+--------+-----+
| ID |  Item  | Dup |
+----+--------+-----+
| a  | apple  | yes |
| b  | apple  | yes |
| c  | orange | no  |
| d  | pear   | yes |
| f  | pear   | yes |
| e  | mango  | no  |
+----+--------+-----+

How should I construct a hive query to detect the duplicates? Any help will be appreciative!

Upvotes: 0

Views: 47

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1269503

Hmmm . . . if I understand correctly, you can use aggregation and a window function:

select id, item,
       (case when count(*) over (partition by item) > 1 then 'yes' else 'no' end) as dups
from t
group by id, item;

Upvotes: 1

Related Questions