Reputation: 841
I'm hoping to find duplicates across grouped IDs based on item, such as the following:
Hoping to derive the Dup variable based on duplicates of Item by ID groups.In other words, if the same item is found in another ID then Dup is 'yes'. If the Item only repeats within the same ID, then Dup is 'no'.
Table1:
+----+--------+-----+
| ID | Item | Dup |
+----+--------+-----+
| a | apple | yes |
| a | apple | yes |
| b | apple | yes |
| c | orange | no |
| c | orange | no |
| d | pear | yes |
| f | pear | yes |
| e | mango | no |
+----+--------+-----+
I'm not sure if to start off by de-duplicating the dataset by ID and Item will make it easier.
Finalized dataset for summary (de-duped) :
Table2
+----+--------+-----+
| ID | Item | Dup |
+----+--------+-----+
| a | apple | yes |
| b | apple | yes |
| c | orange | no |
| d | pear | yes |
| f | pear | yes |
| e | mango | no |
+----+--------+-----+
How should I construct a hive query to detect the duplicates? Any help will be appreciative!
Upvotes: 0
Views: 47
Reputation: 1269503
Hmmm . . . if I understand correctly, you can use aggregation and a window function:
select id, item,
(case when count(*) over (partition by item) > 1 then 'yes' else 'no' end) as dups
from t
group by id, item;
Upvotes: 1