Reputation: 570
I want find duplicates in a table in Hive as shown below.
ID name phone
1 John 602-230-4040
2 Brian 602-230-3030
3 John 602-230-4040
4 Brian 602-230-3030
5 Jeff 602-230-4040
In relational database the easiest way to use count function with group by and having clause. When I used the following query,
select count(name, phone) cnt, name, phone from mytest group by name, phone having cnt>1;
following exception was thrown
FAILED: UDFArgumentException DISTINCT keyword must be specified
Then I used distinct keyword in the query.
select count(distinct name, phone) cnt, name, phone from mytest group by name, phone having cnt>1;
Obviously the query did not return any row because if I use distinct keyword, there will not be any duplicate record in the result.
I am not sure why Hive is making it mandatory to use distinct keyword with count function when used with group by clause.
Can anyone let me know how to find duplicate in Hive table?
Upvotes: 2
Views: 3324
Reputation: 5236
If I understand your use case correctly, you actually want COUNT(*)
since you're interested in a pure row count.
SELECT name, phone, COUNT(*) AS cnt FROM mytest GROUP BY name, phone HAVING cnt > 1;
When I used this query with your test data:
hive> SELECT id, name, phone FROM foo;
OK
1 John 602-230-4040
2 Brian 602-230-3030
3 John 602-230-4040
4 Brian 602-230-3030
5 Jeff 602-230-4040
Time taken: 0.32 seconds, Fetched: 5 row(s)
hive> SELECT name, phone, COUNT(*) AS cnt
> FROM foo GROUP BY name, phone HAVING cnt > 1;
...
... Lots of MapReduce spam
...
Brian 602-230-3030 2
John 602-230-4040 2
Upvotes: 3