Partha
Partha

Reputation: 570

Distinct in count function used along with group by in Hive

I want find duplicates in a table in Hive as shown below.

ID      name    phone
1       John    602-230-4040
2       Brian   602-230-3030
3       John    602-230-4040
4       Brian   602-230-3030
5       Jeff    602-230-4040

In relational database the easiest way to use count function with group by and having clause. When I used the following query,

select count(name, phone) cnt, name, phone from mytest group by name, phone having cnt>1;

following exception was thrown

FAILED: UDFArgumentException DISTINCT keyword must be specified

Then I used distinct keyword in the query.

select count(distinct name, phone) cnt, name, phone from mytest group by name, phone having cnt>1;

Obviously the query did not return any row because if I use distinct keyword, there will not be any duplicate record in the result.

I am not sure why Hive is making it mandatory to use distinct keyword with count function when used with group by clause.

Can anyone let me know how to find duplicate in Hive table?

Upvotes: 2

Views: 3324

Answers (1)

rchang
rchang

Reputation: 5236

If I understand your use case correctly, you actually want COUNT(*) since you're interested in a pure row count.

SELECT name, phone, COUNT(*) AS cnt FROM mytest GROUP BY name, phone HAVING cnt > 1;

When I used this query with your test data:

hive> SELECT id, name, phone FROM foo;
OK
1   John    602-230-4040
2   Brian   602-230-3030
3   John    602-230-4040
4   Brian   602-230-3030
5   Jeff    602-230-4040
Time taken: 0.32 seconds, Fetched: 5 row(s)
hive> SELECT name, phone, COUNT(*) AS cnt
    > FROM foo GROUP BY name, phone HAVING cnt > 1;
...
... Lots of MapReduce spam
...
Brian       602-230-3030    2
John        602-230-4040    2

Upvotes: 3

Related Questions