Hive - How to combine a group by over columns A and B and a distinct over column C

Question

I need to create a query which selects from a particular table the users which have more than one different email. To distinguish users, I group them based on two fields: name and age. Let's see this with an example.

So I have a table like this:

name     age    email       phone
----------------------------------
Andy     20     Andy@du     1234
Berni    21     Berni@du    2345
Carol    22     Carol@du    3456
Andy     20     Andy@du     4321
Berni    21     Berni@et    2345
Dody     28     Dodi@du     7869
Carol    22     Carol@pt    3456

What I want to get is:

Berni    21    Berni@du, Berni@et
Carol    22    Carol@du, Carol@pt

Note that Andy is also twice in the database but with same email (what changes is the phone number). Because of this user I need to make a distinc over email, so only users with two different emails are selected.

With this query I am able to solve the issue and I have the desired result.

select * from 
(
    select  aux.name, 
            aux.age, 
            concat_ws(',',collect_set(email)) as email
    FROM
    (select a.name, a.age, a.email
        FROM TestUsers a
        RIGHT JOIN 
        (select  name,
                 age
                FROM TestUsers
                GROUP BY 
                name,
                age
                having count(*) > 1
        )b
    ON  a.name = b.name 
    AND a.age = b.age
    )aux
    GROUP BY aux.name, 
             aux.age
)tr
where locate(",",tr.email) > 0;

But I am sure it has to be a more efficient way than checking when there is not a comma in the email field(which means more than one email).

Has anyone in mind a better approach?

Gordon Linoff · Accepted Answer

If I understand correctly, you should be able to do this using a having clause:

select tu.name, tu.age,
       concat_ws(',', collect_list(tu.email)) as emails
from (select distinct tu.name, tu.age, tu.email
      from TestUsers tu
     ) tu
group by tu.name, tu.age
having count(*) > 1;

Actually, because collect_set() removes duplicates, this should work without a subquery:

select tu.name, tu.age,
       concat_ws(',', collect_set(tu.email)) as emails
from testusers tu
group by tu.name, tu.age
having min(tu.email) <> max(tu.email);

Hive - How to combine a group by over columns A and B and a distinct over column C

Answers (1)

Related Questions