How to get the id of max count group in hive?

Question

I have a table like this:

id , m_id , group_id
 1 ,  a   ,    0 
 1 ,  b   ,    0 
 1 ,  c   ,    1 
 1 ,  d   ,    1 
 2 ,  e   ,    0 
 2 ,  f   ,    0 
 2 ,  g   ,    0
 2 ,  h   ,    1 
 2 ,  i   ,    1

For each id, I would like to get the m_id which they belong to the group that has max number of m_id. If there is a tie, I will just take a random group of m_id. Hence the expected output will be like:

id , m_id 
 1 ,  a    
 1 ,  b    
 2 ,  e    
 2 ,  f    
 2 ,  g

Notice: the number from group_id is only an indicator of group identification under each id. i.e. group_id = 0 does not not mean the same thing between id=1, and id=2.

My original idea is to get the max(group_id) group by (id,m_id), and return the id,m_id which has the max(group_id). However, this approach wont help on the tie situation (id = 2 cases).

Really hope someone can help me on this!

Thanks!

nobody · Accepted Answer

Use row_number() and partition the group by id to get the max grouping.Then self join to get the max grouping for each id,group_id

CREATE TABLE test 
(
  id integer , m_id char(1) , group_id integer
);
INSERT INTO test (id,m_id,group_id) VALUES (1,'a',0);
INSERT INTO test (id,m_id,group_id) VALUES (1,'b',0);
INSERT INTO test (id,m_id,group_id) VALUES (1,'c',1);
INSERT INTO test (id,m_id,group_id) VALUES (1,'d',1);
INSERT INTO test (id,m_id,group_id) VALUES (2,'e',0);
INSERT INTO test (id,m_id,group_id) VALUES (2,'f',0);
INSERT INTO test (id,m_id,group_id) VALUES (2,'g',0);
INSERT INTO test (id,m_id,group_id) VALUES (2,'h',1);
INSERT INTO test (id,m_id,group_id) VALUES (2,'i',1);

select b.id,b.group_id,b.m_id
from (
         select id,group_id,row_number() over(partition by id order by id,group_id,count(*) desc) as r_no
         from test
         group by id,group_id
     ) a
join test b on b.id=a.id and b.group_id=a.group_id
where a.r_no=1

Output

How to get the id of max count group in hive?

Answers (2)

Related Questions