Pooja Nayak
Pooja Nayak

Reputation: 182

Grouping based on multiple columns and finding max value in each group in Hive

I have a hive table hive_tab containing 3 columns as :

+---------------------+
|   date   |id | desc |
+---------------------+
|2017-05-31|100| high |
|2017-05-30|202| high |
|2017-05-31|102|medium|
|2017-05-31|102|medium|
|2017-05-31|102|  low |
|2017-05-31|101|  low |
|2017-05-30|201|medium|
|2017-05-31|100| high |
|2017-05-31|100|  low |
|2017-05-31|100| high |
|2017-05-30|200|  low |
|2017-05-30|201|medium|
|2017-05-30|201|medium|
|2017-05-30|202| high |
|2017-05-30|201| high |
|2017-05-30|201|  low |
|2017-05-30|201|  low |
|2017-05-30|202|medium|
+---------------------+

Expected output is :

+----------------------------------+
|   date   |id | desc | count_desc |
+----------------------------------+
|2017-05-31|100| high |     3      |
|2017-05-31|101|  low |     1      |
|2017-05-31|102|medium|     2      |
|2017-05-30|200|  low |     1      |
|2017-05-30|201|medium|     3      |
|2017-05-30|202| high |     2      |
+----------------------------------+

Data is : Perday(date) there can be any number of IDs.Each ID will have any number of desc as high,medium,low.

We want the most frequently appearing desc per day per id as mentioned in the expected output.

Already tried with the following query :

select A.date,A.id,A.desc,max(c)
from(
select date,id,desc,count(desc) c
from hive_tab group by date,id,desc)A
group by id,c,date,desc;

But output is not as expected.It's giving all the desc per day per id instead of giving only most frequently appearing desc per day per id.

Any suggestions would be helpful at the earliest.

Thanks

Upvotes: 2

Views: 9567

Answers (1)

nobody
nobody

Reputation: 11090

Remove the desc from the group by clause since you want the max for each group of date and id.

select A.date,A.id,max(c)
from
(
   select date,id,desc,count(desc) c
   from hive_tab 
   group by date,id,desc
) A
group by A.date,A.id;

Using row_number() and partition

select * from 
(
   select 
        date,
        id,
        desc,
        max(desc) over (partition by date,id,desc) as max_desc,
        row_number() over (partition by date,id,desc order by max_desc desc) as row_no
   from hive_tab
) res 
where res.row_no = 1

Upvotes: 3

Related Questions