Grouping based on multiple columns and finding max value in each group in Hive

Question

I have a Hive table, hive_tab, with three columns:

+---------------------+
|   date   |id | desc |
+---------------------+
|2017-05-31|100| high |
|2017-05-30|202| high |
|2017-05-31|102|medium|
|2017-05-31|102|medium|
|2017-05-31|102|  low |
|2017-05-31|101|  low |
|2017-05-30|201|medium|
|2017-05-31|100| high |
|2017-05-31|100|  low |
|2017-05-31|100| high |
|2017-05-30|200|  low |
|2017-05-30|201|medium|
|2017-05-30|201|medium|
|2017-05-30|202| high |
|2017-05-30|201| high |
|2017-05-30|201|  low |
|2017-05-30|201|  low |
|2017-05-30|202|medium|
+---------------------+

The expected output is:

+----------------------------------+
|   date   |id | desc | count_desc |
+----------------------------------+
|2017-05-31|100| high |     3      |
|2017-05-31|101|  low |     1      |
|2017-05-31|102|medium|     2      |
|2017-05-30|200|  low |     1      |
|2017-05-30|201|medium|     3      |
|2017-05-30|202| high |     2      |
+----------------------------------+

About the data: for each day (date) there can be any number of IDs, and each ID can have any number of desc values (high, medium, or low).

We want the most frequently occurring desc per day per ID, as shown in the expected output.

I have already tried the following query:

select A.date, A.id, A.desc, max(c)
from (
    select date, id, desc, count(desc) c
    from hive_tab
    group by date, id, desc
) A
group by id, c, date, desc;

But the output is not as expected. It returns every desc per day per ID instead of only the most frequently occurring desc per day per ID.
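
One possible direction, offered as a sketch under assumptions rather than a tested solution: the outer query above also groups by desc and c, so every (date, id, desc) row forms its own group and max(c) simply returns c. Ranking the counts within each (date, id) group with a window function and keeping only the top-ranked row avoids that. The sketch assumes a Hive version with windowing support (0.11 or later). The aliases counts, ranked, rn, and count_desc are illustrative names, and date and desc are backticked on the assumption that the Hive version in use treats them as reserved words.

```sql
-- Sketch only: count each (date, id, desc) combination, rank those counts
-- within each (date, id) group, and keep the highest-ranked row.
SELECT `date`, id, `desc`, count_desc
FROM (
    SELECT `date`, id, `desc`, count_desc,
           -- rn = 1 marks the most frequent desc for this (date, id)
           ROW_NUMBER() OVER (PARTITION BY `date`, id
                              ORDER BY count_desc DESC) AS rn
    FROM (
        -- inner aggregation: same counting step as the original subquery
        SELECT `date`, id, `desc`, COUNT(`desc`) AS count_desc
        FROM hive_tab
        GROUP BY `date`, id, `desc`
    ) counts
) ranked
WHERE rn = 1;
```

ROW_NUMBER() keeps exactly one row per (date, id), so a tie between two desc values is broken arbitrarily. Replacing it with RANK() would return all tied desc values instead.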

Any suggestions would be appreciated.

Thanks
