Reputation: 182
I have a hive table hive_tab containing 3 columns as :
+---------------------+
| date |id | desc |
+---------------------+
|2017-05-31|100| high |
|2017-05-30|202| high |
|2017-05-31|102|medium|
|2017-05-31|102|medium|
|2017-05-31|102| low |
|2017-05-31|101| low |
|2017-05-30|201|medium|
|2017-05-31|100| high |
|2017-05-31|100| low |
|2017-05-31|100| high |
|2017-05-30|200| low |
|2017-05-30|201|medium|
|2017-05-30|201|medium|
|2017-05-30|202| high |
|2017-05-30|201| high |
|2017-05-30|201| low |
|2017-05-30|201| low |
|2017-05-30|202|medium|
+---------------------+
Expected output is :
+----------------------------------+
| date |id | desc | count_desc |
+----------------------------------+
|2017-05-31|100| high | 3 |
|2017-05-31|101| low | 1 |
|2017-05-31|102|medium| 2 |
|2017-05-30|200| low | 1 |
|2017-05-30|201|medium| 3 |
|2017-05-30|202| high | 2 |
+----------------------------------+
Data is : Perday(date) there can be any number of IDs.Each ID will have any number of desc as high,medium,low.
We want the most frequently appearing desc per day per id as mentioned in the expected output.
Already tried with the following query :
select A.date,A.id,A.desc,max(c)
from(
select date,id,desc,count(desc) c
from hive_tab group by date,id,desc)A
group by id,c,date,desc;
But output is not as expected.It's giving all the desc per day per id instead of giving only most frequently appearing desc per day per id.
Any suggestions would be helpful at the earliest.
Thanks
Upvotes: 2
Views: 9567
Reputation: 11090
Remove the desc from the group by clause since you want the max for each group of date and id.
select A.date,A.id,max(c)
from
(
select date,id,desc,count(desc) c
from hive_tab
group by date,id,desc
) A
group by A.date,A.id;
Using row_number() and partition
select * from
(
select
date,
id,
desc,
max(desc) over (partition by date,id,desc) as max_desc,
row_number() over (partition by date,id,desc order by max_desc desc) as row_no
from hive_tab
) res
where res.row_no = 1
Upvotes: 3