NEO

Reputation: 2001

Selecting the max of a count column with a group by in spark sql

I have the following data:

yr char cnt
1   a   27
1   g   20
3   b   50
3   z   70

I'd like to get, for each year, only the row with the maximum cnt, i.e.:

yr char count
1   a   27
3   z   70

I tried to use a SQL like below:

SELECT yr, char, max(count(cnt)) as count
FROM view
GROUP BY yr,char

But it resulted in an error, because Spark SQL does not allow nesting one aggregate inside another (max around count). How can I get the result shown above?

Upvotes: 1

Views: 1498

Answers (2)

Tomasz Krol

Reputation: 668

This should work:

sql("""
  select a.yr, a.char, a.cnt
  from view a
  join (select yr, max(cnt) as cnt from view group by yr) b
    on a.yr = b.yr and a.cnt = b.cnt
""").show()
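Since a Spark session isn't available here, a minimal sketch checking the join-on-max logic against the question's sample data using SQLite via Python's standard library (the SQL itself is portable; the table is named view_t because VIEW is a reserved word in SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE view_t (yr INTEGER, char TEXT, cnt INTEGER)")
conn.executemany("INSERT INTO view_t VALUES (?, ?, ?)",
                 [(1, "a", 27), (1, "g", 20), (3, "b", 50), (3, "z", 70)])

# Join each row against the per-year maximum of cnt; only rows whose
# cnt equals their year's maximum survive the join.
rows = conn.execute("""
    SELECT a.yr, a.char, a.cnt
    FROM view_t a
    JOIN (SELECT yr, MAX(cnt) AS cnt FROM view_t GROUP BY yr) b
      ON a.yr = b.yr AND a.cnt = b.cnt
    ORDER BY a.yr
""").fetchall()
print(rows)  # [(1, 'a', 27), (3, 'z', 70)]
```

Note that if two chars tie for the maximum cnt within a year, this approach returns both rows.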

Upvotes: 1

Gordon Linoff

Reputation: 1269753

This would often be done using row_number(). Since cnt is already a column in your data, order the window by it directly:

select yr, char, cnt
from (select yr, char, cnt,
             row_number() over (partition by yr order by cnt desc) as seqnum
      from view
     ) yc
where seqnum = 1;

Note: In the event of ties, this returns an arbitrary one of them. If you want all of them, use rank() or dense_rank().
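The same window-function logic can be checked without a Spark cluster; a minimal sketch using SQLite (3.25+, which supports window functions) through Python's standard library, with the table renamed view_t since VIEW is reserved in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE view_t (yr INTEGER, char TEXT, cnt INTEGER)")
conn.executemany("INSERT INTO view_t VALUES (?, ?, ?)",
                 [(1, "a", 27), (1, "g", 20), (3, "b", 50), (3, "z", 70)])

# row_number() ranks rows within each year by cnt descending;
# keeping seqnum = 1 selects the top row per year.
rows = conn.execute("""
    SELECT yr, char, cnt
    FROM (SELECT yr, char, cnt,
                 ROW_NUMBER() OVER (PARTITION BY yr ORDER BY cnt DESC) AS seqnum
          FROM view_t) yc
    WHERE seqnum = 1
    ORDER BY yr
""").fetchall()
print(rows)  # [(1, 'a', 27), (3, 'z', 70)]
```

Unlike the join approach, row_number() breaks ties arbitrarily and returns exactly one row per year; rank() would keep all tied rows.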

Upvotes: 0
