Reputation: 1900
I'm quite new to Spark SQL and struggle to combine operations properly. What I want is a bit tricky:
From these values:
##########
# events #
##########
# 'M' #
# 'M' #
# 'F' #
# NULL #
##########
I'm looking for a query that can store the distribution into a map like:
##################
# distribution #
##################
#{'M': 2, 'F': 1}#
##################
Notice that I don't want NULL values.
I tried several SQL queries that don't work, such as:
Making a fake list:
SELECT collect_list(concat(col,"\",\"",cnt))
FROM (SELECT col, count(col) AS cnt
      FROM VALUES (NULL), ("M"), ("M"), ("F") AS tab(col)
      GROUP BY col);
=> ["F","1","M","2"]
Then passing it to the map function:
SELECT map(mp)
FROM (SELECT collect_list(concat(col,"\",\"",cnt)) AS mp
      FROM (SELECT col, count(col) AS cnt
            FROM VALUES (NULL), ("M"), ("M"), ("F") AS tab(col)
            GROUP BY col));
=> data type mismatch
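(Editor's note on the error, my reading rather than part of the original post: Spark's `map()` builds a map from an even number of alternating key/value arguments, not from a single array, so passing the `collect_list` result as one argument cannot type-check. A minimal illustration:)

```sql
-- map() expects alternating key/value arguments:
SELECT map('M', 2, 'F', 1);
-- => [M -> 2, F -> 1]

-- Passing a single array (which is what collect_list returns) does not
-- match that signature, hence the "data type mismatch" error:
-- SELECT map(array('M', '2', 'F', '1'));
```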
Dear community, your help would be welcome!
PS: I want to reproduce the histogram function of AWS Presto.
Upvotes: 1
Views: 261
Reputation: 42402
You can create the map using group by and map_from_entries:
select
map_from_entries(collect_list((events, c))) as dist
from (
select
events,
count(1) as c
from df
where events is not null
group by events
);
+----------------+
| dist|
+----------------+
|[F -> 1, M -> 2]|
+----------------+
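(A quick usage sketch, assuming the same `df` table and the `dist` alias from the query above: individual counts can be read back from the map with bracket access, and a key that is absent returns NULL.)

```sql
-- Wrap the aggregation as a subquery and look up keys in the map:
select
  dist['M'] as m_count,  -- 2
  dist['F'] as f_count,  -- 1
  dist['X'] as x_count   -- NULL: key not present in the map
from (
  select map_from_entries(collect_list((events, c))) as dist
  from (
    select events, count(1) as c
    from df
    where events is not null
    group by events
  )
);
```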
Upvotes: 1