Jérémy

Reputation: 1900

Data distribution with Spark SQL

I'm quite new to Spark SQL and I'm struggling to combine operations properly. What I want is a bit tricky:

WHAT I HAVE

From values :

##########
# events #
##########
#  'M'   #
#  'M'   #
#  'F'   #
#  NULL  #
##########

WHAT I WANT

I'm looking for a query that stores the distribution in a map like:

##################
#  distribution  #
##################
#{'M': 2, 'F': 1}#
##################

Note that I don't want the NULL values.

WHAT I TRIED

I tried several SQL queries that don't work, such as:

Making a fake list:

SELECT collect_list(concat(col,"\",\"",cnt))
FROM (SELECT col, count(col) AS cnt
      FROM VALUES (NULL), ("M"), ("M"), ("F") AS tab(col)
      GROUP BY col);

=> ["F","1","M","2"]

Then passing it to the map function:

SELECT map(mp)
FROM (SELECT collect_list(concat(col,"\",\"",cnt)) AS mp
      FROM (SELECT col, count(col) AS cnt
            FROM VALUES (NULL), ("M"), ("M"), ("F") AS tab(col)
            GROUP BY col));

=> data type mismatch

Dear community, your help would be welcome!

PS: I want to reproduce the histogram function of AWS Presto

Upvotes: 1

Views: 261

Answers (1)

mck

Reputation: 42402

You can create the map using GROUP BY and map_from_entries:

select 
    -- (events, c) is a struct; collect_list gathers one struct per
    -- distinct value, and map_from_entries turns them into map entries
    map_from_entries(collect_list((events, c))) as dist 
from (
    select 
        events, 
        count(1) as c 
    from df 
    where events is not null  -- exclude the NULL values
    group by events
);

+----------------+
|            dist|
+----------------+
|[F -> 1, M -> 2]|
+----------------+
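For reference, a minimal self-contained sketch of the same technique against the sample values from the question (map_from_entries requires Spark 2.4+):

SELECT
    -- each collected struct becomes one (key, value) entry of the map
    map_from_entries(collect_list(struct(col, cnt))) AS distribution
FROM (
    SELECT col, count(1) AS cnt
    FROM VALUES (NULL), ('M'), ('M'), ('F') AS tab(col)
    WHERE col IS NOT NULL   -- drop the NULLs, as requested
    GROUP BY col
);

This produces the desired {'M': 2, 'F': 1} map. The map() attempt from the question fails because map expects alternating key/value arguments (map(key1, value1, key2, value2, ...)), not a single array of strings. An alternative on Spark 2.4+ would be map_from_arrays(collect_list(col), collect_list(cnt)), but map_from_entries keeps each key paired with its count inside a single struct, which is the safer choice.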

Upvotes: 1
