michalrudko

Reputation: 1530

Spark SQL - aggregate columns into dictionary

I have the following data frame:

spark.sql("""
SELECT id, color, cnt
FROM ( 
  VALUES ('A','green', 5),
         ('A','yellow', 4),
         ('A','yellow',2),
         ('B','blue', 3),
         ('B','green',4),
         ('B','blue',1) 
) as T (id, color, cnt)
""")

and I would like to aggregate it so that for each id key I get a dictionary of per-color counts and a dictionary of per-color sums of the cnt column. So the output would be:

+---+-------------------+-------------------+
| id|          color_cnt|          color_sum|
+---+-------------------+-------------------+
|  B|  {blue:2, green:1}|  {blue:4, green:4}|
|  A|{green:1, yellow:2}|{green:5, yellow:6}|
+---+-------------------+-------------------+

Are there any Spark SQL functions that would help me achieve what I need? Thanks!

Upvotes: 0

Views: 588

Answers (1)

mck

Reputation: 42342

The functions to_json and map_from_arrays should help here. If you want a map-type column in the dataframe instead of a JSON string, simply remove to_json.

spark.sql("""
SELECT
    id,
    to_json(map_from_arrays(collect_list(color), collect_list(count))) count,
    to_json(map_from_arrays(collect_list(color), collect_list(sum))) sum 
FROM (

    SELECT id, color, count(1) count, sum(cnt) sum
    FROM ( 
      VALUES ('A','green', 5),
             ('A','yellow', 4),
             ('A','yellow',2),
             ('B','blue', 3),
             ('B','green',4),
             ('B','blue',1) 
    ) as T (id, color, cnt)
    GROUP BY id, color)

GROUP BY id
""").show(truncate=False)

+---+----------------------+----------------------+
|id |count                 |sum                   |
+---+----------------------+----------------------+
|B  |{"blue":2,"green":1}  |{"blue":4,"green":4}  |
|A  |{"yellow":2,"green":1}|{"yellow":6,"green":5}|
+---+----------------------+----------------------+
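For comparison, here is a minimal sketch of the same two-step aggregation using the PySpark DataFrame API instead of SQL. The variable names df, per_color, and result are just for illustration, and map_from_arrays requires Spark 2.4+:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('A', 'green', 5), ('A', 'yellow', 4), ('A', 'yellow', 2),
     ('B', 'blue', 3), ('B', 'green', 4), ('B', 'blue', 1)],
    ['id', 'color', 'cnt'])

# Step 1: per-(id, color) count and sum.
per_color = df.groupBy('id', 'color').agg(
    F.count(F.lit(1)).alias('cnt_per_color'),
    F.sum('cnt').alias('sum_per_color'))

# Step 2: collect each id's colors and aggregates into map columns.
result = per_color.groupBy('id').agg(
    F.map_from_arrays(F.collect_list('color'),
                      F.collect_list('cnt_per_color')).alias('color_cnt'),
    F.map_from_arrays(F.collect_list('color'),
                      F.collect_list('sum_per_color')).alias('color_sum'))

result.show(truncate=False)

This keeps color_cnt and color_sum as map-type columns; wrap them in F.to_json if you want JSON strings as in the SQL version above.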

Upvotes: 3
