Reputation: 1530
I have the following data frame:
spark.sql("""
SELECT id, color, cnt
FROM (
    VALUES ('A', 'green', 5),
           ('A', 'yellow', 4),
           ('A', 'yellow', 2),
           ('B', 'blue', 3),
           ('B', 'green', 4),
           ('B', 'blue', 1)
) AS T (id, color, cnt)
""")
and would like to aggregate it in such a way that for each id
key I get a dictionary of per-color counts and a dictionary of
per-color sums of the cnt column. So the output would be:
+---+-------------------+-------------------+
| id|          color_cnt|          color_sum|
+---+-------------------+-------------------+
|  B|  {blue:2, green:1}|  {blue:4, green:4}|
|  A|{green:1, yellow:2}|{green:5, yellow:6}|
+---+-------------------+-------------------+
Are there any Spark SQL functions that would help me achieve what I need? Thanks!
Upvotes: 0
Views: 588
Reputation: 42342
The functions to_json and map_from_arrays would be helpful here: collect_list gathers the per-color values into arrays, map_from_arrays zips them into a map, and to_json renders the map as a JSON string. If you want a map type column in the dataframe, simply remove to_json.
spark.sql("""
SELECT
    id,
    to_json(map_from_arrays(collect_list(color), collect_list(count))) count,
    to_json(map_from_arrays(collect_list(color), collect_list(sum))) sum
FROM (
    SELECT id, color, count(1) count, sum(cnt) sum
    FROM (
        VALUES ('A', 'green', 5),
               ('A', 'yellow', 4),
               ('A', 'yellow', 2),
               ('B', 'blue', 3),
               ('B', 'green', 4),
               ('B', 'blue', 1)
    ) AS T (id, color, cnt)
    GROUP BY id, color
)
GROUP BY id
""").show(truncate=False)
+---+----------------------+----------------------+
|id |count                 |sum                   |
+---+----------------------+----------------------+
|B  |{"blue":2,"green":1}  |{"blue":4,"green":4}  |
|A  |{"yellow":2,"green":1}|{"yellow":6,"green":5}|
+---+----------------------+----------------------+
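For reference, here is a minimal sketch of the same two-step aggregation written with the PySpark DataFrame API instead of SQL (assuming Spark 2.4+, where map_from_arrays is available). Since to_json is omitted, the result columns stay as map type; the alias names color_cnt and color_sum are taken from the question:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.sql("""
SELECT id, color, cnt
FROM (
    VALUES ('A', 'green', 5),
           ('A', 'yellow', 4),
           ('A', 'yellow', 2),
           ('B', 'blue', 3),
           ('B', 'green', 4),
           ('B', 'blue', 1)
) AS T (id, color, cnt)
""")

result = (
    df.groupBy("id", "color")
      # per-(id, color) row count and sum of cnt
      .agg(F.count(F.lit(1)).alias("n"), F.sum("cnt").alias("s"))
      .groupBy("id")
      # zip colors with their per-color results into map columns;
      # like the SQL version, this relies on the two collect_list
      # calls producing values in the same order
      .agg(
          F.map_from_arrays(F.collect_list("color"),
                            F.collect_list("n")).alias("color_cnt"),
          F.map_from_arrays(F.collect_list("color"),
                            F.collect_list("s")).alias("color_sum"),
      )
)
result.show(truncate=False)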
Upvotes: 3