Vincent Chalmel

Reputation: 652

Pyspark aggregate a StructType column as an Array of its elements for each line

I'm trying to do something that seems pretty straightforward, but I can't figure out how to do it with PySpark.

I have a df with two columns (to simplify), 'id' and 'strCol', with possible duplicate ids.

I want to do a df.groupBy('id') that returns, for each id, an array of the strCol values.

A simple example:

|--id--|--strCol--|
|  a   | {'a':1}  |
|  a   | {'a':2}  |
|  b   | {'b':3}  |
|  b   | {'b':4}  |
|------|----------|

would become:

|--id--|------aggsStr-------|
|  a   | [{'a':1},{'a':2}]  |
|  b   | [{'b':3},{'b':4}]  |
|------|--------------------|

I tried to use apply with a pandas UDF, but it seems to refuse to return arrays (or maybe I didn't use it correctly). Roughly, the shape of my attempt is sketched below.
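For reference, this is approximately what I was going for, rewritten here with the applyInPandas API (an illustrative sketch, not my exact code: the collect_group function and the output schema are made up for this example, and map-typed columns need a Spark version whose Arrow conversion supports MapType in pandas UDFs):

import pandas as pd
from pyspark.sql.types import (ArrayType, LongType, MapType,
                               StringType, StructField, StructType)

# One output row per group: the id plus the collected strCol values.
schema = StructType([
    StructField("id", StringType()),
    StructField("aggsStr", ArrayType(MapType(StringType(), LongType()))),
])

def collect_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Collapse the whole group into a single row whose 'aggsStr' cell
    # holds the list of strCol values for that id.
    return pd.DataFrame({"id": [pdf["id"].iloc[0]],
                         "aggsStr": [list(pdf["strCol"])]})

result = df.groupBy("id").applyInPandas(collect_group, schema=schema)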

Upvotes: 1

Views: 3968

Answers (1)

dataista

Reputation: 3457

You can use collect_list from the pyspark.sql.functions module:

from pyspark.sql import functions as F
agg = df.groupBy("id").agg(F.collect_list("strCol"))

A fully functional example:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = {'id': ['a', 'a', 'b', 'b'], 'strCol': [{'a': 1}, {'a': 2}, {'b': 3}, {'b': 4}]}

df_aux = pd.DataFrame(data)

# df type: DataFrame[id: string, strCol: map<string,bigint>]
df = spark.createDataFrame(df_aux)

# agg type: DataFrame[id: string, collect_list(strCol): array<map<string,bigint>>]
agg = df.groupBy("id").agg(F.collect_list("strCol"))
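If you want the result column named aggsStr, as in the question's desired output, you can add an alias and inspect the result with show():

# Rename the aggregated column to match the desired output schema.
agg = df.groupBy("id").agg(F.collect_list("strCol").alias("aggsStr"))
agg.show(truncate=False)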

Hope this helped!

Upvotes: 2
