Vincent Chalmel

Reputation: 652

Pyspark aggregate a StructType column as an Array of its elements for each line

I'm trying to do something that seems pretty straightforward, but I can't figure out how to do it with PySpark.

I have a df with two columns (to simplify), 'id' and 'strCol', with possible duplicate ids.

I want to do a df.groupBy('id') that returns, for each id, an array of the strCol values.

A simple example:

|--id--|--strCol--|
|  a   | {'a':1}  |
|  a   | {'a':2}  |
|  b   | {'b':3}  |
|  b   | {'b':4}  |
|------|----------|

would become:

|--id--|------aggsStr-------|
|  a   | [{'a':1},{'a':2}]  |
|  b   | [{'b':3},{'b':4}]  |
|------|--------------------|

I tried to use apply with a pandas UDF, but it seems to refuse to return arrays (or maybe I didn't use it correctly). Roughly, the shape of my attempt is sketched below.
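For reference, this is approximately what I was going for, rewritten here with the applyInPandas API (an illustrative sketch, not my exact code: the collect_group function and the output schema are made up for this example, and map-typed columns need a Spark version whose Arrow conversion supports MapType in pandas UDFs):

import pandas as pd
from pyspark.sql.types import (ArrayType, LongType, MapType,
                               StringType, StructField, StructType)

# One output row per group: the id plus the collected strCol values.
schema = StructType([
    StructField("id", StringType()),
    StructField("aggsStr", ArrayType(MapType(StringType(), LongType()))),
])

def collect_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Collapse the whole group into a single row whose 'aggsStr' cell
    # holds the list of strCol values for that id.
    return pd.DataFrame({"id": [pdf["id"].iloc[0]],
                         "aggsStr": [list(pdf["strCol"])]})

result = df.groupBy("id").applyInPandas(collect_group, schema=schema)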

Upvotes: 1

Views: 3968

Answers (1)

dataista

Reputation: 3457

You can use collect_list from the pyspark.sql.functions module:

from pyspark.sql import functions as F
agg = df.groupBy("id").agg(F.collect_list("strCol"))

A fully functional example:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = {'id': ['a', 'a', 'b', 'b'], 'strCol': [{'a': 1}, {'a': 2}, {'b': 3}, {'b': 4}]}

df_aux = pd.DataFrame(data)

# df type: DataFrame[id: string, strCol: map<string,bigint>]
df = spark.createDataFrame(df_aux)

# agg type: DataFrame[id: string, collect_list(strCol): array<map<string,bigint>>]
agg = df.groupBy("id").agg(F.collect_list("strCol"))
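If you want the result column named aggsStr, as in the question's desired output, you can add an alias and inspect the result with show():

# Rename the aggregated column to match the desired output schema.
agg = df.groupBy("id").agg(F.collect_list("strCol").alias("aggsStr"))
agg.show(truncate=False)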

Hope this helped!

Upvotes: 2
