tak883

Reputation: 23

Can I aggregate map data as an array in PySpark?

I'm trying to aggregate data. The code below works:

name  id  day    value
ken   01  02-01  good
ken   01  02-02  error
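
For reference, a minimal sketch that builds this sample (assuming every column is a string):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(
    [("ken", "01", "02-01", "good"), ("ken", "01", "02-02", "error")],
    ["name", "id", "day", "value"],
)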
from pyspark.sql import functions as func

spark_df.groupBy("name", "id").agg(
    func.collect_list(
        func.create_map(func.col("day"), func.col("value"))
    ).alias("day_val")
)

This aggregates day_val as a list of maps, like this:

[{"day1":"value1"},{"day2":"value2"},{"day3":"value3"},....]

But I want to save it as

{"day1":"value1","day2":"value2","day3":"value3"}

This is because in DynamoDB I want to use it as a map, not a list. Can I convert the list to a single map, or aggregate it as a map directly?

Thank you.

Upvotes: 2

Views: 522

Answers (2)

blackbishop

Reputation: 32670

You can use map_from_arrays:

from pyspark.sql import functions as F

df1 = df.groupBy('name', 'id').agg(
    F.map_from_arrays(
        F.collect_list('day'),
        F.collect_list('value')
    ).alias('day_val')
)

df1.show(truncate=False)

#+----+---+-------------------------------+
#|name|id |day_val                        |
#+----+---+-------------------------------+
#|ken |01 |[02-01 -> good, 02-02 -> error]|
#+----+---+-------------------------------+
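
If you then need the map serialized as a JSON string (for example, to store the whole thing as one DynamoDB attribute), here is a minimal sketch using to_json on the map column; df1 is the result above, and the JSON-string step itself is an assumption about your write path, not something the question requires:

# Hedged sketch: to_json serializes a MapType column to a JSON string.
df2 = df1.withColumn('day_val_json', F.to_json('day_val'))
df2.show(truncate=False)
# Expected shape (assumption): {"02-01":"good","02-02":"error"}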

Upvotes: 1

mck

Reputation: 42352

You can use map_from_entries:

import pyspark.sql.functions as F

result = df.groupBy('name', 'id').agg(
    F.map_from_entries(
        F.collect_list(
            F.struct('day', 'value')
        )
    ).alias('day_val')
)

result.show(truncate=False)
+----+---+-------------------------------+
|name|id |day_val                        |
+----+---+-------------------------------+
|ken |01 |[02-01 -> good, 02-02 -> error]|
+----+---+-------------------------------+
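
If you already have the list-of-maps column from your original collect_list(create_map(...)) aggregation and want to flatten it into one map instead of re-aggregating, a hedged sketch using the aggregate higher-order function with map_concat (Spark 2.4+; df_list and its array-of-maps column day_val are assumed names):

# Hedged sketch: fold the array of single-entry maps into one map.
# Note: map_concat raises on duplicate keys by default in Spark 3.
flattened = df_list.withColumn(
    'day_val',
    F.expr("aggregate(day_val, map(), (acc, m) -> map_concat(acc, m))")
)
flattened.show(truncate=False)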

Upvotes: 1
