Reputation: 39
In a pandas DataFrame, I am able to do
df2 = df.groupby('name').agg({'id': 'first', 'grocery': ','.join})
from
name  id  grocery
Mike  01  Apple
Mike  01  Orange
Kate  99  Beef
Kate  99  Wine
to
name  id  grocery
Mike  01  Apple,Orange
Kate  99  Beef,Wine
Since id is the same across all rows for the same person, I just take the first id for each person and concatenate the grocery values.
I can't seem to make this work in PySpark. How can I do the same thing there? I want grocery to be a string, not a list.
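For reference, here is a minimal runnable pandas version of the snippet above, built from the sample data in the question (the DataFrame construction is added here for completeness):

```python
import pandas as pd

# Sample data from the question; id repeats for each name.
df = pd.DataFrame({
    "name": ["Mike", "Mike", "Kate", "Kate"],
    "id": ["01", "01", "99", "99"],
    "grocery": ["Apple", "Orange", "Beef", "Wine"],
})

# Take the first id per name and join the grocery values into one string.
df2 = df.groupby("name", as_index=False).agg({"id": "first", "grocery": ",".join})
print(df2)
#    name  id       grocery
# 0  Kate  99     Beef,Wine
# 1  Mike  01  Apple,Orange
```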
Upvotes: 1
Views: 4446
Reputation: 214957
Use collect_list to collect the elements into a list, then join that list into a string with concat_ws:
import pyspark.sql.functions as f

df.groupBy("name").agg(
    f.first("id").alias("id"),
    f.concat_ws(",", f.collect_list("grocery")).alias("grocery")
).show()
#+----+---+------------+
#|name| id| grocery|
#+----+---+------------+
#|Kate| 99| Beef,Wine|
#|Mike| 01|Apple,Orange|
#+----+---+------------+
Upvotes: 9