lookingglass

Reputation: 39

pyspark using agg to concat string after groupBy

With a pandas DataFrame, I am able to do

df2 = df.groupby('name').agg({'id': 'first', 'grocery': ','.join})

from

name        id        grocery
Mike        01        Apple
Mike        01        Orange
Kate        99        Beef
Kate        99        Wine

to

name        id        grocery
Mike        01        Apple,Orange
Kate        99        Beef,Wine

Since id is the same across all rows for a given person, I just take the first one for each person and concatenate the grocery values.
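For reference, a minimal runnable sketch of the pandas side, assuming the sample data shown above with id stored as a string so the leading zero in 01 survives. The `sort=False` and `reset_index()` calls are additions to the one-liner, used only so the result lines up with the table.

```python
import pandas as pd

# Sample data from the question; id is kept as a string (assumption).
df = pd.DataFrame({
    "name": ["Mike", "Mike", "Kate", "Kate"],
    "id": ["01", "01", "99", "99"],
    "grocery": ["Apple", "Orange", "Beef", "Wine"],
})

df2 = (
    df.groupby("name", sort=False)  # keep groups in order of first appearance
      .agg({"id": "first", "grocery": ",".join})  # first id, comma-joined groceries
      .reset_index()  # turn the group key back into a regular column
)
print(df2)
```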

I can't seem to make this work in PySpark. How can I do the same thing there? I want grocery to be a string instead of a list.

Upvotes: 1

Views: 4446

Answers (1)

akuiper

Reputation: 214957

Use collect_list to collect the elements of each group into a list, then join that list into a string with concat_ws:

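import pyspark.sql.functions as f

# Wrap the chain in parentheses so Python accepts the leading-dot line breaks.
(
    df.groupBy("name")
      .agg(
          f.first("id").alias("id"),
          # collect_list gathers each group's values; concat_ws joins them with ","
          f.concat_ws(",", f.collect_list("grocery")).alias("grocery"),
      )
      .show()
)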

#+----+---+------------+
#|name| id|     grocery|
#+----+---+------------+
#|Kate| 99|   Beef,Wine|
#|Mike| 01|Apple,Orange|
#+----+---+------------+

Upvotes: 9
