Wendy Velasquez

Reputation: 181

Adding values from two different rows into one using PySpark

I have two rows with the same id and product, but with different values in the other columns:

id product class cost
1 table large 5.12
1 table medium 2.20

so I'm trying to get the following:

id product class cost
1 table large, medium 7.32

I'm currently using the following code to get this:

df.groupBy("id", "product").agg(
    F.collect_list("class"),
    F.sum("cost").alias("Sum")
)

The issue with this snippet is that after grouping, class repeats the first value it finds, and the addition doesn't seem correct (I'm not sure if it's taking the first cost and adding it once for every class it encounters for that same id across the rows), so I'm getting something like this:

id product class cost
1 table large, large 10.24

This is another snippet I used, so I could keep all my other fields while performing the addition over those two rows:

df.withColumn("total", F.sum("cost").over(Window.partitionBy("id")))

Would applying the F.array_join() function give the same result?
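For context, the two approaches above differ in shape: groupBy collapses each group to a single output row, while a window sum keeps every input row and attaches the partition total to each one. A minimal plain-Python sketch of that difference (not Spark itself, just the semantics, using the sample rows from the question):

```python
# Sample rows from the question: (id, product, class, cost)
rows = [
    (1, "table", "large", 5.12),
    (1, "table", "medium", 2.20),
]

# groupBy-style aggregation: one output row per (id, product) key
totals = {}
for id_, product, cls, cost in rows:
    totals[(id_, product)] = totals.get((id_, product), 0.0) + cost
grouped = [(k[0], k[1], round(v, 2)) for k, v in totals.items()]
# grouped has one row: the class column is gone unless it is aggregated too

# window-style sum (like partitionBy over the same key): every input row
# is kept, each annotated with its partition's total
windowed = [
    (id_, product, cls, cost, round(totals[(id_, product)], 2))
    for id_, product, cls, cost in rows
]
# windowed has two rows, both carrying the same total 7.32
```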

Upvotes: 0

Views: 143

Answers (1)

过过招

Reputation: 4244

You need to use the array_join function to join the results of collect_list with commas (,).

df = df.groupBy('id', 'product').agg(
    F.array_join(F.collect_list('class'), ',').alias('class'),
    F.sum('cost').alias('cost')
)
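This also answers the follow-up in the question: collect_list returns an array column, and array_join flattens that array into a single comma-separated string. The combined aggregation's semantics can be sketched in plain Python (not Spark; the rows below are the sample data from the question):

```python
from collections import defaultdict

# Sample rows from the question: (id, product, class, cost)
rows = [
    (1, "table", "large", 5.12),
    (1, "table", "medium", 2.20),
]

# Emulate:
# df.groupBy("id", "product").agg(
#     F.array_join(F.collect_list("class"), ",").alias("class"),
#     F.sum("cost").alias("cost"))
groups = defaultdict(lambda: ([], 0.0))
for id_, product, cls, cost in rows:
    classes, total = groups[(id_, product)]
    groups[(id_, product)] = (classes + [cls], total + cost)

result = [
    (id_, product, ",".join(classes), round(total, 2))
    for (id_, product), (classes, total) in groups.items()
]
# result == [(1, "table", "large,medium", 7.32)]
```

Note that collect_list does not guarantee element order after a shuffle, so the joined string may come out as "medium,large" in Spark depending on partitioning.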

Upvotes: 1
