Wendy Velasquez

Reputation: 181

Adding values from two different rows into one using PySpark

I have two rows with the same id and product, but with different values in the other columns:

id product class cost
1 table large 5.12
1 table medium 2.20

so I'm trying to get the following:

id product class cost
1 table large, medium 7.32

I'm currently using the following code to get this:

df.groupBy("id", "product").agg(
    F.collect_list("class"),
    F.sum("cost").alias("Sum")
)

The issue with this snippet is that after grouping, class repeats the first value it finds, and the addition doesn't seem correct (I'm not sure if it's taking the first cost and adding it once for every class it encounters for that same id across the rows), so I'm getting something like this:

id product class cost
1 table large, large 10.24

This is another snippet I used, so I could keep all my other fields while performing the addition over those two rows:

df.withColumn("total", F.sum("cost").over(Window.partitionBy("id")))

Would applying the F.array_join() function give the same result?
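For context, the two approaches above differ in shape: groupBy collapses each group to a single output row, while a window sum keeps every input row and attaches the partition total to each one. A minimal plain-Python sketch of that difference (not Spark itself, just the semantics, using the sample rows from the question):

```python
# Sample rows from the question: (id, product, class, cost)
rows = [
    (1, "table", "large", 5.12),
    (1, "table", "medium", 2.20),
]

# groupBy-style aggregation: one output row per (id, product) key
totals = {}
for id_, product, cls, cost in rows:
    totals[(id_, product)] = totals.get((id_, product), 0.0) + cost
grouped = [(k[0], k[1], round(v, 2)) for k, v in totals.items()]
# grouped has one row: the class column is gone unless it is aggregated too

# window-style sum (like partitionBy over the same key): every input row
# is kept, each annotated with its partition's total
windowed = [
    (id_, product, cls, cost, round(totals[(id_, product)], 2))
    for id_, product, cls, cost in rows
]
# windowed has two rows, both carrying the same total 7.32
```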

Upvotes: 0

Views: 143

Answers (1)

过过招

Reputation: 4244

You need to use the array_join function to join the results of collect_list with commas (,).

df = df.groupBy('id', 'product').agg(
    F.array_join(F.collect_list('class'), ',').alias('class'),
    F.sum('cost').alias('cost')
)
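This also answers the follow-up in the question: collect_list returns an array column, and array_join flattens that array into a single comma-separated string. The combined aggregation's semantics can be sketched in plain Python (not Spark; the rows below are the sample data from the question):

```python
from collections import defaultdict

# Sample rows from the question: (id, product, class, cost)
rows = [
    (1, "table", "large", 5.12),
    (1, "table", "medium", 2.20),
]

# Emulate:
# df.groupBy("id", "product").agg(
#     F.array_join(F.collect_list("class"), ",").alias("class"),
#     F.sum("cost").alias("cost"))
groups = defaultdict(lambda: ([], 0.0))
for id_, product, cls, cost in rows:
    classes, total = groups[(id_, product)]
    groups[(id_, product)] = (classes + [cls], total + cost)

result = [
    (id_, product, ",".join(classes), round(total, 2))
    for (id_, product), (classes, total) in groups.items()
]
# result == [(1, "table", "large,medium", 7.32)]
```

Note that collect_list does not guarantee element order after a shuffle, so the joined string may come out as "medium,large" in Spark depending on partitioning.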

Upvotes: 1
