Reputation: 181
I have two rows with the exact same data but columns changing between those two rows:
id | product | class | cost |
---|---|---|---|
1 | table | large | 5.12 |
1 | table | medium | 2.20 |
so I'm trying to get the following:
id | product | class | cost |
---|---|---|---|
1 | table | large, Medium | 7.32 |
I'm currently using the following code to get this:
df.groupBy("id", "product").agg(collect_list("class"),
(
F.sum("cost")
).alias("Sum")
The issue with this snippet code is that when doing the grouping is the first value it finds in class, and the addition doesn't seem to be correct (I'm not sure if it because is getting the first value and adding it the times it encounters class on that same id throughout the rows), so I'm getting something like this
id | product | class | cost |
---|---|---|---|
1 | table | large, large | 10.24 |
this is another snippet code I used, so I could get all my other fields while performing the addition on those two columns:
df.withColumn("total", F.sum("cost").over(Window.partitionBy("id")))
will it be the same to apply the F.array_join() function ?
Upvotes: 0
Views: 143
Reputation: 4244
You need to use the array_join function to join the results of collect_list with commas (,).
df = df.groupBy('id', 'product').agg(
F.array_join(F.collect_list('class'), ',').alias('class'),
F.sum('cost').alias('cost')
)
Upvotes: 1