Reputation: 1029
I have a requirement to use pivot on the table below:
id,week,score
102,1,96
101,1,138
102,1,37
101,1,59
101,2,282
102,2,212
102,2,78
101,2,97
102,3,60
102,3,123
101,3,220
101,3,87
Expected output:
id   1       2       3
101  138,59  282,97  220,87
102  96,37   212,78  123,60
Here I need the scores sorted within each cell.
I have tried the code below, but it only works when there is a single record for a particular id:
df.groupBy("id").pivot("week").agg(first("score"))
Upvotes: 0
Views: 359
Reputation: 41957
The equivalent PySpark code for the Scala answer posted by Prasad Khode is as below:
from pyspark.sql import functions as F
df.groupBy("id").pivot("week").agg(F.collect_list("score")).show()
If you check the API documentation, you can see:
collect_list(Column e)
Aggregate function: returns a list of objects with duplicates.
You can use collect_set as well, which will give you the same output with duplicates removed.
df.groupBy("id").pivot("week").agg(F.collect_set("score")).show()
The API documentation says the following:
collect_set(Column e)
Aggregate function: returns a set of objects with duplicate elements eliminated.
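Note that collect_list does not guarantee any ordering inside the collected list. Since the question also asks for the scores to be sorted within each cell (the expected output appears to be descending), one hedged option, not part of the original answer, is to wrap the aggregation in sort_array:
# Sort each collected list of scores; asc=False gives descending order,
# matching the expected output (e.g. 138,59 for id 101, week 1)
df.groupBy("id").pivot("week").agg(
    F.sort_array(F.collect_list("score"), asc=False)
).show()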
Upvotes: 1
Reputation: 6739
Instead of first, you should collect all the values using collect_list; this will give you the result as a list:
import org.apache.spark.sql.functions._
df.groupBy("id").pivot("week").agg(collect_list("score")).show()
Output:
+---+---------+---------+---------+
|id |1 |2 |3 |
+---+---------+---------+---------+
|101|[138, 59]|[282, 97]|[220, 87]|
|102|[96, 37] |[212, 78]|[60, 123]|
+---+---------+---------+---------+
Upvotes: 3