LUZO

Reputation: 1029

Pivot in PySpark SQL

I have a requirement to use pivot on the table below.

id,week,score
102,1,96
101,1,138
102,1,37
101,1,59
101,2,282
102,2,212
102,2,78
101,2,97
102,3,60
102,3,123
101,3,220
101,3,87

output

      1        2        3
101   138,59   282,97   220,87
102   96,37    212,78   123,60

Here I also need the scores sorted (descending, as in the output above).

I have tried the code below, but it only works when there is a single record for a particular id:

from pyspark.sql.functions import first

df.groupBy("id").pivot("week").agg(first("score"))
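
For reference, the sample data can be built as a DataFrame like this (a minimal sketch, assuming a local SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# sample rows from the table above: (id, week, score)
data = [
    (102, 1, 96), (101, 1, 138), (102, 1, 37), (101, 1, 59),
    (101, 2, 282), (102, 2, 212), (102, 2, 78), (101, 2, 97),
    (102, 3, 60), (102, 3, 123), (101, 3, 220), (101, 3, 87),
]
df = spark.createDataFrame(data, ["id", "week", "score"])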

Upvotes: 0

Views: 359

Answers (2)

Ramesh Maharjan

Reputation: 41957

The equivalent PySpark code for the Scala answer posted by Prasad Khode is as follows:

from pyspark.sql import functions as F
df.groupBy("id").pivot("week").agg(F.collect_list("score")).show()

If you check the API documentation, you can see:

collect_list(Column e)
Aggregate function: returns a list of objects with duplicates.

You can use collect_set as well, which will give you the same output but with duplicates removed.

df.groupBy("id").pivot("week").agg(F.collect_set("score")).show()

The API documentation says the following:

collect_set(Column e)
Aggregate function: returns a set of objects with duplicate elements eliminated.
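
Since the question also asks for the scores to be sorted, one option (a sketch, not part of the original answer) is to wrap collect_list in sort_array; asc=False matches the descending order shown in the expected output:

from pyspark.sql import functions as F

(df.groupBy("id")
   .pivot("week")
   .agg(F.sort_array(F.collect_list("score"), asc=False))  # sort each collected list descending
   .show())

Note that neither collect_list nor collect_set guarantees any element order, so sorting after collection is needed either way.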

Upvotes: 1

Prasad Khode

Reputation: 6739

Instead of first, you should collect all the values using collect_list; this will give you the result as a list:

import org.apache.spark.sql.functions._

df.groupBy("id").pivot("week").agg(collect_list("score")).show()

Output:

+---+---------+---------+---------+
|id |1        |2        |3        |
+---+---------+---------+---------+
|101|[138, 59]|[282, 97]|[220, 87]|
|102|[96, 37] |[212, 78]|[60, 123]|
+---+---------+---------+---------+
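
If the goal is the exact comma-joined strings from the question's expected output rather than arrays, a PySpark sketch (an addition, assuming the same df) sorts each list numerically, casts it to strings, and joins the elements with concat_ws:

from pyspark.sql import functions as F

(df.groupBy("id")
   .pivot("week")
   .agg(F.concat_ws(",",  # join the sorted scores into a string like "138,59"
        F.sort_array(F.collect_list("score"), asc=False).cast("array<string>")))
   .show())

This should produce the layout from the question, e.g. 138,59 for id 101 in week 1.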

Upvotes: 3
