Reputation: 63
I want to use PySpark to pivot data from a number of tables, but I need to do it in a strange way. Please see below for an example.
Original table:
Vehicle_id | Owner_ID | Vehicle_Buy_Date
--------------------------------------------
1 | 1 | 01/01/2015
1 | 2 | 01/10/2014
2 | 1 | 10/10/2016
End result:
Vehicle_id | Owner_1_Buy_Date | Owner_2_Buy_Date
------------------------------------------------
1          | 01/01/2015      | 01/10/2014
2          | 10/10/2016      | NULL
I understand that this is an unusual question to ask, as this sort of reshaping isn't typically done on database tables.
Is there any way of doing this type of pivoting in PySpark?
Upvotes: 1
Views: 4992
Reputation: 756
Try this code to get the required result:
import pyspark.sql.functions as F
NewDf = df.withColumn('id', F.concat(F.lit('Owner_'), F.col('Owner_ID'), F.lit('_Buy_Date'))).groupBy('Vehicle_id').pivot('id').agg(F.first('Vehicle_Buy_Date'))
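With the sample data from the question, NewDf.show() should produce the requested layout, since the column names are built from Owner_ID (1 and 2 become Owner_1_Buy_Date and Owner_2_Buy_Date):
+----------+----------------+----------------+
|Vehicle_id|Owner_1_Buy_Date|Owner_2_Buy_Date|
+----------+----------------+----------------+
|         1|      01/01/2015|      01/10/2014|
|         2|      10/10/2016|            null|
+----------+----------------+----------------+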
Upvotes: 1
Reputation: 10096
The function in PySpark is called pivot:
import pyspark.sql.functions as psf
df.groupBy("Vehicle_id").pivot("Owner_ID").agg(psf.max("Vehicle_Buy_Date")).show()
+----------+----------+----------+
|Vehicle_id| 1| 2|
+----------+----------+----------+
| 1|01/01/2015|01/10/2014|
| 2|10/10/2016| null|
+----------+----------+----------+
If you know the distinct values of Owner_ID in advance, you can pass them as a list argument to the pivot function; otherwise Spark will compute them itself, at the cost of an extra job to find the distinct values.
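For example, since the sample data only contains owners 1 and 2, passing them explicitly skips that extra computation:
df.groupBy("Vehicle_id").pivot("Owner_ID", [1, 2]).agg(psf.max("Vehicle_Buy_Date")).show()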
Upvotes: 1
Reputation: 4967
It's not a good idea to transpose your data; it's easy to reduce
your data to a structure like this instead (I think it's a better way):
Vehicle_id | Owner_ID_Vehicle_Buy_Date_Reduce
--------------------------------------------
1 | [{"Owner_ID":1, Vehicle_Buy_Date: 01/01/2015},{ "Owner_ID": 2, Vehicle_Buy_Date: 01/10/2014}]
2 | [{"Owner_ID":1, Vehicle_Buy_Date: 10/10/2016}]
How would you transpose a structure like this?
Vehicle_id | Owner_ID | Vehicle_Buy_Date
--------------------------------------------
1 | 1 | 01/01/2015
1 | 2 | 01/10/2014
1 | 3 | 01/10/2013
2 | 4 | 10/10/2016
... | ... | ...
9 | 123 | 10/10/2017
Upvotes: 0