Reputation: 63
I want to use PySpark to pivot data from a number of tables, but I need to do it in a strange way. Please see below for an example.
Original table:
Vehicle_id | Owner_ID | Vehicle_Buy_Date
--------------------------------------------
1 | 1 | 01/01/2015
1 | 2 | 01/10/2014
2 | 1 | 10/10/2016
End result:
Vehicle_id | Owner_1_Buy_Date | Owner_2_Buy_Date
------------------------------------------------
1          | 01/01/2015      | 01/10/2014
2          | 10/10/2016      | NULL
I understand that this is an unusual question to ask, as this sort of reshaping isn't typically done on database tables.
Is there any way of doing this type of pivoting in PySpark?
Upvotes: 1
Views: 4992
Reputation: 756
Try this code to get the required result:
import pyspark.sql.functions as F
NewDf = df.withColumn('id', F.concat(F.lit('Owner_'), F.col('Owner_ID'), F.lit('_Buy_Date'))).groupBy('Vehicle_id').pivot('id').agg(F.first('Vehicle_Buy_Date'))
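With the sample data from the question, NewDf.show() should produce the requested layout, since the column names are built from Owner_ID (1 and 2 become Owner_1_Buy_Date and Owner_2_Buy_Date):
+----------+----------------+----------------+
|Vehicle_id|Owner_1_Buy_Date|Owner_2_Buy_Date|
+----------+----------------+----------------+
|         1|      01/01/2015|      01/10/2014|
|         2|      10/10/2016|            null|
+----------+----------------+----------------+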
Upvotes: 1
Reputation: 10096
The function in PySpark is called pivot:
import pyspark.sql.functions as psf
df.groupBy("Vehicle_id").pivot("Owner_ID").agg(psf.max("Vehicle_Buy_Date")).show()
+----------+----------+----------+
|Vehicle_id| 1| 2|
+----------+----------+----------+
| 1|01/01/2015|01/10/2014|
| 2|10/10/2016| null|
+----------+----------+----------+
If you know the distinct values of Owner_ID in advance, you can pass them as a list argument to the pivot function; otherwise Spark will compute them itself, at the cost of an extra job to find the distinct values.
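For example, since the sample data only contains owners 1 and 2, passing them explicitly skips that extra computation:
df.groupBy("Vehicle_id").pivot("Owner_ID", [1, 2]).agg(psf.max("Vehicle_Buy_Date")).show()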
Upvotes: 1
Reputation: 4967
It's not a good idea to transpose your data; it's easy to reduce
your data to a structure like this instead (I think it's a better way):
Vehicle_id | Owner_ID_Vehicle_Buy_Date_Reduce
--------------------------------------------
1 | [{"Owner_ID":1, Vehicle_Buy_Date: 01/01/2015},{ "Owner_ID": 2, Vehicle_Buy_Date: 01/10/2014}]
2 | [{"Owner_ID":1, Vehicle_Buy_Date: 10/10/2016}]
How would you transpose a structure like this?
Vehicle_id | Owner_ID | Vehicle_Buy_Date
--------------------------------------------
1 | 1 | 01/01/2015
1 | 2 | 01/10/2014
1 | 3 | 01/10/2013
2 | 4 | 10/10/2016
... | ... | ...
9 | 123 | 10/10/2017
Upvotes: 0