user8510536

Pivoting a DataFrame in PySpark

I have a requirement as below:

Input dataframe:

id    code
R101  GTR001
R201  RTY987
R301  KIT158
R201  PLI564
R101  MJU098
R301  OUY579

There can be many codes per id (not only two).

The expected output should be as below.

id    col1    col2    col3    col4    col5    col6
R101  GTR001  MJU098  null    null    null    null
R201  null    null    RTY987  PLI564  null    null
R301  null    null    null    null    KIT158  OUY579

Here the columns populated for a particular id depend on the number of codes assigned to that id: the codes for R101 go under col1 and col2, the codes for R201 go under col3 and col4, and the same pattern continues for the remaining ids.

Upvotes: 1

Views: 2430

Answers (1)

Suresh

Reputation: 5870

You can try ranking the code field based on id and using the rank for pivoting. Hope this helps.

    # sample data
    df = spark.createDataFrame([('R101','GTR001'),('R201','RTY987'),
                                ('R301','KIT158'),('R201','PLI564'),
                                ('R101','MJU098'),('R301','OUY579')],
                               ['id','code'])
    df.show()
    +----+------+
    |  id|  code|
    +----+------+
    |R101|GTR001|
    |R201|RTY987|
    |R301|KIT158|
    |R201|PLI564|
    |R101|MJU098|
    |R301|OUY579|
    +----+------+

    from pyspark.sql import functions as F
    from pyspark.sql import Window

    # no partitionBy: each distinct (id, code) pair gets a unique rank 1..6
    df = df.withColumn('rank', F.dense_rank().over(Window.orderBy('id', 'code')))

    # turn the rank into a column name and pivot, one column per rank
    df.withColumn('combcol', F.concat(F.lit('col_'), df['rank'])) \
      .groupby('id').pivot('combcol').agg(F.first('code')).show()

    +----+------+------+------+------+------+------+
    |  id| col_1| col_2| col_3| col_4| col_5| col_6|
    +----+------+------+------+------+------+------+
    |R101|GTR001|MJU098|  null|  null|  null|  null|
    |R201|  null|  null|PLI564|RTY987|  null|  null|
    |R301|  null|  null|  null|  null|KIT158|OUY579|
    +----+------+------+------+------+------+------+
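Note that the window orders codes alphabetically within each id, which is why R201 shows PLI564 before RTY987 rather than in the question's order. The global ordering (no partitionBy) is also what gives each id its own block of columns. If you instead wanted every id's codes to start at col_1 (a compact layout rather than the diagonal one above), a minimal sketch, assuming the same df as above, would partition the window by id:

    from pyspark.sql import functions as F
    from pyspark.sql import Window

    # rank codes within each id, so ranks restart at 1 for every id
    w = Window.partitionBy('id').orderBy('code')
    df.withColumn('rank', F.row_number().over(w)) \
      .withColumn('combcol', F.concat(F.lit('col_'), F.col('rank'))) \
      .groupby('id').pivot('combcol').agg(F.first('code')).show()

With this sample data, that variant would yield only col_1 and col_2 with no nulls, since every id here has exactly two codes.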

Upvotes: 4
