Reputation:
I have a requirement as below. Input dataframe:

id    code
R101  GTR001
R201  RTY987
R301  KIT158
R201  PLI564
R101  MJU098
R301  OUY579

There can be many codes per id (not only two).
The expected output should be as below:

id    col1    col2    col3    col4    col5    col6
R101  GTR001  MJU098  null    null    null    null
R201  null    null    RTY987  PLI564  null    null
R301  null    null    null    null    KIT158  OUY579
Here the columns populated for a particular id depend on the number of codes assigned to that id, i.e. the codes for R101 should be populated under col1 and col2, the codes for R201 under col3 and col4, and the same goes for the remaining ids.
Upvotes: 1
Views: 2430
Reputation: 5870
You can try ranking the code field based on id and using the rank for pivoting. Hope this helps:
df = spark.createDataFrame([('R101','GTR001'),('R201','RTY987'),('R301','KIT158'),
                            ('R201','PLI564'),('R101','MJU098'),('R301','OUY579')],
                           ['id','code'])
df.show()
+----+------+
| id| code|
+----+------+
|R101|GTR001|
|R201|RTY987|
|R301|KIT158|
|R201|PLI564|
|R101|MJU098|
|R301|OUY579|
+----+------+
from pyspark.sql import functions as F
from pyspark.sql import Window
# rank each (id, code) pair over the whole dataset so every code gets its own column slot
df = df.withColumn('rank',F.dense_rank().over(Window.orderBy("id","code")))
# turn the rank into a column name and pivot on it
df.withColumn('combcol',F.concat(F.lit('col_'),df['rank'])).groupby('id').pivot('combcol').agg(F.first('code')).show()
+----+------+------+------+------+------+------+
| id| col_1| col_2| col_3| col_4| col_5| col_6|
+----+------+------+------+------+------+------+
|R101|GTR001|MJU098| null| null| null| null|
|R201| null| null|PLI564|RTY987| null| null|
|R301| null| null| null| null|KIT158|OUY579|
+----+------+------+------+------+------+------+
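Note that Window.orderBy("id","code") sorts the codes alphabetically within each id, which is why RTY987 lands in col_4 after PLI564 here, while your expected output has it in col3. A minimal variant, assuming you want the columns filled in the original row order and that the DataFrame has not been shuffled since creation, is to tag each row with monotonically_increasing_id and rank on that instead:

from pyspark.sql import functions as F
from pyspark.sql import Window

# 'seq' preserves arrival order (safe here because the DataFrame was
# built from a local list and not repartitioned afterwards)
df = df.withColumn('seq', F.monotonically_increasing_id())
df = df.withColumn('rank', F.dense_rank().over(Window.orderBy('id', 'seq')))
df.withColumn('combcol', F.concat(F.lit('col_'), df['rank'])) \
  .groupby('id').pivot('combcol').agg(F.first('code')).show()

This puts RTY987 in col_3 and PLI564 in col_4, matching the expected output. If you need headers like col1 rather than col_1, change F.lit('col_') to F.lit('col').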
Upvotes: 4