Deb

Reputation: 539

Converting (pivoting) rows into columns in PySpark

I have a Spark dataframe in the format below, where each unique id can have a maximum of 3 rows, as given by the rank column.

 id pred    prob      rank
485 9716    0.19205872  1
729 9767    0.19610429  1
729 9716    0.186840048 2
729 9748    0.173447074 3
818 9731    0.255104463 1
818 9748    0.215499913 2
818 9716    0.207307154 3

I want to convert (pivot) this into wide data such that each id has just one row, and the pred and prob values are spread across multiple columns differentiated by the rank value (as a column suffix).

id  pred_1  prob_1      pred_2  prob_2     pred_3   prob_3
485 9716    0.19205872              
729 9767    0.19610429  9716    0.186840048 9748    0.173447074
818 9731    0.255104463 9748    0.215499913 9716    0.207307154

I am not able to figure out how to do it in PySpark.

Sample code for input data creation:

# Creating the input DataFrame (values match the table above)
df = spark.createDataFrame(
    [(485, 9716, 0.19205872, 1), (729, 9767, 0.19610429, 1),
     (729, 9716, 0.186840048, 2), (729, 9748, 0.173447074, 3),
     (818, 9731, 0.255104463, 1), (818, 9748, 0.215499913, 2),
     (818, 9716, 0.207307154, 3)],
    ('id', 'pred', 'prob', 'rank'))
df.show()

Upvotes: 1

Views: 303

Answers (1)

过过招

Reputation: 4189

This is a pivot-on-multiple-columns problem. Try:

import pyspark.sql.functions as F

df_pivot = (df.groupBy('id')
            .pivot('rank')
            .agg(F.first('pred').alias('pred'), F.first('prob').alias('prob'))
            .orderBy('id'))
df_pivot.show(truncate=False)
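
Note: with two aliased aggregations in the pivot, Spark names the output columns 1_pred, 1_prob, 2_pred, and so on. If you need the exact pred_1/prob_1 layout from the question, a small rename pass (a minimal sketch, assuming that default {rank}_{alias} naming) gets you there:

# Rename {rank}_{name} columns to {name}_{rank} to match the desired header
renamed = df_pivot
for c in df_pivot.columns:
    if c != 'id':  # pivoted columns look like '1_pred', '1_prob', ...
        rank, name = c.split('_', 1)
        renamed = renamed.withColumnRenamed(c, f'{name}_{rank}')
renamed.show(truncate=False)

Missing ranks (such as ranks 2 and 3 for id 485) simply come through as nulls, matching the blanks in the expected output.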

Upvotes: 2
