Pivot in pySpark

Question

I have a dataframe:

student_id  class  score
1           A      6
1           B      7
1           C      8

I would like to split the score into three columns, one per class, so that the dataframe above becomes:

student_id  class_A_score  class_B_score  class_C_score
1           6              7              8

The idea is to turn the class values A, B and C into three separate columns.

cph_sto · Accepted Answer

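from pyspark.sql import SparkSession

# Reuse the active session if there is one (the pyspark shell already exposes it as `spark`)
spark = SparkSession.builder.getOrCreate()

values = [(1, 'A', 6), (1, 'B', 7), (1, 'C', 8)]
df = spark.createDataFrame(values, ['student_id', 'class', 'score'])
df.show()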
+----------+-----+-----+
|student_id|class|score|
+----------+-----+-----+
|         1|    A|    6|
|         1|    B|    7|
|         1|    C|    8|
+----------+-----+-----+
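# Group by student, turn each distinct class value into its own column,
# and fill each column with the sum of that class's scores
df = df.groupBy("student_id").pivot("class").sum("score")
df.show()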
+----------+---+---+---+
|student_id|  A|  B|  C|
+----------+---+---+---+
|         1|  6|  7|  8|
+----------+---+---+---+
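
The pivoted columns above are named A, B and C, while the question asks for class_A_score, class_B_score and class_C_score. A minimal sketch of one way to get those names, assuming the class values are known up front. `pivoted_column_names` and `pivot_scores` are hypothetical helper names used for illustration, not part of PySpark:

```python
def pivoted_column_names(class_values, prefix="class_", suffix="_score"):
    """Map each pivoted column name (e.g. 'A') to the requested name (e.g. 'class_A_score')."""
    return {value: f"{prefix}{value}{suffix}" for value in class_values}


def pivot_scores(df, class_values):
    """Pivot the unpivoted dataframe (student_id, class, score) and rename the new columns.

    Assumption: `df` is a PySpark DataFrame with the three columns from the question.
    """
    # Listing the values in pivot() fixes the column order and spares Spark
    # the extra job it would otherwise run to find the distinct classes.
    pivoted = df.groupBy("student_id").pivot("class", class_values).sum("score")
    for old_name, new_name in pivoted_column_names(class_values).items():
        pivoted = pivoted.withColumnRenamed(old_name, new_name)
    return pivoted
```

This would be called on the original, unpivoted dataframe, for example `pivot_scores(df, ['A', 'B', 'C']).show()` before `df` is reassigned to the pivoted result.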
