mommomonthewind

Reputation: 4650

Pivot in pySpark

I have a dataframe:

student_id class score
1 A 6
1 B 7
1 C 8

I would like to spread the class scores across 3 columns, so that the dataframe above becomes:

student_id class_A_score class_B_score class_C_score
1 6 7 8

The idea is to turn the class values A, B and C into 3 columns.

Upvotes: 1

Views: 152

Answers (2)

Rahul Chawla

Reputation: 1078

This is a classic pivot. In PySpark, if df is your dataframe:

new_df = df.groupBy(['student_id']).pivot('class').sum('score')

Databricks has a very nice illustration of this at https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html

Upvotes: 1

cph_sto

Reputation: 7607

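# Build the sample dataframe from the question.
# sqlContext is predefined in older PySpark shells; on Spark 2.0+ the `spark` session offers the same createDataFrame method.
values = [(1,'A',6),(1,'B',7),(1,'C',8)]
df = sqlContext.createDataFrame(values,['student_id','class','score'])
df.show()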
+----------+-----+-----+
|student_id|class|score|
+----------+-----+-----+
|         1|    A|    6|
|         1|    B|    7|
|         1|    C|    8|
+----------+-----+-----+
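# One output column per distinct class value; sum adds up the scores within each (student_id, class) group.
df = df.groupBy(["student_id"]).pivot("class").sum("score")
df.show()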
+----------+---+---+---+
|student_id|  A|  B|  C|
+----------+---+---+---+
|         1|  6|  7|  8|
+----------+---+---+---+
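
The pivoted output above names the columns A, B and C, while the question asks for class_A_score, class_B_score and class_C_score. A minimal sketch of the renaming step, assuming the same column names as the example; score_column_names and pivot_scores are hypothetical helper names, not part of PySpark:

```python
def score_column_names(columns, key="student_id"):
    """Return class_<X>_score for every pivoted column, leaving the key column as-is."""
    return [c if c == key else "class_{}_score".format(c) for c in columns]


def pivot_scores(df):
    """Pivot `class` into columns and rename them (assumes student_id, class, score columns)."""
    # Same pivot as above: one column per distinct class value.
    pivoted = df.groupBy("student_id").pivot("class").sum("score")
    # toDF replaces all column names at once, in positional order.
    return pivoted.toDF(*score_column_names(pivoted.columns))
```

With the original (unpivoted) dataframe, pivot_scores(df).show() should print a single row under the headers student_id, class_A_score, class_B_score and class_C_score. If the class values are known in advance, passing them as a second argument to pivot, e.g. pivot("class", ["A", "B", "C"]), spares Spark the extra pass it otherwise makes to discover them.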

Upvotes: 1
