Reputation: 105
I am trying to transpose data in pyspark. I was able to transpose using a single column. However, with multiple columns I am not sure how to pass parameters to explode function.
Input format:
Output Format :
Can someone please hint me with any example or reference? Thanks in advance.
Upvotes: 0
Views: 147
Reputation: 2011
A cleaned PySpark version of this
from pyspark.sql import functions as F
df_a = spark.createDataFrame([(1,'xyz','MS','abc','Phd','pqr','BS'),(2,"POR","MS","ABC","Phd","","")],[
"id","Education1CollegeName","Education1Degree","Education2CollegeName","Education2Degree","Education3CollegeName","Education3Degree"])
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| id|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| 1| xyz| MS| abc| Phd| pqr| BS|
| 2| POR| MS| ABC| Phd| | |
+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
Code -
df = df_a.selectExpr("id", "stack(3, Education1CollegeName, Education1Degree,Education2CollegeName, Education2Degree,Education3CollegeName, Education3Degree) as (B, C)")
+---+---+---+
| id| B| C|
+---+---+---+
| 1|xyz| MS|
| 1|abc|Phd|
| 1|pqr| BS|
| 2|POR| MS|
| 2|ABC|Phd|
| 2| | |
+---+---+---+
Upvotes: 0
Reputation: 6338
use stack
to transpose as below (spark>=2.4
)-
val data =
"""
|PersonId | Education1CollegeName | Education1Degree | Education2CollegeName | Education2Degree |Education3CollegeName | Education3Degree
| 1 | xyz | MS | abc | Phd | pqr | BS
| 2 | POR | MS | ABC | Phd | null | null
""".stripMargin
val stringDS1 = data.split(System.lineSeparator())
.map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
.toSeq.toDS()
val df1 = spark.read
.option("sep", "|")
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "null")
.csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |PersonId|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
* |1 |xyz |MS |abc |Phd |pqr |BS |
* |2 |POR |MS |ABC |Phd |null |null |
* +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
*
* root
* |-- PersonId: integer (nullable = true)
* |-- Education1CollegeName: string (nullable = true)
* |-- Education1Degree: string (nullable = true)
* |-- Education2CollegeName: string (nullable = true)
* |-- Education2Degree: string (nullable = true)
* |-- Education3CollegeName: string (nullable = true)
* |-- Education3Degree: string (nullable = true)
*/
df1.selectExpr("PersonId",
"stack(3, Education1CollegeName, Education1Degree, Education2CollegeName, Education2Degree, " +
"Education3CollegeName, Education3Degree) as (CollegeName, EducationDegree)")
.where("CollegeName is not null and EducationDegree is not null")
.show(false)
/**
* +--------+-----------+---------------+
* |PersonId|CollegeName|EducationDegree|
* +--------+-----------+---------------+
* |1 |xyz |MS |
* |1 |abc |Phd |
* |1 |pqr |BS |
* |2 |POR |MS |
* |2 |ABC |Phd |
* +--------+-----------+---------------+
*/
Upvotes: 2