Yugesha Sapte
Yugesha Sapte

Reputation: 105

How to transpose data in pyspark for multiple different columns

I am trying to transpose data in pyspark. I was able to transpose using a single column. However, with multiple columns I am not sure how to pass parameters to explode function.

Input format:

enter image description here

Output Format :

enter image description here

Can someone please hint me with any example or reference? Thanks in advance.

Upvotes: 0

Views: 147

Answers (2)

dsk
dsk

Reputation: 2011

A cleaned PySpark version of this

from pyspark.sql import functions as F
df_a = spark.createDataFrame([(1,'xyz','MS','abc','Phd','pqr','BS'),(2,"POR","MS","ABC","Phd","","")],[ 

"id","Education1CollegeName","Education1Degree","Education2CollegeName","Education2Degree","Education3CollegeName","Education3Degree"])

+---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
| id|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|

    +---+---------------------+----------------+---------------------+----------------+---------------------+----------------+
    |  1|                  xyz|              MS|                  abc|             Phd|                  pqr|              BS|
    |  2|                  POR|              MS|                  ABC|             Phd|                     |                |
    +---+---------------------+----------------+---------------------+----------------+---------------------+----------------+

Code -

df = df_a.selectExpr("id", "stack(3, Education1CollegeName, Education1Degree,Education2CollegeName, Education2Degree,Education3CollegeName, Education3Degree) as (B, C)")

+---+---+---+
| id|  B|  C|
+---+---+---+
|  1|xyz| MS|
|  1|abc|Phd|
|  1|pqr| BS|
|  2|POR| MS|
|  2|ABC|Phd|
|  2|   |   |
+---+---+---+

Upvotes: 0

Som
Som

Reputation: 6338

use stack to transpose as below (spark>=2.4)-

Load the test data

val data =
      """
        |PersonId | Education1CollegeName | Education1Degree | Education2CollegeName | Education2Degree |Education3CollegeName | Education3Degree
        | 1 | xyz | MS | abc | Phd | pqr | BS
        |  2 | POR | MS | ABC | Phd | null | null
      """.stripMargin
    val stringDS1 = data.split(System.lineSeparator())
      .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString("|"))
      .toSeq.toDS()
    val df1 = spark.read
      .option("sep", "|")
      .option("inferSchema", "true")
      .option("header", "true")
      .option("nullValue", "null")
      .csv(stringDS1)
    df1.show(false)
    df1.printSchema()

    /**
      * +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
      * |PersonId|Education1CollegeName|Education1Degree|Education2CollegeName|Education2Degree|Education3CollegeName|Education3Degree|
      * +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
      * |1       |xyz                  |MS              |abc                  |Phd             |pqr                  |BS              |
      * |2       |POR                  |MS              |ABC                  |Phd             |null                 |null            |
      * +--------+---------------------+----------------+---------------------+----------------+---------------------+----------------+
      *
      * root
      * |-- PersonId: integer (nullable = true)
      * |-- Education1CollegeName: string (nullable = true)
      * |-- Education1Degree: string (nullable = true)
      * |-- Education2CollegeName: string (nullable = true)
      * |-- Education2Degree: string (nullable = true)
      * |-- Education3CollegeName: string (nullable = true)
      * |-- Education3Degree: string (nullable = true)
      */

Un-pivot the table using stack


    df1.selectExpr("PersonId",
      "stack(3, Education1CollegeName, Education1Degree, Education2CollegeName, Education2Degree, " +
        "Education3CollegeName, Education3Degree) as (CollegeName, EducationDegree)")
      .where("CollegeName is not null and EducationDegree is not null")
      .show(false)

    /**
      * +--------+-----------+---------------+
      * |PersonId|CollegeName|EducationDegree|
      * +--------+-----------+---------------+
      * |1       |xyz        |MS             |
      * |1       |abc        |Phd            |
      * |1       |pqr        |BS             |
      * |2       |POR        |MS             |
      * |2       |ABC        |Phd            |
      * +--------+-----------+---------------+
      */

Upvotes: 2

Related Questions