mobupu

Reputation: 255

How to unpivot a large Spark dataframe?

I have seen a few solutions for unpivoting a Spark dataframe when the number of columns is reasonably low and the column names can be hardcoded. Do you have a scalable solution to unpivot a dataframe with numerous columns?

Below is a toy problem.

Input:

  val df = Seq(
    (1,1,1,0),
    (2,0,0,1)    
  ).toDF("ID","A","B","C")

+---+---+---+---+
| ID|  A|  B|  C|
+---+---+---+---+
|  1|  1|  1|  0|
|  2|  0|  0|  1|
+---+---+---+---+

expected result:

+---+-----+-----+
| ID|names|count|
+---+-----+-----+
|  1|    A|    1|
|  1|    B|    1|
|  1|    C|    0|
|  2|    A|    0|
|  2|    B|    0|
|  2|    C|    1|
+---+-----+-----+

The solution should be applicable to datasets with N columns to unpivot, where N is large (say 100 columns).

Upvotes: 3

Views: 6711

Answers (2)

Vitaliy

Reputation: 8206

This can be done in pure Spark SQL by stacking the columns.

Below is an example in PySpark that can easily be adapted to Scala. The Python code is only needed to construct the SQL dynamically from the relevant fields. I use this approach quite often.

from pyspark.sql.types import * 

df = spark.createDataFrame([
  {"id" : 1, "A" : 1, "B" : 1, "C" : 0},
  {"id" : 2, "A" : 0, "B" : 0, "C" : 1}
],
StructType([StructField("id", IntegerType()), StructField("A", IntegerType()),StructField("B", IntegerType()) , StructField("C", IntegerType())]))

def features_to_eav(df, subset=None):
  # Unpivot the given columns (or all columns) into (id, name, value) rows.
  relevant_columns = subset if subset else df.columns
  n = len(relevant_columns)

  # Build the arguments of Spark SQL's stack(): pairs of 'colName', colName.
  cols_to_stack = ", ".join(["'{c}', {c}".format(c=c) for c in relevant_columns])
  stack_expression = "stack({}, {}) as (name, value)".format(n, cols_to_stack)

  # Register a temporary view holding the id column plus the columns to unpivot.
  df.select(*(["id"] + relevant_columns)).createOrReplaceTempView("features_to_check")

  sql = "select id, {} from features_to_check".format(stack_expression)
  print("Stacking sql:", sql)
  return spark.sql(sql)

features_to_eav(df, ["A", "B", "C"]).show()

The output (pay attention to the constructed SQL):

Stacking sql: select id, stack(3, 'A', A, 'B', B, 'C', C) as (name, value) from features_to_check
+---+----+-----+
| id|name|value|
+---+----+-----+
|  1|   A|    1|
|  1|   B|    1|
|  1|   C|    0|
|  2|   A|    0|
|  2|   B|    0|
|  2|   C|    1|
+---+----+-----+
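
The same idea translates to Scala roughly as follows (a minimal sketch, assuming a SparkSession in scope as spark, the df from the question, and column names that do not need escaping):

// Build the stack() expression dynamically from the columns to unpivot.
val colsToStack = Seq("A", "B", "C")
val stackArgs = colsToStack.map(c => s"'$c', $c").mkString(", ")
val stackExpr = s"stack(${colsToStack.size}, $stackArgs) as (names, count)"

df.selectExpr("ID", stackExpr).show()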

Upvotes: 4

Mikel San Vicente

Reputation: 3863

This should work. I am assuming you know the list of columns that you want to unpivot on:

import org.apache.spark.sql.{functions => func, _}

val df = Seq(
    (1,1,1,0),
    (2,0,0,1)    
  ).toDF("ID","A","B","C")

val cols = Seq("A", "B", "C")

df.select(
    $"ID",
    func.explode(
        func.array(
            cols.map(
                col =>
                    func.struct(    
                        func.lit(col).alias("names"),
                        func.col(col).alias("count")
                    )
            ): _*
        )
    ).alias("v")
)
.selectExpr("ID", "v.*")
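
If the list of columns is too long to hardcode, it can be derived from the schema instead; as a sketch, assuming every column except ID should be unpivoted:

// Take every column except the ID column as a column to unpivot.
val cols = df.columns.filterNot(_ == "ID").toSeq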

Upvotes: 9
