Reputation: 402

Reshape spark data frame of key-value pairs with keys as new columns

I am new to Spark and Scala. Let's say I have a data frame whose columns hold lists that together form key-value pairs. Is there a way to turn the ids in the ids column into new columns?

df.show()
+--------------------+----------------------+
| ids                | vals                 |
+--------------------+----------------------+
|[id1,id2,id3]       | null                 |
|[id2,id5,id6]       |[WrappedArray(0,2,4)] |
|[id2,id4,id7]       |[WrappedArray(6,8,10)]|
+--------------------+----------------------+

Expected output:

+----+----+
|id1 | id2| ...
+----+----+
|null| 0  | ...
|null| 6  | ...

Upvotes: 3

Answers (1)

maasg

Reputation: 37435

A possible way would be to compute the columns of the new DataFrame and use those columns to construct the rows.

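import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// toDF and the $"..." syntax also need sqlContext.implicits._ (imported automatically in spark-shell)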

val data = List((Seq("id1","id2","id3"),None),(Seq("id2","id4","id5"),Some(Seq(2,4,5))),(Seq("id3","id5","id6"),Some(Seq(3,5,6))))

val df = sparkContext.parallelize(data).toDF("ids","values")

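// pair each id with its value; rows without values are dropped
// (going through .rdd yields an RDD on Spark 2.x as well as 1.x)
val values = df.rdd.flatMap{
  case Row(t1:Seq[String], t2:Seq[Int]) => Some((t1 zip t2).toMap)
  case Row(_, null) => None
}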

// get the unique names of the columns across the original data
val ids = df.select(explode($"ids")).distinct.collect.map(_.getString(0))

// map the values to the new columns (the boxed value, or null when the row lacks that id)
val transposed = values.map(entry => Row.fromSeq(ids.map(id => entry.get(id).map(Int.box).orNull)))

// programmatically recreate the target schema with the columns we found in the data
import org.apache.spark.sql.types._
val schema = StructType(ids.map(id => StructField(id, IntegerType, nullable=true)))

// Create the new DataFrame
val transposedDf = sqlContext.createDataFrame(transposed, schema)

This process passes over the data twice, although depending on the backing data source, computing the column names can be fairly cheap.
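To make the two passes easier to follow, here is a minimal sketch of the same logic on plain Scala collections, with no Spark involved. The names rows, columns and transposed are illustrative only, and the ids are sorted here just to get a stable column order.

```scala
// Plain-Scala sketch of the two passes; the sample rows mirror `data` above.
val rows: List[(Seq[String], Option[Seq[Int]])] = List(
  (Seq("id1", "id2", "id3"), None),
  (Seq("id2", "id4", "id5"), Some(Seq(2, 4, 5))),
  (Seq("id3", "id5", "id6"), Some(Seq(3, 5, 6)))
)

// Pass 1: collect the distinct ids; these become the column names.
val columns: List[String] = rows.flatMap(_._1).distinct.sorted

// Pass 2: turn each row that has values into an id -> value map,
// then look up every column (None where the row lacks that id).
val transposed: List[List[Option[Int]]] = rows.collect {
  case (ids, Some(values)) =>
    val entry = ids.zip(values).toMap
    columns.map(entry.get)
}
```

As in the Spark version, the row without values contributes its ids to the column list but produces no output row, so the id1 column exists and is empty everywhere.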

Also, this goes back and forth between DataFrames and RDDs. I would be interested in seeing a "pure" DataFrame process.
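One way a DataFrame-only version might look is sketched below. It is not the method used above: it swaps in posexplode plus groupBy/pivot, and assumes Spark 2.1 or later with a SparkSession. The row_id column, the local session, and the app name are hypothetical helpers added for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{first, monotonically_increasing_id, posexplode}

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (Seq("id1", "id2", "id3"), Option.empty[Seq[Int]]),
  (Seq("id2", "id4", "id5"), Option(Seq(2, 4, 5))),
  (Seq("id3", "id5", "id6"), Option(Seq(3, 5, 6)))
).toDF("ids", "values")

val pivoted = df
  // skip rows without values, as the RDD version does
  .where($"values".isNotNull)
  // hypothetical helper column that keeps the original rows apart
  .withColumn("row_id", monotonically_increasing_id())
  // one output row per id, together with its position in the array
  .select($"row_id", $"values", posexplode($"ids").as(Seq("pos", "id")))
  // pick the value stored at the same position as the id
  .select($"row_id", $"id", $"values"($"pos").as("value"))
  .groupBy("row_id")
  .pivot("id")
  .agg(first("value"))
  .drop("row_id")
```

Unlike the RDD version, this sketch only creates columns for ids that occur in rows with values, so id1 is absent from the result. Calling pivot without an explicit list of values also makes Spark run an extra job to find the distinct ids, so the data is still read twice.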

Upvotes: 3
