Robert25

Reputation: 103

Convert keys into column names and values into rows (Map)

I have a dataframe that contains a Map column and an id column.

key1 -> value1, key2 -> value2
key1 -> value3, key2 -> value4

I want the result to be a dataframe like this:

id   key1     key2
1    value1   value2
2    value3   value4

Thanks for your help.

Upvotes: 1

Views: 134

Answers (1)

ELinda

Reputation: 2821

I assume you are talking about a Spark DataFrame. In that case, you can use the DataFrame's map method to extract the values you want. Here is an example using spark-shell (which automatically imports many of the implicit methods).

Note that toDF is used twice: once to build a DataFrame from built-in data structures, and again to rename the columns of the new DataFrame obtained from the map method of the original DataFrame.

The show method is called to display the "before" and "after" results.

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val m = Map(1-> Map("key1" -> "v1", "key2" -> "v2"), 2 -> Map("key1" -> "v3", "key2" -> "v4"))
m: scala.collection.immutable.Map[Int,scala.collection.immutable.Map[String,String]] = Map(1 -> Map(key1 -> v1, key2 -> v2), 2 -> Map(key1 -> v3, key2 -> v4))

scala> val df = m.toSeq.toDF("id", "map_value")
df: org.apache.spark.sql.DataFrame = [id: int, map_value: map<string,string>]

scala> df.show()
+---+--------------------+
| id|           map_value|
+---+--------------------+
|  1|[key1 -> v1, key2...|
|  2|[key1 -> v3, key2...|
+---+--------------------+ 

scala> val get_map:Function1[Row, Map[String,String]] = r => r.getAs[Map[String, String]]("map_value")
get_map: org.apache.spark.sql.Row => Map[String,String] = <function1>

scala> df.map(r => (r.getAs[Int]("id"), get_map(r).get("key1"), get_map(r).get("key2"))).toDF("id", "val1", "val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
|  1|  v1|  v2|
|  2|  v3|  v4|
+---+----+----+
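One detail worth noting in the map call above: Map#get returns an Option[String], not a raw String, and Spark's product encoder represents Option[String] as a nullable string column, so a missing key yields null rather than an exception. A minimal plain-Scala sketch of that behavior (no Spark required; the map contents are just the example values):

```scala
// Map#get returns Option[String]: Some(value) if the key exists, None otherwise.
val row = Map("key1" -> "v1", "key2" -> "v2")

val present: Option[String] = row.get("key1")   // Some("v1")
val absent:  Option[String] = row.get("key3")   // None

// getOrElse substitutes an explicit default for a missing key.
val withDefault: String = row.getOrElse("key3", "n/a")

println(s"$present $absent $withDefault")
```

If you prefer plain strings in the result, getOrElse (or getAs with a default) avoids Option-typed columns entirely.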

Edit:

This edit addresses a variable number of columns. Here, N is one more than the number of map keys (there are 7 keys, so N is 8), because Range(1, N) excludes its upper bound. Likewise, Range(1, 3) yields the 2 row ids.

It is more convenient to use the DataFrame's select method in this case, to avoid having to build tuples dynamically.

scala> val N = 8
N: Int = 8

scala> val map_value:Function1[Int,Map[String,String]] = (i: Int) => Map((for (n <- Range(1, N)) yield (s"k${n}", s"v${n*i}")).toList:_*)
map_value: Int => Map[String,String] = <function1>

scala> val m = Map((for (i <- Range(1, 3)) yield (i, map_value(i))).toList:_*)
m: scala.collection.immutable.Map[Int,Map[String,String]] = Map(1 -> Map(k2 -> v2, k5 -> v5, k6 -> v6, k7 -> v7, k1 -> v1, k4 -> v4, k3 -> v3), 2 -> Map(k2 -> v4, k5 -> v10, k6 -> v12, k7 -> v14, k1 -> v2, k4 -> v8, k3 -> v6))

scala> val df0 = m.toSeq.toDF("id", "map_value")
df0: org.apache.spark.sql.DataFrame = [id: int, map_value: map<string,string>]

scala> val column_names:List[String] = (for (n <- Range(1, N)) yield (s"map_value.k${n}")).toList
column_names: List[String] = List(map_value.k1, map_value.k2, map_value.k3, map_value.k4, map_value.k5, map_value.k6, map_value.k7)

scala> df0.select("id", column_names:_*).show()
+---+---+---+---+---+---+---+---+
| id| k1| k2| k3| k4| k5| k6| k7|
+---+---+---+---+---+---+---+---+
|  1| v1| v2| v3| v4| v5| v6| v7|
|  2| v2| v4| v6| v8|v10|v12|v14|
+---+---+---+---+---+---+---+---+
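If the key set is not known ahead of time, the column list can be derived from the data instead of from a hard-coded N. The collection logic can be sketched in plain Scala (the data and names here are illustrative; in Spark, the keys could be gathered from the map column itself):

```scala
// Collect the union of keys across all rows, then build the
// "map_value.<key>" selectors that select() expects.
val m = Map(
  1 -> Map("k1" -> "v1", "k2" -> "v2"),
  2 -> Map("k1" -> "v2", "k2" -> "v4"))

val keys: List[String] = m.values.flatMap(_.keys).toSet.toList.sorted
val columnNames: List[String] = keys.map(k => s"map_value.$k")

println(columnNames)  // List(map_value.k1, map_value.k2)
```

These names could then be passed to df0.select("id", columnNames:_*) exactly as in the transcript above.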

Upvotes: 1
