Reputation: 103
I have a dataframe that contains a Map column and an id column:
1: key1 -> value1, key2 -> value2
2: key1 -> value3, key2 -> value4
I want to get a dataframe like this as the result:
id key1 key2
1 value1 value2
2 value3 value4
Thanks for your help.
Upvotes: 1
Reputation: 2821
I assume you are talking about a Spark DataFrame. In that case, you can use the map method of the DataFrame to extract the values you want. Here is an example using spark-shell (which automatically imports many of the implicit methods).
Note that toDF is used twice: once to build a DataFrame from built-in data structures, and a second time to rename the columns of the new DataFrame obtained from the map method of the original DataFrame. The show method is called to display "before" and "after".
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val m = Map(1-> Map("key1" -> "v1", "key2" -> "v2"), 2 -> Map("key1" -> "v3", "key2" -> "v4"))
m: scala.collection.immutable.Map[Int,scala.collection.immutable.Map[String,String]] = Map(1 -> Map(key1 -> v1, key2 -> v2), 2 -> Map(key1 -> v3, key2 -> v4))
scala> val df = m.toSeq.toDF("id", "map_value")
df: org.apache.spark.sql.DataFrame = [id: int, map_value: map<string,string>]
scala> df.show()
+---+--------------------+
| id| map_value|
+---+--------------------+
| 1|[key1 -> v1, key2...|
| 2|[key1 -> v3, key2...|
+---+--------------------+
scala> val get_map:Function1[Row, Map[String,String]] = r => r.getAs[Map[String, String]]("map_value")
get_map: org.apache.spark.sql.Row => Map[String,String] = <function1>
scala> df.map(r => (r.getAs[Int]("id"), get_map(r).get("key1"), get_map(r).get("key2"))).toDF("id", "val1", "val2").show()
+---+----+----+
| id|val1|val2|
+---+----+----+
| 1| v1| v2|
| 2| v3| v4|
+---+----+----+
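Outside Spark, the per-row extraction that the df.map call performs is just a lookup into each row's map. Here is a plain-Scala sketch of the same reshaping (no spark-shell needed; it assumes the data fits in local memory):

```scala
// Plain-Scala sketch of the flattening that df.map performs per row.
// The map m mirrors the data loaded into the DataFrame above.
val m = Map(
  1 -> Map("key1" -> "v1", "key2" -> "v2"),
  2 -> Map("key1" -> "v3", "key2" -> "v4")
)

// For each (id, map) entry, pull key1 and key2 out into a flat tuple,
// sorting by id so the row order is deterministic.
val rows = m.toSeq.sortBy(_._1).map { case (id, mv) =>
  (id, mv("key1"), mv("key2"))
}
// rows: Seq((1, "v1", "v2"), (2, "v3", "v4"))
```

In Spark the same lookups run on the executors instead, and the tuple positions become the columns that the second toDF renames.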
Edit:
This addresses a variable number of columns. Range(1, N) excludes its upper bound, so N = 8 below yields 7 key columns (and Range(1, 3) yields 2 rows). It is more convenient to use the select method of the DataFrame in this case, to avoid having to build tuples whose arity varies.
scala> val N = 8
N: Int = 8
scala> val map_value:Function1[Int,Map[String,String]] = (i: Int) => Map((for (n <- Range(1, N)) yield (s"k${n}", s"v${n*i}")).toList:_*)
map_value: Int => Map[String,String] = <function1>
scala> val m = Map((for (i <- Range(1, 3)) yield (i, map_value(i))).toList:_*)
m: scala.collection.immutable.Map[Int,Map[String,String]] = Map(1 -> Map(k2 -> v2, k5 -> v5, k6 -> v6, k7 -> v7, k1 -> v1, k4 -> v4, k3 -> v3), 2 -> Map(k2 -> v4, k5 -> v10, k6 -> v12, k7 -> v14, k1 -> v2, k4 -> v8, k3 -> v6))
scala> val df0 = m.toSeq.toDF("id", "map_value")
df0: org.apache.spark.sql.DataFrame = [id: int, map_value: map<string,string>]
scala> val column_names:List[String] = (for (n <- Range(1, N)) yield (s"map_value.k${n}")).toList
column_names: List[String] = List(map_value.k1, map_value.k2, map_value.k3, map_value.k4, map_value.k5, map_value.k6, map_value.k7)
scala> df0.select("id", column_names:_*).show()
+---+---+---+---+---+---+---+---+
| id| k1| k2| k3| k4| k5| k6| k7|
+---+---+---+---+---+---+---+---+
| 1| v1| v2| v3| v4| v5| v6| v7|
| 2| v2| v4| v6| v8|v10|v12|v14|
+---+---+---+---+---+---+---+---+
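The two dynamic pieces above, the per-row map generator and the list of nested-field column names, can be checked in plain Scala (no spark-shell required). Note that column_names must not itself contain "id", since select is already given "id" as its first argument:

```scala
// Generator for a row's map, as in the transcript: key kN maps to v(N*i).
val N = 8
val mapValue: Int => Map[String, String] =
  (i: Int) => (for (n <- Range(1, N)) yield (s"k$n", s"v${n * i}")).toMap

// Nested-field names passed to select. Range(1, N) excludes N, so this
// yields exactly N - 1 = 7 entries, and "id" is deliberately not among them.
val columnNames: List[String] =
  (for (n <- Range(1, N)) yield s"map_value.k$n").toList
```

Spark resolves each "map_value.kN" string as a nested field access on the map column, and names the resulting column kN, which is why the final show() displays plain k1 through k7 headers.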
Upvotes: 1