Reputation: 149
I have below dataset:
{
"col1": "val1",
"col2": {
"key1": "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
"key2": "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"
}
}
with schema StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StringType,true),true))
.
I want to convert col2
to below format:
{
"col1": "val1",
"col2": {
"key1": {"SubCol1":"ABCD","SubCol2":"EFGH"},
"key2": {"SubCol1":"IJKL","SubCol2":"MNOP"}
}
}
The updated dataset schema will be as below:
StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StructType(StructField(SubCol1,StringType,true), StructField(SubCol2,StringType,true)),true),true))
Upvotes: 0
Views: 1932
Reputation: 10382
Try below code It will work in spark 2.4.7
Creating DataFrame with sample data.
scala> val df = Seq(
("val1",Map(
"key1" -> "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
"key2" -> "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"))
).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: map<string,string>]
Steps:
map_keys
), values (map_values
) into different arrays.map_from_arrays
function to combine keys & values from the above steps to create Map[String, Struct]
scala>
val finalDF = df
.withColumn(
"col2_new",
map_from_arrays(
map_keys($"col2"),
expr("""transform(map_values(col2), x -> from_json(x,"struct<SubCol1:string, SubCol2:string>"))""")
)
)
Printing Schema
finalDF.printSchema
root
|-- col1: string (nullable = true)
|-- col2: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- col2_new: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- SubCol1: string (nullable = true)
| | |-- SubCol2: string (nullable = true)
Printing Final Output
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|col1|col2 |col2_new |
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|val1|[key1 -> {"SubCol1":"ABCD","SubCol2":"EFGH"}, key2 -> {"SubCol1":"IJKL","SubCol2":"MNOP"}]|[key1 -> [ABCD, EFGH], key2 -> [IJKL, MNOP]]|
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
Upvotes: 1
Reputation: 42412
You can use transform_values
on the map column:
val df2 = df.withColumn(
"col2",
expr("transform_values(col2, (k, x) -> from_json(x, 'struct<SubCol1:string, SubCol2:string>'))")
)
Upvotes: 2