user2167322
user2167322

Reputation: 149

Apache Spark - Transform Map[String, String] to Map[String, Struct]

I have below dataset:

{
  "col1": "val1",
  "col2": {
    "key1": "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
    "key2": "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"
  }
}

with schema StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StringType,true),true)).

I want to convert col2 to below format:

{
  "col1": "val1",
  "col2": {
    "key1": {"SubCol1":"ABCD","SubCol2":"EFGH"},
    "key2": {"SubCol1":"IJKL","SubCol2":"MNOP"}
  }
}

The updated dataset schema will be as below:

StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StructType(StructField(SubCol1,StringType,true), StructField(SubCol2,StringType,true)),true),true))

Upvotes: 0

Views: 1932

Answers (2)

s.polam
s.polam

Reputation: 10382

Try below code It will work in spark 2.4.7

Creating DataFrame with sample data.

scala> val df = Seq(
("val1",Map(
            "key1" -> "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
            "key2" -> "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"))
).toDF("col1","col2")

df: org.apache.spark.sql.DataFrame = [col1: string, col2: map<string,string>]

Steps:

  1. Extract map keys (map_keys), values (map_values) into different arrays.
  2. Convert map values into desired output. i.e. Struct
  3. Use map_from_arrays function to combine keys & values from the above steps to create Map[String, Struct]
scala> 
val finalDF = df
.withColumn(
            "col2_new",
            map_from_arrays(
                map_keys($"col2"),
                expr("""transform(map_values(col2), x -> from_json(x,"struct<SubCol1:string, SubCol2:string>"))""")
            )
)

Printing Schema

finalDF.printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- col2_new: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- SubCol1: string (nullable = true)
 |    |    |-- SubCol2: string (nullable = true)

Printing Final Output

+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|col1|col2                                                                                      |col2_new                                    |
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|val1|[key1 -> {"SubCol1":"ABCD","SubCol2":"EFGH"}, key2 -> {"SubCol1":"IJKL","SubCol2":"MNOP"}]|[key1 -> [ABCD, EFGH], key2 -> [IJKL, MNOP]]|
+----+------------------------------------------------------------------------------------------+--------------------------------------------+

Upvotes: 1

mck
mck

Reputation: 42412

You can use transform_values on the map column:

val df2 = df.withColumn(
    "col2", 
    expr("transform_values(col2, (k, x) -> from_json(x, 'struct<SubCol1:string, SubCol2:string>'))")
)

Upvotes: 2

Related Questions