Spark Scala - Split Array of Structs into Dataframe Columns

Question

I have a nested source json file that contains an array of structs. The number of structs varies greatly from row to row and I would like to use Spark (scala) to dynamically create new dataframe columns from the key/values of the struct where the key is the column name and the value is the column value.

Example Minified json record

{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}

dataframe schema

scala> val df = spark.read.json("file:///tmp/nested_test.json")
root
 |-- key1: struct (nullable = true)
 |    |-- key2: struct (nullable = true)
 |    |    |-- key3: string (nullable = true)
 |    |    |-- key4: string (nullable = true)
 |    |    |-- key5: struct (nullable = true)
 |    |    |    |-- key6: string (nullable = true)
 |    |    |    |-- key7: string (nullable = true)
 |    |    |    |-- values: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |-- value: string (nullable = true)

Whats been done so far

df.select(
    ($"key1.key2.key3").as("key3"),
    ($"key1.key2.key4").as("key4"),
    ($"key1.key2.key5.key6").as("key6"),
    ($"key1.key2.key5.key7").as("key7"),
    ($"key1.key2.key5.values").as("values")).
    show(truncate=false)

+----+----+----+----+----------------------------------------------------------------------------+
|key3|key4|key6|key7|values                                                                      |
+----+----+----+----+----------------------------------------------------------------------------+
|AK  |EU  |001 |N   |[[valuesColumn1, 9.876], [valuesColumn2, 1.2345], [valuesColumn3, 8.675309]]|
+----+----+----+----+----------------------------------------------------------------------------+

There is an array of 3 structs here but the 3 structs need to be spilt into 3 separate columns dynamically (the number of 3 can vary greatly), and I am not sure how to do it.

Sample Desired output

Notice that there were 3 new columns produced for each of the array elements within the values array.

+----+----+----+----+-----------------------------------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-----------------------------------------+
|AK  |EU  |001 |N   |9.876        |1.2345        |8.675309    |
+----+----+----+----+-----------------------------------------+

Reference

I believe that the desired solution is something similar to what was discussed in this SO post but with 2 main differences:

The number of columns is hardcoded to 3 in the SO post but in my circumstance, the number of array elements is unknown
The column names need to be driven by the name column and the column value by the value.

...
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |-- value: string (nullable = true)

user9074332 · Accepted Answer

I found that this approach performed much better and was easier to understand using an explode and pivot:

val json = """{"key1":{"key2":{"key3":"AK","key4":"EU","key5":{"key6":"001","key7":"N","values":[{"name":"valuesColumn1","value":"9.876"},{"name":"valuesColumn2","value":"1.2345"},{"name":"valuesColumn3","value":"8.675309"}]}}}}"""

val df = spark.read.json(Seq(json).toDS())

// schema
df.printSchema
root
 |-- key1: struct (nullable = true)
 |    |-- key2: struct (nullable = true)
 |    |    |-- key3: string (nullable = true)
 |    |    |-- key4: string (nullable = true)
 |    |    |-- key5: struct (nullable = true)
 |    |    |    |-- key6: string (nullable = true)
 |    |    |    |-- key7: string (nullable = true)
 |    |    |    |-- values: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |    |-- value: string (nullable = true)

// create final df
val finalDf = df.
    select(
      $"key1.key2.key3".as("key3"),
      $"key1.key2.key4".as("key4"),
      $"key1.key2.key5.key6".as("key6"),
      $"key1.key2.key5.key7".as("key7"),
      explode($"key1.key2.key5.values").as("values")
    ).
    groupBy(
      $"key3", $"key4", $"key6", $"key7"
    ).
    pivot("values.name").
    agg(min("values.value")).alias("values.name")

// result
finalDf.show
+----+----+----+----+-------------+-------------+-------------+
|key3|key4|key6|key7|valuesColumn1|valuesColumn2|valuesColumn3|
+----+----+----+----+-------------+-------------+-------------+
|  AK|  EU| 001|   N|        9.876|       1.2345|     8.675309|
+----+----+----+----+-------------+-------------+-------------+