Exploding column of JSON array

Question

I'm struggling with an issue of exploding a json array column with Spark.

I have a dataframe that looks like:

+------+------------------------------------------------------------------------+
|id    |struct                                                                  |
+------+------------------------------------------------------------------------+
|  1   |  [{_name: BankAccount, _value: 123456}, {_name: Balance, _value: 500$}]|
|  2   |  [{_name: BankAccount, _value: 098765}, {_name: Balance, _value: 100$}]|
|  3   |  [{_name: BankAccount, _value: 135790}, {_name: Balance, _value: 200$}]|
+------+------------------------------------------------------------------------+

and I want it to be as

+------+------------+--------+
|id    | BankAccount| Balance|
+------+------------+--------+
|  1   |   123456   | 500$   |
|  2   |   098765   | 100$   |
|  3   |   135790   | 200$   |
+------+------------+--------+

Of course it is not really exploding but I couldn't be remotely close to the outcome of what I need.

Thanks for your help!

s.polam · Accepted Answer

Check below code.

From your sample data instead of struct I have used data for simplicity.. :)

 val df = Seq((1,"""[{"_name":"BankAccount","_value":"123456"},{"_name":"Balance","_value": "500$"}]"""),(2,"""[{"_name":"BankAccount","_value":"098765"},{"_name":"Balance","_value": "100$"}]"""),(3,"""[{"_name":"BankAccount","_value":"135790"},{"_name":"Balance","_value": "200$"}]""")).toDF("id","data")

Print Schema of data

scala> df.printSchema
root
 |-- id: integer (nullable = false)
 |-- data: string (nullable = true)

Showing Sample Data

scala> df.show(false)
+---+--------------------------------------------------------------------------------+
|id |data                                                                            |
+---+--------------------------------------------------------------------------------+
|1  |[{"_name":"BankAccount","_value":"123456"},{"_name":"Balance","_value": "500$"}]|
|2  |[{"_name":"BankAccount","_value":"098765"},{"_name":"Balance","_value": "100$"}]|
|3  |[{"_name":"BankAccount","_value":"135790"},{"_name":"Balance","_value": "200$"}]|
+---+--------------------------------------------------------------------------------+

Creating schema for json data

scala> val schema = ArrayType(MapType(StringType,StringType))

Using explode, groupBy & pivot to get expected result.

Note - You may need to tune below code little bit as per your requirement.

scala> 

df
.withColumn("data",explode(from_json($"data",schema)))
.select($"id",struct($"data"("_name").as("key"),$"data"("_value").as("value")).as("data"))
.select($"id",$"data.*")
.groupBy($"id")
.pivot($"key")
.agg(first($"value"))
.select("id","BankAccount","Balance")
.orderBy($"id".asc)
.show(false)

Final Result

+---+-----------+-------+
|id |BankAccount|Balance|
+---+-----------+-------+
|1  |123456     |500$   |
|2  |098765     |100$   |
|3  |135790     |200$   |
+---+-----------+-------+

Exploding column of JSON array

Answers (1)

Related Questions