Reputation: 102
I'm ingesting JSON files into Spark, and I have come across the following object in the nested JSON:
"data": {
"key1" :"v1"
"key2" : [
{"nk1" :"nv1"},
{"nk2" :"nv2" },
{"nk3" :"nv3" }
]
}
After Spark reads it, it comes out in the following format:
"data": {
"key1" :"v1"
"key2" : [
{"nk1" :"nv1", "nk2" :null, "nk3" :null},
{"nk1" :null, "nk2" :"nv2", "nk3" :null},
{"nk1" :null, "nk2" :null, "nk3" :"nv3"}
]
}
I need these values as columns in the Spark dataframe:

"key1" | "nk1" | "nk2" | "nk3" |
---|---|---|---|
"v1" | "nv1" | "nv2" | "nv3" |
Please help me with a solution for this. I'm thinking of converting it to a string and using regex. Is there a better approach?
Upvotes: 0
Views: 3147
Reputation: 42342
You can explode the array and pivot on `key2`'s map keys (this assumes `key2` is read as `array<map<string,string>>`, e.g. via an explicit schema, rather than the inferred struct type):
import pyspark.sql.functions as F

df2 = df.select(
    F.col('data.key1').alias('key1'),
    F.explode('data.key2').alias('key2')
).select(
    'key1',
    F.map_keys('key2')[0].alias('key'),
    F.map_values('key2')[0].alias('val')
).groupBy('key1').pivot('key').agg(F.first('val'))

df2.show()
+----+---+---+---+
|key1|nk1|nk2|nk3|
+----+---+---+---+
| v1|nv1|nv2|nv3|
+----+---+---+---+
Upvotes: 1