DR_S
DR_S

Reputation: 102

Pyspark JSON array of objects into columns

Im ingesting JSON files into spark and i have come across an object as below in the nested JSON from the file

"data": {
  "key1" :"v1" 
  "key2" : [
     {"nk1" :"nv1"}, 
     {"nk2" :"nv2" }, 
     {"nk3" :"nv3" } 
  ] 
}

After reading it in spark, it is changing into below format:

"data": {
  "key1" :"v1" 
  "key2" : [
     {"nk1" :"nv1", "nk2" :null, "nk3" :null}, 
     {"nk1" :null, "nk2" :"nv2", "nk3" :null}, 
     {"nk1" :null, "nk2" :null, "nk3" :"nv3"} 
  ] 
}

I need them as columns in the spark dataframe

"key1" "nk1" "nk2" "nk3"
"v1" "kv1" "kv2" "kv3"

Please help me with any solution for this. I'm thinking to convert this to string and use regex. Is there any better solution?

Upvotes: 0

Views: 3147

Answers (1)

mck
mck

Reputation: 42342

You can explode the array and pivot key2:

import pyspark.sql.functions as F

df2 = df.select(
    F.col('data.key1').alias('key1'), 
    F.explode('data.key2').alias('key2')
).select(
    'key1', 
    F.map_keys('key2')[0].alias('key'), 
    F.map_values('key2')[0].alias('val')
).groupBy('key1').pivot('key').agg(F.first('val'))

df2.show()
+----+---+---+---+
|key1|nk1|nk2|nk3|
+----+---+---+---+
|  v1|nv1|nv2|nv3|
+----+---+---+---+

Upvotes: 1

Related Questions