Reputation: 11
In Spark/Scala, I need to create a DataFrame from a JSON file with a nested structure
Issue scenario:
I have JSON input with a complex nested structure. The keys are optional, so on any given day some keys may be missing from every record: a key absent on day 1 may appear on day 2. Even so, I expect a generic output with all of the expected columns, even when keys are missing. I can't use the withColumn function to apply a default value, because when a key is present on a given day its actual value should be taken. If I do a select, it fails with an "unable to resolve" error, since the key may not be present at all on some day. Please advise me of any solution.
Upvotes: 0
Views: 280
Reputation: 6964
This is a very common problem in data ingestion. Most data requires schema evolution, i.e. the schema changes over time.
There are essentially two options.
Pass the schema while reading the DataFrame: this works well when you know the superset of all the schemas. Spark will fill the columns missing from one day's data with NULL.
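For example, a minimal sketch of the first option, assuming a hypothetical superset schema with top-level keys id and name plus an optional nested address struct:

import org.apache.spark.sql.types._

// Hypothetical superset schema covering every key that can ever appear.
// All fields are nullable by default, so missing keys read as NULL.
val supersetSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("name", StringType),
  StructField("address", StructType(Seq(   // optional nested struct
    StructField("city", StringType),
    StructField("zip", StringType)
  )))
))

// Columns absent from a given day's file come back as NULL
val df = spark.read.schema(supersetSchema).json("/path/day1")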
Evolve the schema using Spark schema merging: Spark merges schemas by default when inferring them from JSON. You can union the existing snapshot with the incoming delta and read the result back as JSON:
val df1 = spark.read.json("/path/snapshot")  // existing snapshot
val df2 = spark.read.json("/path/delta")     // incoming delta

// Serialize both back to JSON strings and let Spark re-infer
// a single schema over the union of the two datasets
val merged = spark.read.json(df1.toJSON.union(df2.toJSON))
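The merged DataFrame then carries the superset of both schemas, so selecting an optional column should no longer fail. A minimal usage sketch (the column names id and address.city are hypothetical):

merged.printSchema()                  // shows the union of both schemas
merged.select("id", "address.city")   // resolves even if the key was
  .show()                             // missing from one of the inputs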
Upvotes: 1