SadamHussain M

Reputation: 11

Handling a JSON file in Spark

In Spark with Scala, I need to create a DataFrame from a JSON file with a nested structure.

Issue scenario:

I have a JSON input with a complex nested structure. Every day there is a chance that some of the keys will not be present on any of the records (the keys are optional): a key may be missing on day 1 and present on day 2. Still, I expect a generic output with all of the expected columns, even when keys are missing. I can't use the withColumn function to apply a default value, because when a key is present on a given day its actual value should be taken. And if I do a select, it fails with an "unable to resolve" error, since the key may not be present on any record that day. Please advise me of any solution.
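For example, a minimal sketch of what fails (the field names here are hypothetical, with `address` standing in for one of the optional keys):

    import spark.implicits._

    // Hypothetical day-1 input: the optional "address" key is absent on every record
    val day1 = spark.read.json(Seq("""{"id": 1, "name": "a"}""").toDS())

    // Fails with AnalysisException: cannot resolve 'address'
    day1.select("id", "address.city")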

Upvotes: 0

Views: 280

Answers (1)

Avishek Bhattacharya

Reputation: 6964

This is a very common problem in data ingestion. Most data requires schema evolution, i.e. the schema changes over time.

There are essentially two options.

  1. Pass the schema while reading the DataFrame : This works well when you know the superset of all the schemas. Spark will fill the missing columns in a given day's data with NULL (see the sketch after the code below).

  2. Evolve the schema using Spark schema merging : Spark does schema merging by default. You can union the existing snapshot with the incoming delta and read the result back as JSON.

    // Read the current snapshot and the new day's delta
    val df1 = spark.read.json("/path/snapshot")
    val df2 = spark.read.json("/path/delta")

    // Re-read the union as JSON lines so Spark infers the merged schema
    spark.read.json(df1.toJSON.union(df2.toJSON))
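Going back to option 1, here is a minimal sketch, assuming a hypothetical superset schema in which the nested `address` struct is one of the optional keys:

    import org.apache.spark.sql.types._

    // Superset schema listing every key that can ever appear (field names are hypothetical)
    val supersetSchema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType),
      StructField("address", StructType(Seq(
        StructField("city", StringType),
        StructField("zip", StringType)
      )))
    ))

    // Keys absent from that day's records come back as NULL,
    // so the select no longer fails with "unable to resolve"
    val df = spark.read.schema(supersetSchema).json("/path/day1")
    df.select("id", "address.city").show()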
    

Upvotes: 1
