mnr

Reputation: 55

How to parse json with mixed nested and non-nested structure?

In file 1 (nested1.json), the JSON element "image" is nested. The data is structured like this:

{"id": "0001", "type": "donut", "name": "Cake", "image":{"url": "images/0001.jpg", "width": 200, "height": 200}}

The resulting schema is correctly inferred by Spark:

val df1 = spark.read.json("/xxx/xxxx/xxxx/nested1.json")
df1.printSchema

root
 |-- id: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- height: long (nullable = true)
 |    |-- url: string (nullable = true)
 |    |-- width: long (nullable = true)
 |-- name: string (nullable = true)
 |-- type: string (nullable = true)

File nested2.json contains a mix of nested and non-nested elements (in the second line below, the element "image" is not nested):

{"id": "0001", "type": "donut", "name": "Cake", "image":{"url": "images/0001.jpg", "width": 200, "height": 200}}
{"id": "0002", "type": "donut", "name": "CupCake", "image": "images/0001.jpg"}

The resulting schema does not contain the nested data:

val df2 = spark.read.json("/xxx/xxx/xxx/nested2.json")
df2.printSchema

root
 |-- id: string (nullable = true)
 |-- image: string (nullable = true)
 |-- name: string (nullable = true)
 |-- type: string (nullable = true)

Why is Spark not able to figure out the schema when non-nested elements are present?

How to process a JSON file containing a mixture of records like this using Spark and Scala?

Upvotes: 1

Views: 961

Answers (1)

Shaido

Reputation: 28322

When no schema is specified, Spark infers one from the data source by reading the data once. Looking at the source code for JSON schema inference, there is a comment describing the process:

Infer the type of a collection of json records in three stages:

    1. Infer the type of each record
    2. Merge types by choosing the lowest type necessary to cover equal keys
    3. Replace any remaining null fields with string, the top type

In other words, only the most general data type is returned when there are differences. When nested and non-nested values are mixed for the same field, the inference falls back to string, the non-nested type, since that is the type that covers all rows.
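This fallback is easy to reproduce. A minimal sketch (assuming an active SparkSession named spark, spark.implicits._ in scope, and the Dataset[String] overload of spark.read.json), where the same field is a number in one record and a string in another:

import spark.implicits._

// Two records where "value" has different types.
val mixed = Seq(
  """{"value": 1}""",
  """{"value": "one"}"""
).toDS()

// Inference merges long and string into string, the most general type.
spark.read.json(mixed).printSchema

root
 |-- value: string (nullable = true)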


Instead of inferring the schema, create it yourself and supply it to the reader.

import org.apache.spark.sql.types._

val schema = StructType(
  StructField("id", StringType, true) ::
  StructField("type", StringType, true) ::
  StructField("name", StringType, true) ::
  StructField(
    "image",
    StructType(
      StructField("height", LongType, true) ::
      StructField("url", StringType, true) ::
      StructField("width", LongType, true) ::
      Nil
    ),
    true
  ) :: Nil
)

Then use the schema when reading:

val df2 = spark.read
  .schema(schema)
  .json("/xxx/xxx/xxx/nested2.json")
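
As a quick check (just inspecting the df2 created above), the schema is now exactly the one supplied instead of an inferred one:

df2.printSchema

root
 |-- id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- image: struct (nullable = true)
 |    |-- height: long (nullable = true)
 |    |-- url: string (nullable = true)
 |    |-- width: long (nullable = true)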

However, this approach will result in null for the "image" field when it is simply a string. To get both types, read the file twice, once with the explicit schema and once with the inferred one, and merge the results.

val df3 = spark.read
  .json("/xxx/xxx/xxx/nested2.json")

The new dataframe has a different schema than the first. Let's make them equal and then merge the dataframes together:

import org.apache.spark.sql.functions.{lit, struct}
import spark.implicits._

val df4 = df3.withColumn("image",
    struct(lit(0L).as("height"),  // field order must match the schema defined above,
           $"image".as("url"),    // otherwise the union fails on mismatched struct types
           lit(0L).as("width")))
  .select("id", "type", "name", "image")

val df5 = df2.union(df4)

The last select is to make sure the columns are in the same order as df2, otherwise the union will fail.
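
As a side note, on Spark 2.3 and later, unionByName resolves the top-level columns by name instead of position, so the trailing select can be dropped (a sketch, assuming the df2 and df4 built above):

// Columns are matched by name; the struct fields inside "image" must still line up.
val df5 = df2.unionByName(df4)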

Upvotes: 1
