In the first file, nested1.json, the JSON element "image" is nested. The data is structured like this:
{"id": "0001", "type": "donut", "name": "Cake", "image":{"url": "images/0001.jpg", "width": 200, "height": 200}}
The resulting schema is correctly inferred by Spark:
val df1 = spark.read.json("/xxx/xxxx/xxxx/nested1.json")
df1.printSchema
root
|-- id: string (nullable = true)
|-- image: struct (nullable = true)
| |-- height: long (nullable = true)
| |-- url: string (nullable = true)
| |-- width: long (nullable = true)
|-- name: string (nullable = true)
|-- type: string (nullable = true)
The file nested2.json contains a mixture of nested and non-nested elements (in the second record below, the element "image" is not nested):
{"id": "0001", "type": "donut", "name": "Cake", "image":{"url": "images/0001.jpg", "width": 200, "height": 200}}
{"id": "0002", "type": "donut", "name": "CupCake", "image": "images/0001.jpg"}
The resulting schema does not contain the nested data:
val df2 = spark.read.json("/xxx/xxx/xxx/nested2.json")
df2.printSchema
root
|-- id: string (nullable = true)
|-- image: string (nullable = true)
|-- name: string (nullable = true)
|-- type: string (nullable = true)
Why is Spark not able to figure out the schema when non-nested elements are present?
How can I process a JSON file containing a mixture of records like this using Spark and Scala?
By not specifying a schema, Spark will infer it from the data source by reading the data once. Looking at the source code for the inference of JSON data, there is a comment in the code:
Infer the type of a collection of json records in three stages:
- Infer the type of each record
- Merge types by choosing the lowest type necessary to cover equal keys
- Replace any remaining null fields with string, the top type
In other words, when records disagree, only the most general data type that covers all of them is returned. A nested struct and a plain string can only be reconciled as a string, so for the "image" field the un-nested type wins.
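This merge behavior is easy to observe directly. A minimal sketch (assuming Spark 2.2+, where spark.read.json accepts a Dataset[String]) that runs inference over the two conflicting record shapes:
import spark.implicits._

// Infer a schema over two records whose "image" shapes conflict.
val mixed = Seq(
  """{"image": {"url": "images/0001.jpg"}}""",
  """{"image": "images/0001.jpg"}"""
).toDS()

spark.read.json(mixed).printSchema()
// root
//  |-- image: string (nullable = true)   <-- struct and string merge to string, the top type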
Instead of inferring the schema, create it yourself and supply it to the reader.
import org.apache.spark.sql.types._

val schema = StructType(
  StructField("id", StringType, true) ::
  StructField("type", StringType, true) ::
  StructField("name", StringType, true) ::
  StructField("image", StructType(
    StructField("height", LongType, true) ::
    StructField("url", StringType, true) ::
    StructField("width", LongType, true) ::
    Nil)) ::
  Nil)
Then use the schema when reading:
val df2 = spark.read
.schema(schema)
.json("/xxx/xxx/xxx/nested2.json")
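As a side note, on Spark 2.3+ the same schema can also be supplied as a DDL string, which is often easier to read. A sketch equivalent to the StructType above:
val df2b = spark.read
  .schema("id STRING, type STRING, name STRING, image STRUCT<height: BIGINT, url: STRING, width: BIGINT>")
  .json("/xxx/xxx/xxx/nested2.json")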
However, this approach will result in null for the "image" field when it is simply a string. To get both types, read the file twice and merge the results.
val df3 = spark.read
.json("/xxx/xxx/xxx/nested2.json")
The new dataframe has a different schema than the first. Let's make the schemas equal and then merge the dataframes together:
import org.apache.spark.sql.functions.{lit, struct}
import spark.implicits._

val df4 = df3
  .withColumn("image", struct(
    lit(0L).as("height"),   // placeholder values, listed in df2's field order
    $"image".as("url"),
    lit(0L).as("width")))
  .select("id", "type", "name", "image")
val df5 = df2.union(df4)
The last select is there to make sure the columns are in the same order as in df2; the struct fields are likewise listed in df2's order (height, url, width). union matches columns and struct fields by position, so the union will fail otherwise.
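A quick sanity check on the merged result (a sketch; the exact rows depend on the input file):
df5.printSchema()   // image is struct<height, url, width> in every row
df5.show(false)     // rows read with the explicit schema keep a null image where it was a plain string
Note that every input record appears once in each half of the union (once per read of the file), so some filtering of df5 may be wanted depending on the use case.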