Schema mismatch - Spark DataFrame written to Delta

Question

When writing a DataFrame in Delta format, the resulting Delta table does not seem to follow the schema of the DataFrame that was written. Specifically, the 'nullable' property of every field is always 'true' in the resulting table, regardless of the source DataFrame's schema. Is this expected, or am I making a mistake here? Is there a way to make the schema of the written table match the source DataFrame exactly?

scala> df.schema
res2: org.apache.spark.sql.types.StructType = StructType(StructField(device_id,StringType,false), StructField(val1,StringType,true), StructField(val2,StringType,false), StructField(dt,StringType,true))

scala> df.write.format("delta").save("D:/temp/d1")

scala> spark.read.format("delta").load("D:/temp/d1").schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(device_id,StringType,true), StructField(val1,StringType,true), StructField(val2,StringType,true), StructField(dt,StringType,true))

Alfilercio · Accepted Answer

Parquet, the underlying storage format of Delta Lake, cannot guarantee the nullability of a column.

You may have written a Parquet file whose column certainly contains no nulls, but Parquet never validates the schema on write, and anyone could later append data with the same schema but containing nulls. Spark therefore always marks the columns as nullable, as a precaution.
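
To see the same effect with plain Parquet, here is a minimal sketch that can be pasted into spark-shell. The column names are borrowed from the question; the temporary output directory and the app name are made up for illustration. It round-trips a DataFrame with a non-nullable column and compares the flag before and after.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// In spark-shell this simply returns the existing session.
val spark = SparkSession.builder().master("local[1]").appName("nullable-demo").getOrCreate()

// device_id is declared non-nullable, val1 nullable.
val schema = StructType(Seq(
  StructField("device_id", StringType, nullable = false),
  StructField("val1", StringType, nullable = true)
))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("d1", "a"), Row("d2", null))),
  schema
)

// Hypothetical scratch location; any empty path works.
val path = Files.createTempDirectory("nullable-demo").resolve("out").toString
df.write.parquet(path)
val readBack = spark.read.parquet(path)

println(df.schema("device_id").nullable)       // false: the flag we declared
println(readBack.schema("device_id").nullable) // true: relaxed when the files are read back
```

The data itself is unchanged; only the nullability flag in the schema is relaxed.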

This behavior can be prevented by using a catalog, which will validate that the DataFrame follows the expected schema.
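
One way this might look, as a sketch only: declare the table in the catalog with NOT NULL columns, then append to it by name. This assumes spark-shell was started with the Delta Lake package and its SQL extension and catalog configured. The table name devices_demo and the sample rows are hypothetical.

```scala
import scala.util.Try
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: `spark` is a session with Delta Lake enabled.
// Declare the expected schema, including NOT NULL constraints, up front.
spark.sql("""
  CREATE TABLE IF NOT EXISTS devices_demo (
    device_id STRING NOT NULL,
    val1 STRING,
    val2 STRING NOT NULL,
    dt STRING
  ) USING DELTA
""")

// Helper (illustrative only): build a DataFrame whose columns are all nullable.
val looseSchema = StructType(
  Seq("device_id", "val1", "val2", "dt").map(StructField(_, StringType, nullable = true))
)
def rows(rs: Row*): DataFrame =
  spark.createDataFrame(spark.sparkContext.parallelize(rs), looseSchema)

// A row that satisfies the constraints is appended normally.
rows(Row("d1", "a", "b", "2020-01-01"))
  .write.format("delta").mode("append").saveAsTable("devices_demo")

// A row with a null in a NOT NULL column should be rejected at write time.
val rejected = Try(
  rows(Row(null, "a", "b", "2020-01-01"))
    .write.format("delta").mode("append").saveAsTable("devices_demo")
)
println(rejected.isFailure)
```

With the constraint stored in the table definition, the check happens on every write rather than relying on the nullability flags of whichever DataFrame is being appended.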
