NITS

Reputation: 237

Schema mismatch - Spark DataFrame written to Delta

When I write a DataFrame in Delta format, the resulting Delta table does not seem to follow the schema of the DataFrame that was written. Specifically, the 'nullable' property of every field is 'true' in the resulting table, regardless of the source DataFrame schema. Is this expected, or am I making a mistake here? Is there a way to get the schema of the written Delta table to match the source df exactly?

scala> df.schema
res2: org.apache.spark.sql.types.StructType = StructType(StructField(device_id,StringType,false), StructField(val1,StringType,true), StructField(val2,StringType,false), StructField(dt,StringType,true))

scala> df.write.format("delta").save("D:/temp/d1")

scala> spark.read.format("delta").load("D:/temp/d1").schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(device_id,StringType,true), StructField(val1,StringType,true), StructField(val2,StringType,true), StructField(dt,StringType,true))

Upvotes: 1

Views: 2945

Answers (2)

user3939554

Reputation:

The problem is that a lot of users thought that their schema was not nullable, and wrote null data. Then they couldn't read the data back as their parquet files were corrupted. In order to avoid this, we always assume the table schema is nullable in Delta. In Spark 3.0, when creating a table, you will be able to specify columns as NOT NULL. This way, Delta will actually prevent null values from being written, because Delta will check that the columns are in fact not null when writing it.
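
A minimal sketch of the NOT NULL approach described above, assuming a spark-shell session on Spark 3.x with Delta Lake configured and `df` being the DataFrame from the question. The table name `devices` is hypothetical; the column names come from the question's schema.

```scala
// Sketch only: assumes `spark` is a SparkSession with Delta Lake enabled
// and `df` is the DataFrame from the question.
// The table name `devices` is made up for illustration.
spark.sql("""
  CREATE TABLE devices (
    device_id STRING NOT NULL,
    val1      STRING,
    val2      STRING NOT NULL,
    dt        STRING
  ) USING DELTA
""")

// Append to the table by name so the write is checked against the declared columns.
df.write.format("delta").mode("append").saveAsTable("devices")

// Inspect the nullability recorded for each column of the table.
spark.table("devices").schema.foreach { f =>
  println(s"${f.name}: nullable=${f.nullable}")
}
```

With the columns declared this way, an append that contains a null in `device_id` or `val2` should be rejected by Delta rather than silently written.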

Upvotes: 1

Alfilercio

Reputation: 1118

Parquet, the underlying storage format of Delta Lake, can't guarantee the nullability of a column.

You may have written a Parquet file that certainly contains no nulls, but Parquet never validates the schema on write, and anyone could later append data with the same schema but with nulls in it. So Spark always marks the columns as nullable, just as a precaution.

This behavior can be avoided by using a catalog, which will validate that the DataFrame follows the expected schema.
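
One way to register such a catalog table from Scala, sketched under the assumption that Delta Lake 1.0 or later is on the classpath (the `DeltaTable.create` builder is not available in older releases). The table name `devices_catalog` is hypothetical, and `df` is the DataFrame from the question.

```scala
import io.delta.tables.DeltaTable

// Sketch only: create a catalog table whose columns, including their
// nullable flags, are taken from the DataFrame's own schema.
// `devices_catalog` is a made-up name for illustration.
DeltaTable.create(spark)
  .tableName("devices_catalog")
  .addColumns(df.schema)
  .execute()

// Write through the catalog table name instead of a bare path.
df.write.format("delta").mode("append").saveAsTable("devices_catalog")

// Compare the table's schema with the source DataFrame's schema.
println(spark.table("devices_catalog").schema == df.schema)
```

Writing through the table name rather than a raw path is what lets the declared schema take part in the write; a path-based `save` on a new location, as in the question, has no prior declaration to check against.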

Upvotes: 3
