adelinor

Reputation: 813

Databricks schema enforcement issues

As the article about schema enforcement suggests, a declared schema helps detect issues early.

However, the two issues described below are preventing me from declaring a descriptive schema.

Comments on a table column are seen as a difference in the schema

# Get data
test_df = spark.createDataFrame([('100000146710',)], ['code'])
# ... save
test_df.write.format("delta").mode("append").save('/my_table_location')

# Create table: ... BOOM
spark.sql("""
   CREATE TABLE IF NOT EXISTS my_table (
       code STRING COMMENT 'Unique identifier'
   ) USING DELTA LOCATION '/my_table_location'
""")

This fails with AnalysisException: The specified schema does not match the existing schema at /my_table_location. The only workaround I found is to drop the column comments.

Not null struct field shows as nullable

from pyspark.sql.types import StructType, StructField, StringType

json_schema = StructType([
  StructField("code", StringType(), False)
])

json_df = (spark.read
  .schema(json_schema)
  .json('/my_input.json')
)
json_df.printSchema()

will show

root
  |-- code: string (nullable = true)

So although the schema declares the field as not nullable, it shows as nullable in the DataFrame. Because of this, adding a NOT NULL constraint on the table column triggers the same AnalysisException.

Any comments or suggestions are welcome.

Upvotes: 3

Views: 5772

Answers (1)

Denny Lee

Reputation: 3254

With the execution of

test_df.write.format("delta").mode("append").save('/my_table_location')

you have already created a new Delta table with the schema defined by test_df. This new table, delta.`/my_table_location`, already has the schema code STRING.

If you would like to create a comment within the schema, perhaps first create the table as you would like it defined, e.g.

spark.sql("""
   CREATE TABLE my_table (
       code STRING COMMENT 'unique identifier'
   ) USING DELTA LOCATION '/my_table_location'
""")

And then insert your data from your test_df into it, e.g.

test_df.createOrReplaceTempView("test_df_view")

spark.sql("""
INSERT INTO my_table (code) SELECT code FROM test_df_view
""")

Upvotes: 3
