Reputation: 813
As suggested in the article about schema enforcement, a declared schema helps detecting issues early.
The two issues described below however are preventing me from creating a descriptive schema.
Comments on a table column are seen as a difference in the schema
# Get data
test_df = spark.createDataFrame([('100000146710')], ['code'])
# ... save
test_df.write.format("delta").mode("append").save('/my_table_location')
# Create table: ... BOOM
spark.sql("""
CREATE TABLE IF NOT EXISTS my_table (
code STRING COMMENT 'Unique identifier'
) USING DELTA LOCATION '/my_table_location'
""")
This will fail with AnalysisException: The specified schema does not match the existing schema at /my_table_location . The only solution I found is to drop the columnt comments.
Not null struct field shows as nullable
json_schema = StructType([
StructField("code", StringType(), False)
])
json_df = (spark.read
.schema(json_schema)
.json('/my_input.json')
)
json_df.printSchema()
will show
root
|-- code: string (nullable = true)
So despite the schema declaration stating that a field is not null, the field shows as nullable in the dataframe. Because of this, adding a NOT NULL constraint on the table column will trigger the AnalysisException error.
Any comments or suggestions are welcome.
Upvotes: 3
Views: 5772
Reputation: 3254
With the execution of
test_df.write.format("delta").mode("append").save('/my_table_location')
You have already created a new Delta table with its specific schema as defined by test_df
. This new table delta.`/my_table_location`
already has the schema of code STRING
.
If you would like to create a comment within the schema, perhaps first create the table as you would like it defined, e.g.
spark.sql("""
CREATE TABLE my_table
code STRING COMMENT 'unique identifier'
USING DELTA LOCATION '/my_table_location'
""")
And then insert your data from your test_df into it, e.g.
test_df.createOrReplaceView("test_df_view")
spark.sql("""
INSERT INTO my_table (code) SELECT code FROM test_df_view
""")
Upvotes: 3