Reputation: 21
I have a problem inserting data from a DataFrame into a Delta table that has an identity column, using PySpark. It either fails with a schema mismatch error if I don't include the column in the DataFrame, or complains that I cannot insert into the "generated always as identity" column. In SQL I would simply leave the column out of the column list when inserting. How do I handle this in PySpark?
A minimal example of the setup looks like this:
create table sample.table (
id bigint generated always as identity (start with 1 increment by 1),
name string,
address string
)
using delta
df = df.select("name", "address")
df.write.format("delta").mode("overwrite").saveAsTable("sample.table")
This fails because of the schema mismatch. It also fails if I do it like df = df.select(lit(None).alias("id"), "name", "address").
Errors I'm getting:
Providing values for GENERATED ALWAYS AS IDENTITY column pk_part_current is not supported.
AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: b83f1234-a178-486f-be53-2478cb4a1234). To enable schema migration using DataFrameWriter or DataStreamWriter, please set: '.option("mergeSchema", "true")'. For other operations, set the session configuration spark.databricks.delta.schema.autoMerge.enabled to "true". See the documentation specific to the operation for details.
Table schema:
root
-- id: long (nullable = false)
-- name: string (nullable = true)
-- address: string (nullable = true)
Data schema:
root
-- name: string (nullable = true)
-- address: string (nullable = true)
To overwrite your schema or change partitioning, please set: '.option("overwriteSchema", "true")'.
Note that the schema can't be overwritten when using 'replaceWhere'.
I would appreciate your help.
Upvotes: 1
Views: 1039
Reputation: 21
I've used option("mergeSchema", "true") and it more or less bypassed the issue. It should be used prudently, though, as a simple typo can silently alter the target table's schema. Combined with the current lack of column-drop functionality (available in Public Preview at the moment), that can get you into trouble.
Upvotes: 1