Reputation: 21
I have a problem inserting data from a DataFrame into a Delta table that has an identity column, using PySpark. It either fails with a schema mismatch error if I don't include the column in the DataFrame, or complains that I cannot insert into the "generated always as identity" column. In SQL I would simply leave the column out of the column list when inserting. How do I handle this in PySpark?
A minimal example of the setup looks like this:
create table sample.table (
id bigint generated always as identity (start with 1 increment by 1),
name string,
address string
)
using delta
df = df.select("name", "address")
df.write.format("delta").mode("overwrite").saveAsTable("sample.table")
This fails because of the schema mismatch. It also fails if I do it like df = df.select(lit(None).alias("id"), "name", "address").
Errors I'm getting:
Providing values for GENERATED ALWAYS AS IDENTITY column pk_part_current is not supported.
AnalysisException: A schema mismatch detected when writing to the Delta table (Table ID: b83f1234-a178-486f-be53-2478cb4a1234). To enable schema migration using DataFrameWriter or DataStreamWriter, please set: '.option("mergeSchema", "true")'. For other operations, set the session configuration spark.databricks.delta.schema.autoMerge.enabled to "true". See the documentation specific to the operation for details.
Table schema:
root
-- id: long (nullable = false)
-- name: string (nullable = true)
-- address: string (nullable = true)
Data schema:
root
-- name: string (nullable = true)
-- address: string (nullable = true)
To overwrite your schema or change partitioning, please set: '.option("overwriteSchema", "true")'.
Note that the schema can't be overwritten when using 'replaceWhere'.
I would appreciate your help.
Upvotes: 1
Views: 1039
Reputation: 21
I've used option("mergeSchema", "true") and it more or less bypassed the issue. It should be used prudently, though, as a simple typo can silently alter the target table's schema. Combined with the current lack of column-drop functionality (available in Public Preview at the moment), that can get you into trouble.
Upvotes: 1