harshini
harshini

Reputation: 97

Update only changed rows pyspark delta table databricks

There is a requirement to update only changed rows in an existing table compared to the created dataframe. So rightnow , i do subtract and get the changed rows, but not sure how to merge into existing table.

old_df = spark.sql("select * from existing table")
diff = new_df.subtract(old_df)

diff dataframe has to be now inserted(if new rows) or update existing records

(deltaTable.alias("full_df").merge(
    merge_df.alias("append_df"),
    "full_df.col1 = append_df.col1 OR full_df.col2 =append_df.col2") 
  .whenNotMatchedInsertAll() 
  .execute()
)

This is not updating existing records (case: col2 value changed ; col1 not changed)

Upvotes: 3

Views: 10928

Answers (1)

marat
marat

Reputation: 1387

.whenMatchedUpdateAll() accepts a condition which you can use to preserve unchanged rows:

(
  deltaTable
  .alias("full_df")
  .merge(
    merge_df.alias("append_df"),
    "full_df.col1 = append_df.col1 OR full_df.col2 = append_df.col2") 
  .whenNotMatchedInsertAll()
  .whenMatchedUpdateAll("full_df.col1 != append_df.col1 OR full_df.col2 != append_df.col2")
  .execute()
)

Upvotes: 7

Related Questions