Reputation: 143
I'm unsure of the best approach for a specific merge scenario where I need to identify deleted rows.
I'm merging into a Delta table whose rows have a child_id (the primary key) and a parent_id. Multiple rows can share the same parent_id.
The data I wish to merge in has the same shape, but it may contain fewer rows for a particular parent_id than the target does. I want to identify those now-missing rows in the target and mark them as soft deleted by setting a column 'soft_deleted' = 'yes'.
In other words, I need to find rows in the target table that share a parent_id with the incoming view but do not themselves exist in it, and update their 'soft_deleted' flag to 'yes'.
My approach is to first left_semi join the target table against my incoming data on parent_id;
then left_anti join the resultant df against my data on child_id.
The result of this would be a dataframe containing every row in the target that should be soft deleted.
I set 'soft_deleted' = 'yes' on the rows in this dataframe, then union them onto my view df.
This is then merged into the target table.
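To make the two-step logic concrete, here is a minimal pure-Python sketch of the semi/anti join semantics with made-up sample data (in Spark these would be `df.join(..., how="left_semi")` and `how="left_anti")` on DataFrames; the tuples and values below are entirely hypothetical):

```python
# Hypothetical rows as (child_id, parent_id) pairs.
target = [            # existing Delta table rows
    (1, 100), (2, 100), (3, 100),
    (4, 200), (5, 200),
    (6, 300),
]
incoming = [          # rows being merged in
    (1, 100), (2, 100),   # child 3 is missing for parent 100
    (4, 200), (5, 200),   # parent 200 is unchanged
    # parent 300 is absent entirely, so its rows are left untouched
]

incoming_parents = {p for _, p in incoming}
incoming_children = {c for c, _ in incoming}

# Step 1 (left_semi on parent_id): keep target rows whose parent_id
# appears in the incoming data.
step1 = [(c, p) for c, p in target if p in incoming_parents]

# Step 2 (left_anti on child_id): of those, keep rows whose child_id
# does NOT appear in the incoming data -- these get soft deleted.
to_soft_delete = [(c, p) for c, p in step1 if c not in incoming_children]

print(to_soft_delete)  # [(3, 100)]
```

Note that the initial semi join means parents absent from the incoming batch (parent 300 above) are never touched, which matches the stated intent.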
Is this a viable long-term strategy, or would these joins degrade dramatically in performance if the Delta table grew to hundreds of millions of rows?
Upvotes: 0
Views: 22