Andy McWilliams

Reputation: 143

PySpark to Identify Orphaned Records and Update Target Table Based on Parent ID

I'm unsure of the best approach for a specific merge scenario where I need to identify deleted rows.

I'm merging into a Delta table whose rows have a child_id (the primary key) and a parent_id. Multiple rows can share the same parent_id.

The data I want to merge into this table has the same shape. However, it's possible that the incoming data contains fewer rows for a particular parent_id, so I want to identify the rows that went missing and mark them as soft deleted via a column 'soft_deleted' = 'yes'.

In other words, I need to identify rows in the target table that share a parent_id with the incoming view but do not exist in it, and update their 'soft_deleted' flag to 'yes'.
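For example (illustrative data): if the target has child rows A, B and C for parent_id 1, but the incoming data only contains A and B for that parent, then C should end up with soft_deleted = 'yes'.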

My approach was to first left_semi join the target table against my incoming data on parent_id,

then left_anti join the resulting DataFrame against my incoming data on child_id.

The result of this would be a dataframe containing every row in the target that should be soft deleted.
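Roughly, the join step looks like this in PySpark (target_df and incoming_df are just placeholders for my actual table and view):

```python
# target_df   = current contents of the Delta table (placeholder name)
# incoming_df = the new data / view being merged in (placeholder name)

# Keep only target rows whose parent_id appears in the incoming data
candidates = target_df.join(
    incoming_df.select("parent_id").distinct(),
    on="parent_id",
    how="left_semi",
)

# Of those, keep the rows whose child_id is NOT in the incoming data:
# these are the rows that should be soft deleted
to_soft_delete = candidates.join(
    incoming_df.select("child_id"),
    on="child_id",
    how="left_anti",
)
```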

I set 'soft_deleted' = 'yes' on the rows in this DataFrame, then union it onto my view DataFrame.

This is then merged into the target table.
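Put together with the merge, it looks something like this (the table name and the soft_deleted values are illustrative, and I'm assuming the incoming view also carries a soft_deleted column so the union lines up):

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Flag the orphaned target rows, and give the incoming rows the same column
flagged = to_soft_delete.withColumn("soft_deleted", F.lit("yes"))
incoming_flagged = incoming_df.withColumn("soft_deleted", F.lit("no"))

merge_source = incoming_flagged.unionByName(flagged)

# Merge everything back into the Delta table on the primary key
target = DeltaTable.forName(spark, "my_schema.my_table")  # placeholder table name
(
    target.alias("t")
    .merge(merge_source.alias("s"), "t.child_id = s.child_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```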

Is this a viable long-term strategy, or would these joins suffer a dramatic decrease in performance if the Delta table grew to hundreds of millions of rows?

Upvotes: 0

Views: 22

Answers (0)
