Andy McWilliams

Reputation: 143

PySpark to Identify Orphaned Records and Update Target Table Based on Parent ID

I'm unsure of the best approach for a specific merge scenario where I need to identify deleted rows.

I'm merging into a Delta table whose rows have a child_id (the primary key) and a parent_id. Multiple rows can share the same parent_id.

The data I want to merge into this table has the same shape. However, it's possible that the incoming data contains fewer rows for a particular parent_id, so I want to identify the rows that went missing and mark them as soft deleted via a column 'soft_deleted' = 'yes'.

In other words, I need to identify rows in the target table that share a parent_id with the incoming view but do not exist in it, and update their 'soft_deleted' flag to 'yes'.
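For example (illustrative data): if the target has child rows A, B and C for parent_id 1, but the incoming data only contains A and B for that parent, then C should end up with soft_deleted = 'yes'.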

My approach was to first left_semi join the target table against my incoming data on parent_id,

then left_anti join the resulting DataFrame against my incoming data on child_id.

The result of this would be a dataframe containing every row in the target that should be soft deleted.
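Roughly, the join step looks like this in PySpark (target_df and incoming_df are just placeholders for my actual table and view):

```python
# target_df   = current contents of the Delta table (placeholder name)
# incoming_df = the new data / view being merged in (placeholder name)

# Keep only target rows whose parent_id appears in the incoming data
candidates = target_df.join(
    incoming_df.select("parent_id").distinct(),
    on="parent_id",
    how="left_semi",
)

# Of those, keep the rows whose child_id is NOT in the incoming data:
# these are the rows that should be soft deleted
to_soft_delete = candidates.join(
    incoming_df.select("child_id"),
    on="child_id",
    how="left_anti",
)
```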

I set 'soft_deleted' = 'yes' on the rows in this DataFrame, then union it onto my view DataFrame.

This is then merged into the target table.
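Put together with the merge, it looks something like this (the table name and the soft_deleted values are illustrative, and I'm assuming the incoming view also carries a soft_deleted column so the union lines up):

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Flag the orphaned target rows, and give the incoming rows the same column
flagged = to_soft_delete.withColumn("soft_deleted", F.lit("yes"))
incoming_flagged = incoming_df.withColumn("soft_deleted", F.lit("no"))

merge_source = incoming_flagged.unionByName(flagged)

# Merge everything back into the Delta table on the primary key
target = DeltaTable.forName(spark, "my_schema.my_table")  # placeholder table name
(
    target.alias("t")
    .merge(merge_source.alias("s"), "t.child_id = s.child_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```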

Is this a viable long-term strategy, or would these joins suffer a dramatic decrease in performance if the Delta table grew to hundreds of millions of rows?

Upvotes: 0

Views: 22

Answers (0)
