Reputation: 145
I have a requirement to implement a SCD Type 2 in my Delta Tables. The Scenario is as under.
Source Table Columns are:
---------------------------------------------------
state. Code. Name. value. insert_datetime
---------------------------------------------------
TX. AX transport 2.2 2023-01-01
AL BB Mining 1.1 2023-01-03
BY XC Infras. 4.1 2022-06-06
JJ NR Educ 5.6 2023-08-10
Destination Table (Silver) is as under:
--------------------------------------------------------------------------------
state. Code. Name. value. insert_datetime. is_current. ExpiryDate
--------------------------------------------------------------------------------
TX. AX transport 2.2 2023-08-14 Y 9999-12-31
AL BB Mining 1.1 2023-08-14. Y 9999-12-31
BY XC Infras. 4.4 2023-08-14. N 2023-08-14
BY XC Infras. 4.1 2023-08-14. Y 9999-12-31
JJ NR Educ 5.6 2023-08-14 Y 9999-12-31
What i want to achieve is to Insert the new records, delete the missing records and update the existing records marking the previous one as inactive and inserting the new one (as for state BY).
I want to achieve this using either Python or PySpark on a databricks platform and have used a few approaches but being new to SDC Type 2 i couldn't figure out how to achieve this.
All examples i have seen show a surrogate key and a merge key, One of my questions is what is the need of a separate surrogate key and merge_key.
I am thinking of creating a surrogate key based upon my key columns [state, code, name, value] in both dataframes (source and target) but i am not sure how to achieve the results end to end.
Upvotes: 0
Views: 1525
Reputation: 2043
If you're using Delta Lake version >= 2.0.0, why don't consider using the Change Data Feed (CDF) features to achieve your goal?
By enabling the table with CDF feature, you can read the table in latest snapshot or read all the historical change log by setting the readChangeFeed
and startingVersion
or startingTimestamp
.
Although it's not exactly the same of SCD Type 2, I think they are similar:
_change_type
metadata to indicate if the record is in insert
, update
or delete
status._commit_timestamp
metadata to indicate when the record is updated._commit_version
metadata to indicate which is the latest updated record.You can try it by using {state, code, name} as your merging key. You can check the example in the Databricks blog: https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html
Upvotes: 0