Pysparker

Reputation: 145

Implementing SCD Type 2 with Delta Lake

I have a requirement to implement SCD Type 2 on my Delta tables. The scenario is as follows.

The source table is:

---------------------------------------------------
state   code   name        value   insert_datetime
---------------------------------------------------
TX      AX     transport   2.2     2023-01-01
AL      BB     Mining      1.1     2023-01-03
BY      XC     Infras.     4.1     2022-06-06
JJ      NR     Educ        5.6     2023-08-10

The destination (Silver) table is:

--------------------------------------------------------------------------------
state   code   name        value   insert_datetime   is_current   ExpiryDate
--------------------------------------------------------------------------------
TX      AX     transport   2.2     2023-08-14        Y            9999-12-31
AL      BB     Mining      1.1     2023-08-14        Y            9999-12-31
BY      XC     Infras.     4.4     2023-08-14        N            2023-08-14
BY      XC     Infras.     4.1     2023-08-14        Y            9999-12-31
JJ      NR     Educ        5.6     2023-08-14        Y            9999-12-31

What I want to achieve is to insert the new records, delete the records missing from the source, and update the existing records by marking the previous version as inactive and inserting the new one (as shown for state BY).

I want to achieve this using either Python or PySpark on the Databricks platform. I have tried a few approaches, but being new to SCD Type 2 I couldn't figure out how to make it work.

All the examples I have seen use both a surrogate key and a merge key. One of my questions is: why is a separate surrogate key needed in addition to the merge key?

I am thinking of creating a surrogate key based on my key columns [state, code, name, value] in both dataframes (source and target), roughly as sketched below, but I am not sure how to achieve the results end to end.
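A minimal sketch of that idea (the dataframe names are placeholders, not my actual code):

    from pyspark.sql import functions as F

    key_cols = ["state", "code", "name", "value"]

    # Deterministic surrogate key: SHA-256 of the concatenated key columns,
    # applied the same way to both dataframes so the keys are comparable.
    source_df = source_df.withColumn(
        "surrogate_key", F.sha2(F.concat_ws("||", *key_cols), 256)
    )
    target_df = target_df.withColumn(
        "surrogate_key", F.sha2(F.concat_ws("||", *key_cols), 256)
    )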

Upvotes: 0

Views: 1525

Answers (1)

Jonathan

Reputation: 2043

If you're using Delta Lake version >= 2.0.0, why not consider using the Change Data Feed (CDF) feature to achieve your goal?

Once CDF is enabled on the table, you can either read the latest snapshot as usual or read the full historical change log by setting the readChangeFeed option together with startingVersion or startingTimestamp.
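For example (a minimal sketch; the table name source_table is a placeholder):

    # Enable CDF on an existing Delta table
    spark.sql("""
        ALTER TABLE source_table
        SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
    """)

    # Read the change log from a given version (or use startingTimestamp)
    changes = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 1)
        .table("source_table")
    )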

Although it's not exactly the same as SCD Type 2, I think they are similar:

  1. CDF provides a _change_type metadata column that indicates whether a record was inserted, updated, or deleted.
  2. CDF provides a _commit_timestamp metadata column that indicates when the record was changed.
  3. CDF provides a _commit_version metadata column that lets you identify the latest change for each record (see the sketch below).
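For instance, a minimal sketch that keeps only the most recent change per business key using _commit_version (changes is the change feed dataframe read above; the key columns follow your tables):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # One row per business key: the most recently committed change.
    # update_preimage rows (the pre-update values) are dropped.
    w = Window.partitionBy("state", "code", "name").orderBy(
        F.col("_commit_version").desc()
    )

    latest_changes = (
        changes
        .filter(F.col("_change_type").isin("insert", "update_postimage", "delete"))
        .withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn")
    )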

You can try it by using {state, code, name} as your merge key. You can check the example in the Databricks blog: https://www.databricks.com/blog/2021/06/09/how-to-simplify-cdc-with-delta-lakes-change-data-feed.html
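Putting it together, here is a rough sketch of the SCD Type 2 merge itself, building on latest_changes above. The table name silver_table and the merge_key staging trick are my assumptions, not a definitive implementation:

    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    silver = DeltaTable.forName(spark, "silver_table")  # assumed table name

    upserts = latest_changes.filter(
        F.col("_change_type").isin("insert", "update_postimage")
    )
    updates = latest_changes.filter(F.col("_change_type") == "update_postimage")

    # Null-merge-key trick: each *updated* record is staged twice --
    # once with the real key (matches and expires the current row) and
    # once with a NULL key (never matches, so it is inserted as the new
    # version). Brand-new records are staged once and fall through to
    # the insert clause.
    staged = (
        upserts.withColumn("merge_key", F.concat_ws("|", "state", "code", "name"))
        .unionByName(updates.withColumn("merge_key", F.lit(None).cast("string")))
    )

    (
        silver.alias("t")
        .merge(
            staged.alias("s"),
            "concat_ws('|', t.state, t.code, t.name) = s.merge_key "
            "AND t.is_current = 'Y'"
        )
        .whenMatchedUpdate(set={
            "is_current": "'N'",
            "ExpiryDate": "current_date()"
        })
        .whenNotMatchedInsert(values={
            "state": "s.state",
            "code": "s.code",
            "name": "s.name",
            "value": "s.value",
            "insert_datetime": "s.insert_datetime",  # or current_date() if you stamp the load date
            "is_current": "'Y'",
            "ExpiryDate": "DATE'9999-12-31'"
        })
        .execute()
    )

    # Deletes just expire the current row.
    deletes = latest_changes.filter(F.col("_change_type") == "delete")
    (
        silver.alias("t")
        .merge(
            deletes.alias("d"),
            "t.state = d.state AND t.code = d.code AND t.name = d.name "
            "AND t.is_current = 'Y'"
        )
        .whenMatchedUpdate(set={"is_current": "'N'", "ExpiryDate": "current_date()"})
        .execute()
    )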

Upvotes: 0
