Gavin Wilson

Reputation: 522

Streams in Snowflake: Capturing all changes to a row, not only net changes until present?

(Submitting on behalf of a Snowflake User - hoping to find a solution to expedite resolution)


We are investigating the use of streams to implement a slowly changing dimension.

The following articles were helpful, but we're looking for a way to build a more exhaustive change capture:

  1. https://www.snowflake.com/blog/building-a-type-2-slowly-changing-dimension-in-snowflake-using-streams-and-tasks-part-1/
  2. https://community.snowflake.com/s/article/Building-a-Type-2-Slowly-Changing-Dimension-in-Snowflake-Using-Streams-and-Tasks-Part-2

Consider this example:

1) Insert into MyTable row 1

2) Select * from StreamOfMyTable <====== Returns the insert, as expected

3) Update MyTable where Row = 1

4) Select * from StreamOfMyTable

Step 4 shows only the net change: a single INSERT containing the updated values from step 3. If we were to consume the stream at this point, we would have no way of knowing that row 1 originally had the values given to it in step 1.
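For concreteness, here is the same sequence as a runnable sketch in Snowflake SQL (the table, column, and stream names are illustrative):

    -- Illustrative names; adjust to your schema.
    CREATE OR REPLACE TABLE MyTable (id INT, val STRING);
    CREATE OR REPLACE STREAM StreamOfMyTable ON TABLE MyTable;

    INSERT INTO MyTable VALUES (1, 'original');       -- step 1
    SELECT * FROM StreamOfMyTable;                    -- step 2: one INSERT row, as expected

    UPDATE MyTable SET val = 'updated' WHERE id = 1;  -- step 3
    SELECT * FROM StreamOfMyTable;                    -- step 4: still a single INSERT row,
                                                      -- now carrying val = 'updated';
                                                      -- the value from step 1 is gone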

In step 4 we would like to see in the stream both the original insert from step 1 and the update from step 3, so that we can build a complete SCD Type 2 rather than just the net change at the moment we consume the stream with a DML operation.

We did see that, with Time Travel, we could create a new StreamOfMyTableAtSpecificPointInTime by using the BEFORE (STATEMENT => 'my query guid') clause, but having to code this for all of our tables seems extreme. Is there a more native way to capture ALL changes?
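For reference, the Time Travel workaround looks roughly like this; the query ID placeholder would have to be tracked and substituted per table:

    -- Create a stream whose offset is rewound to just before a given statement.
    CREATE OR REPLACE STREAM StreamOfMyTableAtSpecificPointInTime
      ON TABLE MyTable
      BEFORE (STATEMENT => 'my query guid');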

NOTE:
Was asked: "Why do you need the previous values of the row to build an SCD Type 2? You should just be able to merge into your final dimension based on some ID and update the previous record to 'close it', right? The previous values should be in your final dimension?"

My response: "The stream does not contain multiple changes to the same row, only the net change to it. Try updating a row more than once before processing the stream. Notice how only the final value is in the stream."


Any recommendations on how to proceed?

Upvotes: 1

Views: 1184

Answers (3)

Marco Roy

Reputation: 5243

In September 2022, Snowflake added the SYSTEM$STREAM_BACKLOG function, which allows users to retrieve the set of table versions between the stream's current offset and the current timestamp.

Each table version represents one DML operation (INSERT, UPDATE, DELETE, or TRUNCATE), so you should be able to process each version (one at a time) into your SCD table.
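A sketch of what that might look like; the stream name is illustrative, and the exact output columns are best checked against the documentation:

    -- List the table versions between the stream's current offset and now.
    SELECT *
    FROM TABLE(SYSTEM$STREAM_BACKLOG('StreamOfMyTable'));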

Upvotes: 0

Marco Roy

Reputation: 5243

Where are the inserts and the updates coming from?

If you are building an SCD system, you should be able to get all the changes from the source (inserts, updates, and deletes).

Then, instead of applying those changes directly to a target table and trying to capture them with a stream, you should record the changes themselves in an events table and compute your SCD directly from the events.
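A minimal sketch of that pattern, with illustrative table and column names:

    -- Append-only events table: every change from the source lands here,
    -- so no intermediate version is ever lost.
    CREATE OR REPLACE TABLE MyTableEvents (
        id         INT,
        val        STRING,
        event_type STRING,        -- 'INSERT' | 'UPDATE' | 'DELETE'
        event_ts   TIMESTAMP_NTZ
    );

    -- Derive Type 2 validity windows directly from the full event history;
    -- each version is valid until the next event for the same id.
    SELECT
        id,
        val,
        event_type,
        event_ts AS valid_from,
        LEAD(event_ts) OVER (PARTITION BY id ORDER BY event_ts) AS valid_to
    FROM MyTableEvents;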

Upvotes: 0

John

Reputation: 36

Simply put, the behavior you describe is how Streams work in Snowflake. They should not be considered an audit log of the table.

You can run a task to process the stream at shorter intervals, but if there are multiple changes to the source table between runs, only the net effect of the DML operations is available in the stream when it is processed.
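For example, a task that drains the stream every minute might look roughly like this (the warehouse, stream, and history table names are illustrative):

    -- Runs on schedule, but only when the stream actually has rows to consume.
    CREATE OR REPLACE TASK process_mytable_stream
      WAREHOUSE = my_wh
      SCHEDULE  = '1 MINUTE'
    WHEN
      SYSTEM$STREAM_HAS_DATA('StreamOfMyTable')
    AS
      INSERT INTO MyTableHistory          -- consuming the stream advances its offset
      SELECT * FROM StreamOfMyTable;

    ALTER TASK process_mytable_stream RESUME;  -- tasks are created suspended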

How rapidly do you expect updates to take place in the source table for a given unique record?

Upvotes: 2
