Yasaman

Reputation: 53

Snowflake ingestion: Snowpipe/Stream/Tasks or External Tables/Stream/Tasks

For ingesting data from an external storage location into Snowflake when de-duplication is necessary, I have come across two approaches:

Option 1:

  1. Create a Snowpipe for the storage location (Azure container or S3 bucket) which is automatically triggered by event notifications (Azure event grid and queues or AWS SQS) and copy data into a staging table in Snowflake
  2. Create a Stream for this staging table to capture change data
  3. Periodically run a task that consumes the Stream data and merges (upserts) data into the destination table based on the primary key
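A minimal sketch of Option 1 might look like the following. All object names, the stage, the file format, and the key/timestamp columns here are assumptions for illustration, not anything from the actual setup:

    -- Staging table that Snowpipe loads into (hypothetical columns)
    CREATE TABLE IF NOT EXISTS staging_orders (
        id NUMBER,
        status STRING,
        updated_at TIMESTAMP_NTZ
    );

    -- Snowpipe with auto-ingest; relies on the cloud event notifications described above
    CREATE PIPE orders_pipe AUTO_INGEST = TRUE AS
        COPY INTO staging_orders
        FROM @my_ext_stage/orders/
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- Stream that captures rows appended to the staging table
    CREATE STREAM staging_orders_stream ON TABLE staging_orders APPEND_ONLY = TRUE;

    -- Task that periodically upserts the stream contents into the destination table
    CREATE TASK merge_orders
        WAREHOUSE = my_wh
        SCHEDULE = '15 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('STAGING_ORDERS_STREAM')
    AS
        MERGE INTO orders AS dst
        USING (
            -- keep only the newest row per key in case one batch contains duplicates
            SELECT *
            FROM staging_orders_stream
            QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1
        ) AS src
        ON dst.id = src.id
        WHEN MATCHED THEN UPDATE SET dst.status = src.status, dst.updated_at = src.updated_at
        WHEN NOT MATCHED THEN INSERT (id, status, updated_at) VALUES (src.id, src.status, src.updated_at);

    ALTER TASK merge_orders RESUME;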

Option 2:

  1. Create an external table with automatic refresh through event notifications (Azure event grid and queues or AWS SQS)
  2. Create a Stream for this external table to capture change data
  3. Periodically run a task that consumes the Stream data and merges (upserts) data into the destination table based on the primary key
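Option 2 differs mainly in the first two steps: an external table with AUTO_REFRESH replaces the pipe-plus-staging-table, and the stream on an external table must be insert-only. A rough sketch, again with hypothetical names and columns (the external table exposes each file row through its VALUE variant column; on Azure the auto-refresh also needs a notification integration):

    -- External table over the stage, refreshed by cloud event notifications
    CREATE EXTERNAL TABLE orders_ext
        LOCATION = @my_ext_stage/orders/
        FILE_FORMAT = (TYPE = 'PARQUET')
        AUTO_REFRESH = TRUE;

    -- Streams on external tables must be insert-only
    CREATE STREAM orders_ext_stream ON EXTERNAL TABLE orders_ext INSERT_ONLY = TRUE;

    -- Task that upserts new file rows into the destination, casting out of VALUE
    CREATE TASK merge_orders_ext
        WAREHOUSE = my_wh
        SCHEDULE = '15 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_EXT_STREAM')
    AS
        MERGE INTO orders AS dst
        USING (
            SELECT value:id::NUMBER                 AS id,
                   value:status::STRING             AS status,
                   value:updated_at::TIMESTAMP_NTZ  AS updated_at
            FROM orders_ext_stream
            QUALIFY ROW_NUMBER() OVER (PARTITION BY value:id ORDER BY value:updated_at DESC) = 1
        ) AS src
        ON dst.id = src.id
        WHEN MATCHED THEN UPDATE SET dst.status = src.status, dst.updated_at = src.updated_at
        WHEN NOT MATCHED THEN INSERT (id, status, updated_at) VALUES (src.id, src.status, src.updated_at);

    ALTER TASK merge_orders_ext RESUME;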

I believe that if the MERGE statement weren't necessary to enforce the primary key and remove duplicates, Snowpipe would be the clear winner, because it copies changed data directly into a table in one step. However, since staging and merging the data is necessary, which option is better?

Thank you!

Upvotes: 2

Views: 1179

Answers (1)

Mo Agarwal

Reputation: 1

We're using the external table approach. Our costs are significantly higher than anticipated, though this may be because we chose to use serverless tasks, given the volume of tables we're refreshing from ADLS.

Our Snowflake rep suggested going the Snowpipe route. The answer is always to test and see what works best for you.

One thought is that you could avoid the merge task entirely: insert all data and build a view or dynamic table that shows the latest version of each record.
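As a rough illustration of that idea (the names and the updated_at ordering column are assumptions, and it presumes the raw rows land in an append-only table such as the staging table from Option 1):

    -- A view that exposes only the latest row per key
    CREATE OR REPLACE VIEW orders_latest AS
    SELECT *
    FROM staging_orders
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1;

    -- Or a dynamic table that Snowflake keeps refreshed to a target lag
    CREATE OR REPLACE DYNAMIC TABLE orders_latest_dt
        TARGET_LAG = '15 minutes'
        WAREHOUSE = my_wh
        AS
        SELECT *
        FROM staging_orders
        QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1;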

Upvotes: 0
