Reputation: 151
What is the difference in performance between the two cases below, and which one is more cost-optimised in terms of both execution and storage?
Upvotes: 1
Views: 2783
Reputation: 465
I tested two scenarios for loading 50 GB of Parquet data on S3 into a Snowflake table (XS warehouse): COPY INTO from a stage, and INSERT ... SELECT from an external table.
In this test, the INSERT from the external table was approximately 28% slower than COPY INTO.
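For context, here is a minimal sketch of what the two scenarios look like in Snowflake SQL. All stage, table and column names are made up for illustration; the actual test setup may have differed.

    -- Shared setup: an external stage over the Parquet files (hypothetical names)
    CREATE OR REPLACE STAGE my_parquet_stage
      URL = 's3://my-bucket/parquet/'
      FILE_FORMAT = (TYPE = PARQUET);

    -- Scenario 1: COPY INTO a native Snowflake table
    COPY INTO my_table
      FROM @my_parquet_stage
      FILE_FORMAT = (TYPE = PARQUET)
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

    -- Scenario 2: external table over the same files, then INSERT ... SELECT
    CREATE OR REPLACE EXTERNAL TABLE my_ext_table
      WITH LOCATION = @my_parquet_stage
      FILE_FORMAT = (TYPE = PARQUET)
      AUTO_REFRESH = FALSE;

    INSERT INTO my_table
    SELECT value:id::NUMBER        AS id,      -- external tables expose each row as a VARIANT "value" column
           value:payload::VARCHAR  AS payload
    FROM my_ext_table;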
Upvotes: 1
Reputation: 1
This is an old question but I wanted to add a different point of view where external tables make some sense. If it didn't make sense for some users, Snowflake wouldn't have spent precious resources creating it in the first place, right?
I'm currently working on ingesting a constant stream of data, similar to IoT devices but more focused in purpose. Blob storage is the primary store for this data, and it will grow constantly. However, not all of the data is useful: we have a separate data ingestion pipeline that tells us which devices, and which time ranges, contain useful data.
At this point, there are two solutions we could use (ruling out simply copying everything, as the useless data is truly useless and quite large). We could create dynamic COPY INTO scripts that copy in only the useful data in batch. There's some complexity here due to the variable nature of "useful data", and we'd still need to join this data afterwards, but at least it would be in Snowflake.
What we did instead was set up the blob storage as external tables and join it directly to the table that defines the useful data, saving the result into an incremental table (dbt). The blob storage is partitioned for this query. The only downsides are that you need to refresh the external table's metadata to pick up new files (we don't want auto-refresh, as the data is streaming and that would require a warehouse to be up constantly), and that there is some performance hit from querying the external table, although that doesn't really matter since it's all done in batch anyway.
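To make the pattern concrete, here is a rough sketch of that kind of setup. All object names, the partition expression and the "useful data" table are invented for illustration, not our actual implementation.

    -- External table over the partitioned blob storage; refresh is manual rather than automatic
    CREATE OR REPLACE EXTERNAL TABLE raw_device_readings (
      part_date  DATE          AS (TO_DATE(SPLIT_PART(metadata$filename, '/', 3), 'YYYY-MM-DD')),
      device_id  VARCHAR       AS (value:device_id::VARCHAR),
      reading_ts TIMESTAMP_NTZ AS (value:reading_ts::TIMESTAMP_NTZ),
      payload    VARIANT       AS (value:payload)
    )
    PARTITION BY (part_date)
    WITH LOCATION = @device_blob_stage
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = FALSE;

    -- Run on the batch schedule so new files become visible before the join
    ALTER EXTERNAL TABLE raw_device_readings REFRESH;

    -- Body of the incremental (dbt) model: keep only the windows flagged as useful
    SELECT r.device_id, r.reading_ts, r.payload
    FROM raw_device_readings r
    JOIN useful_data_windows w
      ON  r.device_id  = w.device_id
      AND r.reading_ts BETWEEN w.start_ts AND w.end_ts
    WHERE r.part_date >= CURRENT_DATE - 7;   -- partition pruning on the external table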
I'd like to do a longer-term cost comparison, but for now it isn't worth looking into because it costs so little.
Upvotes: 0
Reputation: 9818
OK - if you want to use the power of the Snowflake platform as much as possible (pushdown optimisation), then you need to get your data into Snowflake as efficiently as possible first, and then run your SQL queries (join, filter, aggregate, etc.) against it. Use COPY to move your S3/Azure/Google files into Snowflake tables and then run INSERT ... SELECT against these.
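As a rough illustration of that flow (object names are placeholders, not a definitive implementation): load once with COPY, then do all the heavy SQL against the native table.

    -- Load the raw files from the cloud stage into a native table once
    COPY INTO sales_raw
      FROM @my_s3_stage/sales/
      FILE_FORMAT = (TYPE = PARQUET)
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

    -- Joins, filters and aggregations then run against Snowflake-managed, optimised storage
    INSERT INTO sales_daily_summary
    SELECT order_date, region, SUM(amount) AS total_amount
    FROM sales_raw
    WHERE order_date >= '2023-01-01'
    GROUP BY order_date, region;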
There is no reason to create EXTERNAL tables and, if you do, they will perform much worse than the approach I have proposed.
External Tables - short explanation
For the sake of simplicity, let's assume that your Snowflake instance is running on AWS and you also have some files in an S3 bucket.
All of your Snowflake data is stored in S3 by Snowflake, but in a heavily compressed and optimised format. Snowflake holds metadata about where your data is and what it contains, which allows it to present your data as tables/columns.
An external table is basically the same thing: Snowflake holds metadata about the files in your S3 bucket that allows it to present the data as tables/columns. The difference is that the data in your bucket is not held in Snowflake's compressed, optimised format, which is why queries against external tables perform worse.
Hope this helps?
Upvotes: 2