Reputation: 151
What is the difference in performance between the two cases below, and which one is more cost-optimised in terms of both execution and storage?
Upvotes: 1
Views: 2783
Reputation: 465
I tested two scenarios for loading 50 GB of Parquet data on S3 into a Snowflake table (XS warehouse): COPY INTO from a stage, and INSERT ... SELECT from an external table.
In this test, the INSERT from the external table was approximately 28% slower than COPY INTO.
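For context, here is a minimal sketch of what the two scenarios look like in Snowflake SQL. All stage, table and column names are made up for illustration; the actual test setup may have differed.

    -- Shared setup: an external stage over the Parquet files (hypothetical names)
    CREATE OR REPLACE STAGE my_parquet_stage
      URL = 's3://my-bucket/parquet/'
      FILE_FORMAT = (TYPE = PARQUET);

    -- Scenario 1: COPY INTO a native Snowflake table
    COPY INTO my_table
      FROM @my_parquet_stage
      FILE_FORMAT = (TYPE = PARQUET)
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

    -- Scenario 2: external table over the same files, then INSERT ... SELECT
    CREATE OR REPLACE EXTERNAL TABLE my_ext_table
      WITH LOCATION = @my_parquet_stage
      FILE_FORMAT = (TYPE = PARQUET)
      AUTO_REFRESH = FALSE;

    INSERT INTO my_table
    SELECT value:id::NUMBER        AS id,      -- external tables expose each row as a VARIANT "value" column
           value:payload::VARCHAR  AS payload
    FROM my_ext_table;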
Upvotes: 1
Reputation: 1
This is an old question but I wanted to add a different point of view where external tables make some sense. If it didn't make sense for some users, Snowflake wouldn't have spent precious resources creating it in the first place, right?
I'm currently working on ingesting a constant stream of data, similar to IoT devices but more focused in purpose. Blob storage is the primary store for this data, and it will grow constantly. However, not all of the data is useful: we have a separate data ingestion pipeline that tells us which devices, and which time ranges, contain useful data.
At this point, there are two solutions we could use (ruling out simply copying everything, as the useless data is truly useless and quite large). We could create dynamic COPY INTO scripts that copy in only the useful data in batch. There's some complexity here due to the variable nature of "useful data", and we'd still need to join this data afterwards, but at least it would be in Snowflake.
What we did instead was set up the blob storage as external tables and join it directly to the table that defines the useful data, saving the result into an incremental table (dbt). The blob storage is partitioned for this query. The only downsides are that you need to refresh the external table's metadata to pick up new files (we don't want auto-refresh, as the data is streaming and that would require a warehouse to be up constantly), and that there is some performance hit from querying the external table, although that doesn't really matter since it's all done in batch anyway.
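To make the pattern concrete, here is a rough sketch of that kind of setup. All object names, the partition expression and the "useful data" table are invented for illustration, not our actual implementation.

    -- External table over the partitioned blob storage; refresh is manual rather than automatic
    CREATE OR REPLACE EXTERNAL TABLE raw_device_readings (
      part_date  DATE          AS (TO_DATE(SPLIT_PART(metadata$filename, '/', 3), 'YYYY-MM-DD')),
      device_id  VARCHAR       AS (value:device_id::VARCHAR),
      reading_ts TIMESTAMP_NTZ AS (value:reading_ts::TIMESTAMP_NTZ),
      payload    VARIANT       AS (value:payload)
    )
    PARTITION BY (part_date)
    WITH LOCATION = @device_blob_stage
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = FALSE;

    -- Run on the batch schedule so new files become visible before the join
    ALTER EXTERNAL TABLE raw_device_readings REFRESH;

    -- Body of the incremental (dbt) model: keep only the windows flagged as useful
    SELECT r.device_id, r.reading_ts, r.payload
    FROM raw_device_readings r
    JOIN useful_data_windows w
      ON  r.device_id  = w.device_id
      AND r.reading_ts BETWEEN w.start_ts AND w.end_ts
    WHERE r.part_date >= CURRENT_DATE - 7;   -- partition pruning on the external table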
I'd like to do a longer-term cost comparison, but for now it isn't worth looking into because it costs so little.
Upvotes: 0
Reputation: 9818
OK - if you want to use the power of the Snowflake platform as much as possible (pushdown optimisation), then you need to get your data into Snowflake as efficiently as possible first, and then run your SQL queries (join, filter, aggregate, etc.) against it. Use COPY to move your S3/Azure/Google files into Snowflake tables and then run INSERT ... SELECT against these.
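As a rough illustration of that flow (object names are placeholders, not a definitive implementation): load once with COPY, then do all the heavy SQL against the native table.

    -- Load the raw files from the cloud stage into a native table once
    COPY INTO sales_raw
      FROM @my_s3_stage/sales/
      FILE_FORMAT = (TYPE = PARQUET)
      MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

    -- Joins, filters and aggregations then run against Snowflake-managed, optimised storage
    INSERT INTO sales_daily_summary
    SELECT order_date, region, SUM(amount) AS total_amount
    FROM sales_raw
    WHERE order_date >= '2023-01-01'
    GROUP BY order_date, region;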
There is no reason to create EXTERNAL tables and, if you do, they will perform much worse than the approach I have proposed.
External Tables - short explanation
For the sake of simplicity, let's assume that your Snowflake instance is running on AWS and you also have some files in an S3 bucket.
All of your Snowflake data is stored in S3 by Snowflake, but in a heavily compressed and optimised format. Snowflake holds metadata about where your data is and what it contains, which allows it to present your data as tables/columns.
An external table is basically the same thing: Snowflake holds metadata about the files in your S3 bucket that allows it to present the data as tables/columns. The difference is that the data in your bucket is not held in Snowflake's compressed, optimised format, which is why queries against external tables perform worse.
Hope this helps?
Upvotes: 2