Gessler Frederic

Reputation: 43

Snowflake query performance is unexpectedly slower for external Parquet tables vs. internal tables

When I run queries on external Parquet tables in Snowflake, they are orders of magnitude slower than on the same tables copied into Snowflake, and slower than on any other cloud data warehouse I have tested against the same files.

Context:

I have the tables of the 10TB TPC-DS dataset in Parquet format on GCS and a Snowflake account in the same region (US Central). I loaded those tables into Snowflake using CREATE TABLE AS SELECT. I can run TPC-DS queries (here, #28) on these internal tables with excellent performance. I was also able to query the files on GCS directly with data lake engines, again with excellent performance, as the files are "optimally" sized and internally sorted. However, when I query the same external tables in Snowflake, the query does not seem to finish in reasonable time (>4 minutes and counting, as opposed to 30 seconds on the same virtual warehouse). Looking at the query profile, the number of records read in the table scans keeps growing indefinitely, resulting in a proportional amount of spilling to disk.
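For reference, the internal copies were created with a plain CTAS along these lines (the _internal table name is illustrative; in practice you would list the columns or exclude the raw VALUE column of the external table):

-- sketch only: materialize the external table into a native Snowflake table
create or replace table tpc_db.tpc_ds.store_sales_internal as
select * exclude (value)
from tpc_db.tpc_ds.store_sales;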

The table happens to be partitioned, but that does not matter for the query of interest (which I tested with other engines).

What I would expect:

Assuming proper data "formatting", I would expect no major performance degradation compared to internal tables, as the setup is technically the same - data stored in columnar format in a cloud object store - and it is advertised as such by Snowflake. For example, I saw no performance degradation with BigQuery in the exact same experiment.

Other than double-checking my setup, I don't see many things to try...

This is what the "in progress" part of the plan looks like 4 minutes into execution on the external table. All other operators are at 0% progress. You can see that external bytes scanned = bytes spilled and that 26G(!) rows have been produced. And this is what it looked like for a finished execution on the internal table, which ran in ~20 seconds. You can see that the left-most table scan should produce 1.4G rows, but with the external table it had already produced 23G rows.
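For anyone who wants the raw numbers rather than screenshots, the per-operator row counts and spill volumes can also be pulled in SQL; a sketch, assuming the GET_QUERY_OPERATOR_STATS table function is available in the account, with the query ID as a placeholder:

-- sketch: per-operator statistics for the slow external-table run
select operator_id,
       operator_type,
       operator_statistics:output_rows as output_rows,
       operator_statistics             as all_stats
from table(get_query_operator_stats('<query_id_of_external_table_run>'))
order by step_id, operator_id;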

This is a sample of the DDL I used (I also tested without defining the partitioning column):

create or replace external table tpc_db.tpc_ds.store_sales (
    -- partition column, parsed from the hive-style "ss_sold_date_sk=<value>" segment of the file path
    ss_sold_date_sk bigint as
      cast(split_part(split_part(metadata$filename, '/', 4), '=', 2) as bigint),
    ss_sold_time_sk bigint as (value:ss_sold_time_sk::bigint),
    ss_item_sk bigint as (value:ss_item_sk::bigint),
    ss_customer_sk bigint as (value:ss_customer_sk::bigint),
    ss_cdemo_sk bigint as (value:ss_cdemo_sk::bigint),
    ss_hdemo_sk bigint as (value:ss_hdemo_sk::bigint),
    ss_addr_sk bigint as (value:ss_addr_sk::bigint),
    ss_store_sk bigint as (value:ss_store_sk::bigint),
    ss_promo_sk bigint as (value:ss_promo_sk::bigint),
    ss_ticket_number bigint as (value:ss_ticket_number::bigint),
    ss_quantity bigint as (value:ss_quantity::bigint),
    ss_wholesale_cost double as (value:ss_wholesale_cost::double),
    ss_list_price double as (value:ss_list_price::double),
    ss_sales_price double as (value:ss_sales_price::double),
    ss_ext_discount_amt double as (value:ss_ext_discount_amt::double),
    ss_ext_sales_price double as (value:ss_ext_sales_price::double),
    ss_ext_wholesale_cost double as (value:ss_ext_wholesale_cost::double),
    ss_ext_list_price double as (value:ss_ext_list_price::double),
    ss_ext_tax double as (value:ss_ext_tax::double),
    ss_coupon_amt double as (value:ss_coupon_amt::double),
    ss_net_paid double as (value:ss_net_paid::double),
    ss_net_paid_inc_tax double as (value:ss_net_paid_inc_tax::double),
    ss_net_profit double as (value:ss_net_profit::double)
)
    partition by (ss_sold_date_sk)
    with location = @tpc_ds/store_sales/
    file_format = (type = parquet)
    auto_refresh = false
    pattern = '.*sales.*[.]parquet';
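One note for completeness: with auto_refresh = false, the set of files backing the external table is only picked up on an explicit refresh, e.g.:

-- register newly added files in the external table's metadata
alter external table tpc_db.tpc_ds.store_sales refresh;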

Upvotes: 4

Views: 3490

Answers (3)

Triamus

Reputation: 2505

I don't know what your query looks like, but there is also a small chance that you are hitting a known issue where Snowflake treats a function in the partition filter as dynamic and therefore scans all the data. See the details in Joao Marques' blog on Medium: Using Snowflake External Tables? You must read this!

Example of how not to do it from the blog

SELECT COUNT(*) 
FROM EXTERNAL_TABLE 
WHERE PARTITION_KEY = dateadd(day, -1, current_date)

Example of how to do it from the blog

SET my_var = (select dateadd(day, -1, current_date));
SELECT COUNT(*) 
FROM EXTERNAL_TABLE 
WHERE PARTITION_KEY = $my_var

All credit to the blog author; I merely stumbled across this issue myself and found his blog.
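Applied to the table in the question, that would mean comparing the partition column against a literal or a pre-computed variable rather than an expression, for example (the date_sk value is just illustrative):

-- partition filter as a constant, so it can be evaluated statically
select count(*)
from tpc_db.tpc_ds.store_sales
where ss_sold_date_sk = 2451545;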

Upvotes: 0

Simeon Pilgrim

Reputation: 25928

Snowflake didn't originally have external file queries, nor did it originally have Parquet support. I seem to remember that when external queries arrived, the approach was simply to read 100% of the files into the system and then start processing, which aligns with what you are seeing. That was still a blessing, because the prior state was having to load all the files into a staging table and then run a filter on that; for a one-off query (though it almost never stays a one-off in the end), being able to execute SQL against the raw files was rather helpful.

Yes, it should be possible to optimize Parquet file reads by gathering the metadata first and then eliminating wasteful partition reads, but that is not the order in which the feature evolved. So I am not surprised by your findings.

I would never suggest using an external data model for general day-to-day Snowflake operations, as it is not presently optimized for that. Loading the data into Snowflake makes sense for two reasons: the disk cost of storing it in Snowflake is the same as storing it in S3, and Snowflake then has complete control over metadata and read/write synchronization between nodes, which all amounts to performance.
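As a rough sketch of that alternative (the target table name is made up and assumed to already exist with matching column names), the files can be bulk-loaded once and the workload pointed at the native table:

-- sketch: load the staged Parquet files into a native table by column name
copy into tpc_db.tpc_ds.store_sales_native
from @tpc_ds/store_sales/
file_format = (type = parquet)
match_by_column_name = case_insensitive
pattern = '.*sales.*[.]parquet';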

Also, spilling to local storage is not bad per se; spilling to remote storage is the worst kind of spilling. But it does appear that you are getting the effective result of a full file import followed by processing.

Upvotes: 2

Iñigo González

Reputation: 3955

Probably Snowflake's plan assumes it must read every Parquet file because it cannot tell beforehand whether the files are sorted, or the number of unique values, nulls, and minimum and maximum values for each column, etc.

This information is stored as optional fields in Parquet, but you'd need to read the Parquet metadata first to find out.
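As an aside, the schema part of that footer metadata can be inspected from Snowflake itself; a sketch, assuming a named Parquet file format (INFER_SCHEMA reads the Parquet metadata, though it does not expose the per-column min/max statistics):

-- sketch: read the Parquet footer schema for the staged files
create or replace file format tpc_parquet_ff type = parquet;

select *
from table(
  infer_schema(
    location    => '@tpc_ds/store_sales/',
    file_format => 'tpc_parquet_ff'
  )
);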

When Snowflake uses internal tables, it has full control over storage and has information about indexes (if any) and column statistics, so it knows how to optimize a query from both a logical and a physical perspective.

Upvotes: 0
