Rajib Deb

Reputation: 1774

Partitioning a large table in Snowflake through a custom partitioner

We have a large table in Snowflake with more than 55 billion records. Users retrieve data from this table by providing YEAR and SERIAL_NUMBER as predicates, and it takes a long time to retrieve the records. We do not want to enable auto-clustering because it is becoming expensive for us, so we have come up with the following approach and would like input on it.

We are dividing the table into multiple partitions. For example, if the table name is TABLE_XYZ, we are creating physical tables for each year: TABLE_XYZ_2016, TABLE_XYZ_2017, TABLE_XYZ_2018, TABLE_XYZ_2019, TABLE_XYZ_2020. Only the latest year's table changes frequently; the others are largely untouched. We have written a partitioner/splitter that reads a similar source table, splits the records by year, and loads them into the corresponding year table, ordering by year and serial_number. We have also created a view on top of all these tables. The idea is that people who need all years will go against the view, and people who are interested in a certain year will go against the physical year table. We expect this to reduce the time needed to retrieve the data. However, if there is a new year (let's say 2021), we will have to create that table and also redefine the view. I have not been able to find a way to create a dynamic view.
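A minimal sketch of that layout follows. TABLE_XYZ_ALL is a placeholder name for the view, and the INSERT stands in for whatever the custom splitter actually does:

    -- Hypothetical per-year table with the same structure as the source
    CREATE TABLE TABLE_XYZ_2020 LIKE TABLE_XYZ;

    -- The splitter loads each year's rows ordered by the lookup keys
    INSERT INTO TABLE_XYZ_2020
    SELECT *
    FROM TABLE_XYZ
    WHERE YEAR = 2020
    ORDER BY YEAR, SERIAL_NUMBER;

    -- View over all per-year tables, for users who need every year
    CREATE OR REPLACE VIEW TABLE_XYZ_ALL AS
    SELECT * FROM TABLE_XYZ_2016
    UNION ALL SELECT * FROM TABLE_XYZ_2017
    UNION ALL SELECT * FROM TABLE_XYZ_2018
    UNION ALL SELECT * FROM TABLE_XYZ_2019
    UNION ALL SELECT * FROM TABLE_XYZ_2020;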

Please let me know if this approach has any loopholes.

Thanks

Upvotes: 2

Views: 18211

Answers (4)

Emanuel Oliveira

Reputation: 116

-- clustering best practices.md

  1. Always insert ordered data into tables (INSERT ... ORDER BY; or, with COPY, make sure the files are physically produced with ordered records). A sketch follows this list.
  2. FYI, just doing step 1 typically bypasses the need for clustering, as Snowflake micro-partitions are then sufficient to prune.
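A minimal sketch of point 1, using the question's TABLE_XYZ and key columns (STAGING_TABLE_XYZ and @my_stage are hypothetical names):

    -- Insert ordered, so each micro-partition covers a narrow (YEAR, SERIAL_NUMBER) range
    INSERT INTO TABLE_XYZ
    SELECT *
    FROM STAGING_TABLE_XYZ
    ORDER BY YEAR, SERIAL_NUMBER;

    -- With COPY, the ordering has to happen outside Snowflake: produce the files
    -- already sorted by (YEAR, SERIAL_NUMBER), then load them as usual
    COPY INTO TABLE_XYZ FROM @my_stage/ordered/ FILE_FORMAT = (TYPE = CSV);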

Enabling a cluster key:

  1. Test table clustering with different sets of simulated cluster keys if needed. The depth of a perfectly ordered table (whether or not clustering is enabled on it) is 1; if you see depths with 2, 3, or more digits, your ingestion is messing up the order, either through lots of DML or through COPYs of unordered rows: https://docs.snowflake.com/en/sql-reference/functions/system_clustering_depth

  2. If SYSTEM$CLUSTERING_DEPTH() confirms a mess, rebuild the table instead of paying thousands of dollars to enable clustering on the initial big unordered table. You can do this in two ways (see the sketch after this list):

2.1 create table ordered_table_as_should_had_happened_from_beginning as select * from table_xyz ORDER BY the_right_set_of_columns;

with clustering still disabled. (If you really do need clustering, because you can't get the external files produced in order or have no means to order the inbound files yourself after they land, you can enable it, and then the CTAS ORDER BY columns are obviously the same as the cluster key columns.)

2.2 Or, if you have all 3 years of files, simply: create table new_xyz like table_xyz;

and COPY-load all the files again. Whether the initial load is a single file or 3 yearly files, a simple COPY (or 3 COPYs) will be very fast and cheaper than paying Snowflake compute to recluster.

  3. Finally, fix the second possible cause of a messed-up table order (again: whether or not clustering is enabled): check whether any DML is happening on the table. If yes, review that code as summarized below.
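A minimal sketch of the depth check and the CTAS rebuild described above, using the question's table and keys (TABLE_XYZ_ORDERED and the SWAP step are assumptions):

    -- Depth of 1 means perfectly ordered for these keys; large depths mean ingestion broke the order
    SELECT SYSTEM$CLUSTERING_DEPTH('TABLE_XYZ', '(YEAR, SERIAL_NUMBER)');

    -- Rebuild once, fully ordered, with clustering still disabled
    CREATE TABLE TABLE_XYZ_ORDERED AS
    SELECT * FROM TABLE_XYZ
    ORDER BY YEAR, SERIAL_NUMBER;

    -- Swap the rebuilt table into place under the original name
    ALTER TABLE TABLE_XYZ SWAP WITH TABLE_XYZ_ORDERED;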

Hope this helps. So, in short:

  • Feed ordered files/data.
  • Test whether clustering is really needed.
  • Review your code if DML is happening on the table, looking for missing ORDER BYs, or reduce churn by delaying merges or using a lambda-join view.
  • Fix the ordering by reloading (an external cost) instead of paying a lot of Snowflake compute to do that easy external work.

Cheers ~

Upvotes: 0

Simon D

Reputation: 6269

If the history that was loaded into your table is unchanging, then the cost of reclustering the table will be at its highest on the first recluster. It should become cheaper over time, right?

If you don't want to use auto re-clustering for some reason, why not do what you're doing now but into a single table? Run a series of SELECT queries that order by YEAR and SERIAL_NUMBER and insert into a new table. To me, it doesn't make sense to split it into multiple tables.
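A minimal sketch of that single-table rebuild, using the question's table and columns (TABLE_XYZ_REORDERED is a hypothetical name):

    CREATE TABLE TABLE_XYZ_REORDERED LIKE TABLE_XYZ;

    -- One ordered insert per year keeps each batch sorted on the lookup keys
    INSERT INTO TABLE_XYZ_REORDERED
    SELECT * FROM TABLE_XYZ
    WHERE YEAR = 2016
    ORDER BY YEAR, SERIAL_NUMBER;

    -- ...repeat for 2017 through 2020, then swap or rename the new table into place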

Upvotes: 0

MMV

Reputation: 980

I don't think any database offers views addressing non-existent tables ;) What you can do is:

(easiest): create 10 or 20 empty "future" tables and add them to your view. They are going to be empty, so they won't affect performance (a sketch follows below).

(harder): create a job that runs around the end of the year, creates the new yearly table, and recreates the view.
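A minimal sketch of the easier option, following the naming scheme from the question (TABLE_XYZ_ALL is the assumed view name):

    -- Pre-create empty "future" tables so the view never references a missing table
    CREATE TABLE IF NOT EXISTS TABLE_XYZ_2021 LIKE TABLE_XYZ_2020;
    CREATE TABLE IF NOT EXISTS TABLE_XYZ_2022 LIKE TABLE_XYZ_2020;

    -- Recreate the view once; the empty tables contribute zero rows until data arrives
    CREATE OR REPLACE VIEW TABLE_XYZ_ALL AS
    SELECT * FROM TABLE_XYZ_2016
    UNION ALL SELECT * FROM TABLE_XYZ_2017
    UNION ALL SELECT * FROM TABLE_XYZ_2018
    UNION ALL SELECT * FROM TABLE_XYZ_2019
    UNION ALL SELECT * FROM TABLE_XYZ_2020
    UNION ALL SELECT * FROM TABLE_XYZ_2021
    UNION ALL SELECT * FROM TABLE_XYZ_2022;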

Upvotes: 0

John Ryan

Reputation: 56

There's a far simpler way of achieving incredible performance using cluster keys. See the documentation here: https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html

In summary, create a cluster key on (YEAR, SERIAL_NUMBER), and in the background Snowflake will automatically recluster (sort) the data by those keys. Initially, this will probably re-sequence the entire table, but subsequently it will only re-sort newly added values.
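A minimal sketch of defining that cluster key, using the table and column names from the question (the monitoring call is optional):

    -- Define the cluster key; automatic clustering then maintains the order in the background
    ALTER TABLE TABLE_XYZ CLUSTER BY (YEAR, SERIAL_NUMBER);

    -- Check how well the table is clustered on those keys
    SELECT SYSTEM$CLUSTERING_INFORMATION('TABLE_XYZ', '(YEAR, SERIAL_NUMBER)');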

I did a test on a billion-row table where I created a unique sequence and put a cluster key on it. Performing random look-ups by that key, each fetch returned in milliseconds with just one I/O.

Snowflake also has a customer in the USA with over a PETABYTE of data in a single table (yes, 1,000 terabytes) which uses the same technique to achieve terrific query performance against the most recently added data.

To help tune the performance, I'd recommend (if possible) loading the data in YEAR order as it arrives. I.e., if you're doing a bulk load of several years, load one year at a time. Otherwise, don't worry.

You can then remove your physical partitioning and views and have Snowflake keep the entire solution clean and automatically updated.

You will find the background clustering has an initial cost to sort the data, but subsequently there should be little cost involved, and the performance gains will be worth the effort.

Upvotes: 4
