Reputation: 1774
We have a large table in Snowflake with more than 55 billion records. Users retrieve data from this table by providing YEAR and SERIAL_NUMBER as predicates. It takes a lot of time to retrieve the records. We do not want to enable auto-clustering because it is becoming expensive for us. We have therefore come up with the following approach and would like input on it.
We are dividing the table into multiple partitions. For example, if the table name is TABLE_XYZ, we are creating physical tables for each year: TABLE_XYZ_2016, TABLE_XYZ_2017, TABLE_XYZ_2018, TABLE_XYZ_2019, TABLE_XYZ_2020. Only the latest year's table changes frequently; the others are largely untouched. We have written a partitioner/splitter which reads the similarly structured source table, splits the records by year, and loads them into the corresponding year table ordered by year and serial_number. We have also created a view on top of all these tables. The idea is that people who need all years will go against the view, and people who are interested in a certain year will go against that year's physical table. We expect this will reduce the time to retrieve the data. However, when a new year starts (let's say 2021), we will have to create that table and also redefine the view. I have not been able to find a way to create a dynamic view.
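For reference, a minimal sketch of the layout described above (the view name TABLE_XYZ_ALL is just illustrative, not from the question):

    -- One physical table per year, plus a union view over them
    CREATE OR REPLACE VIEW TABLE_XYZ_ALL AS
        SELECT * FROM TABLE_XYZ_2016
        UNION ALL
        SELECT * FROM TABLE_XYZ_2017
        UNION ALL
        SELECT * FROM TABLE_XYZ_2018
        UNION ALL
        SELECT * FROM TABLE_XYZ_2019
        UNION ALL
        SELECT * FROM TABLE_XYZ_2020;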
Please let me know if this approach has any loopholes.
Thanks
Upvotes: 2
Views: 18211
Reputation: 116
Some clustering best practices before enabling a cluster key:
Test the table's clustering with the candidate cluster keys first (you can check this whether or not clustering is enabled on the table). In a perfectly ordered table the clustering depth is 1; if you see a depth with 2, 3 or more digits, your ingestion is messing up the order, either with lots of DML or with COPYs of unordered rows: https://docs.snowflake.com/en/sql-reference/functions/system_clustering_depth
If SYSTEM$CLUSTERING_DEPTH() confirms a mess (see the sketch after this list), rebuild the table instead of enabling clustering on the big unordered table; that can save thousands of dollars. There are two ways:
1. CREATE TABLE ordered_table_as_should_had_happened_from_beginning AS SELECT * FROM table_xyz ORDER BY the_right_set_of_columns; and keep clustering disabled. (If you really can't get the external files produced in order, or have no way to order the inbound files yourself after they land, you can enable clustering afterwards; the CTAS ORDER BY columns are then obviously the same as the cluster key columns.)
2. Or, if you still have all the yearly files, simply CREATE TABLE new_xyz LIKE table_xyz; and COPY-load all the files again. Whether it is one initial-load file or one file per year, a simple COPY (or a few COPYs) will be much faster and cheaper than paying Snowflake compute to recluster.
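A one-line sketch of that depth check, assuming a candidate key of (YEAR, SERIAL_NUMBER) on the question's table:

    -- Depth of 1 = perfectly ordered on these columns; large values = many overlapping micro-partitions
    SELECT SYSTEM$CLUSTERING_DEPTH('TABLE_XYZ', '(YEAR, SERIAL_NUMBER)');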
Hope this helps.
Cheers ~
Upvotes: 0
Reputation: 6269
If the history that was loaded into your table is unchanging, then the cost of reclustering the table will be at its highest on the first recluster. It should become cheaper over time, right?
If you didn't want to use auto-reclustering for some reason, why don't you do what you're doing now, but into a single table? Run a series of SELECT queries ordered by YEAR and SERIAL_NUMBER and insert the results into a new table. To me, it doesn't make sense to split it into multiple tables.
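A rough sketch of that single-table rebuild, reusing the question's table and column names (TABLE_XYZ_ORDERED is a made-up name):

    -- New empty table with the same structure as the original
    CREATE TABLE TABLE_XYZ_ORDERED LIKE TABLE_XYZ;

    -- Load one year at a time, ordered by the lookup columns
    INSERT INTO TABLE_XYZ_ORDERED
    SELECT * FROM TABLE_XYZ
    WHERE YEAR = 2016
    ORDER BY SERIAL_NUMBER;
    -- ...repeat for 2017 through 2020, then swap the tables...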
Upvotes: 0
Reputation: 980
I don't think any database offers views that reference non-existent tables ;) What you can do is:
(easiest): create 10 or 20 empty "future" tables and add them to your view. They will be empty, so they won't affect performance.
(harder): create a job that runs around the end of the year, creates the new yearly table, and recreates the view (sketched below).
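A sketch of that second option using a Snowflake Scripting procedure plus a scheduled task; the object names (REBUILD_XYZ_VIEW, TABLE_XYZ_ALL, MY_WH) are made up, and you would want to tighten the table-name pattern to your real naming convention:

    -- Rebuilds the union view from whatever TABLE_XYZ_<year> tables currently exist
    CREATE OR REPLACE PROCEDURE REBUILD_XYZ_VIEW()
    RETURNS VARCHAR
    LANGUAGE SQL
    AS
    $$
    DECLARE
      ddl VARCHAR;
    BEGIN
      -- Make sure the current year's table exists (copies structure only)
      EXECUTE IMMEDIATE 'CREATE TABLE IF NOT EXISTS TABLE_XYZ_' || YEAR(CURRENT_DATE()) ||
                        ' LIKE TABLE_XYZ_2020';
      -- Build the view DDL from the catalog and run it
      SELECT 'CREATE OR REPLACE VIEW TABLE_XYZ_ALL AS ' ||
             LISTAGG('SELECT * FROM ' || table_name, ' UNION ALL ')
               WITHIN GROUP (ORDER BY table_name)
        INTO :ddl
        FROM information_schema.tables
       WHERE table_schema = CURRENT_SCHEMA()
         AND table_name LIKE 'TABLE_XYZ_20%';
      EXECUTE IMMEDIATE :ddl;
      RETURN :ddl;
    END;
    $$;

    -- Run it early every January so the new yearly table is picked up automatically
    CREATE OR REPLACE TASK REBUILD_XYZ_VIEW_TASK
      WAREHOUSE = MY_WH
      SCHEDULE = 'USING CRON 0 3 1 1 * UTC'
    AS
      CALL REBUILD_XYZ_VIEW();

    ALTER TASK REBUILD_XYZ_VIEW_TASK RESUME;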
Upvotes: 0
Reputation: 56
There's a far simpler way of achieving incredible performance using cluster keys. See the documentation here: https://docs.snowflake.com/en/user-guide/tables-clustering-keys.html
In summary, create a cluster key on (YEAR, SERIAL_NUMBER) and, in the background, Snowflake will automatically recluster (sort) the data by those keys. Initially, this will probably re-sequence the entire table, but subsequently it will only re-sort newly added values.
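A minimal sketch of that, assuming the table and column names from the question:

    -- Define the cluster key; Snowflake's background service keeps the data sorted from then on
    ALTER TABLE TABLE_XYZ CLUSTER BY (YEAR, SERIAL_NUMBER);

    -- Optionally check clustering quality afterwards
    SELECT SYSTEM$CLUSTERING_INFORMATION('TABLE_XYZ', '(YEAR, SERIAL_NUMBER)');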
I did a test on a billion-row table where I created a unique sequence and put a cluster key on it. Performing random look-ups by the key, each fetch returned in milliseconds with just one I/O.
Snowflake also has a customer in the USA with over a PETABYTE of data in a single table (yes, 1,000 terabytes) which uses the same technique to achieve terrific query performance against the most recently added data.
To help tune the performance, I'd recommend (if possible) loading the data in YEAR order as it arrives, i.e. if you're doing a bulk load of several years, load one year at a time. Otherwise, don't worry.
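For example, with the historical files staged by year (the stage name @my_stage is hypothetical), load them one year at a time:

    COPY INTO TABLE_XYZ FROM @my_stage/2016/;
    COPY INTO TABLE_XYZ FROM @my_stage/2017/;
    -- ...and so on, one year per COPY...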
You can then remove your physical partitioning and views and have Snowflake keep the entire solution clean and automatically updated.
You will find the background clustering has an initial cost to sort the data, but subsequently there should be little ongoing cost, and the performance gains will be worth the effort.
Upvotes: 4