snowflake_user

Reputation: 118

delete performance on a clustered table

My table details are as follows:

    Size: 8.5 TB
    Total columns: 20
    Rows: 520,712,026,254
    Cluster by DATE_PERIOD (varchar(50)): values are 'Mar 2019', 'Apr 2019', 'May 2019', ..., 'Apr 2021' (25 months of data in total)

    clustering_information:
    total_partition_count: 725451
    total_constant_partition_count: 725440
    average_overlap: 0
    average_depth: 1.0
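
(For reference, the clustering stats above come from Snowflake's SYSTEM$CLUSTERING_INFORMATION function; the table name below is a placeholder for the real one.)

    select system$clustering_information('table_name', '(date_period)');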

I want to delete the 25th (oldest) month of data and keep the most recent 24 months. The process will run monthly; for now I am testing the delete with a hardcoded value: delete from table_name where date_period = 'Mar 2019'.

But it takes more than 4 hours and eventually times out. Why does a simple delete take so long when the table is clustered on that very field? How can I improve it?

Note: I am testing the query on data cloned from QA to dev.

Upvotes: 0

Views: 626

Answers (3)

Felipe Hoffa

Reputation: 59175

I think this might be an edge case, and a good opportunity to have support engage with the specifics of your case.

In the meantime, I created a 10-billion-row clustered table while trying to reproduce the problem:

create or replace temp table ten_billionv2
cluster by (m)
as
select to_char(dateadd(minute, seq8(), '1990-01-01'), 'mon yyyy') m,  -- month label, e.g. 'Apr 2021'
       seq8() seq,
       uniform(1, 10000, random()) value,
       uniform(1, 10000, random()) value2
from table(generator(rowcount=>100000)) a,
     table(generator(rowcount=>100000)) b;  -- 100,000 x 100,000 = 10 billion rows

select *
from ten_billionv2
limit 10;

[screenshot: the first 10 rows of ten_billionv2]

This is smaller than the table you describe, but it gives us a good opportunity to see whether any of the advice in the other answers and comments applies.

  • It took less than 10 minutes to create this table with an L warehouse.
  • It's clustered by a string.
  • Cloning it was fast.
  • Deleting data from it was fast - either from the original or from the clone.
create temp table ten_billionv2_clone_2
clone ten_billionv2
--2s
;

delete from ten_billionv2_clone_2
where m='Apr 2021'
--18s
;

delete from ten_billionv2
where m='Apr 2021'
--17s
;

delete from ten_billionv2_clone_2
where m='Apr 2190'
--9s
;

delete from ten_billionv2
where m='Dec 11576'
--25s
;

delete from ten_billionv2_clone_2
where m='Dec 11576'
--28s
;


I'll stop my experiments here, but what I would try next is putting way more data in each month; maybe the time scales with the number of partitions involved. What I've learned so far is that Snowflake is smart about using the clustering metadata to scan only the relevant partitions.
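
A sketch of that follow-up experiment, folding the same 10 billion rows into just 25 distinct months so each month spans far more partitions (the table name here is made up):

create or replace temp table ten_billion_25mo
cluster by (m)
as
select to_char(dateadd(month, seq8() % 25, '2019-03-01'), 'mon yyyy') m,  -- only 25 distinct month labels
       seq8() seq,
       uniform(1, 10000, random()) value
from table(generator(rowcount=>100000)) a,
     table(generator(rowcount=>100000)) b;

-- then time the same delete against one much bigger month
delete from ten_billion_25mo
where m = 'Mar 2019';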

Upvotes: 1

Jeffrey Jacobs

Reputation: 332

  1. Try using a larger warehouse. You might also try disabling the cluster key (suspending automatic reclustering), since reclustering triggered by the delete can add resource consumption, and then re-enabling it afterwards.
  2. Deleting is expensive. Instead of DELETE, try CREATE TABLE NEW_TABLE AS (SELECT * FROM CLONED_TABLE WHERE DATE_PERIOD <> 'Mar 2019' ORDER BY DATE_PERIOD), swap the tables, and maybe add a cluster key to the new table (though it is already effectively clustered thanks to the ORDER BY). You will probably want at least a LARGE warehouse for this. Note that the ORDER BY is necessary even though the original table is clustered. Both ideas are sketched below.
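
A rough sketch of both suggestions, assuming the table is named table_name and is clustered on date_period (both placeholders):

alter table table_name suspend recluster;  -- pause background reclustering during the delete

delete from table_name where date_period = 'Mar 2019';

alter table table_name resume recluster;

And the rebuild-and-swap variant:

create or replace table table_name_new
cluster by (date_period)
as
select *
from table_name
where date_period <> 'Mar 2019'
order by date_period;  -- physical sort, so the new table starts out well clustered

alter table table_name_new swap with table_name;  -- atomic name swap

drop table table_name_new;  -- this now holds the old, unwanted data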

Upvotes: 1

peterb

Reputation: 727

I recommend making each month a separate table and then unioning them together in a view. To drop a month, you just change the view SQL to exclude that month. Another benefit is that clustering by month is automatic, and you can cluster each monthly table on a different column if needed.

create or replace view ALL_MONTHS as
select * from tbl_March2019
union all  -- UNION ALL: a plain UNION would force an expensive de-duplication
select * from tbl_April2019
...
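
Aging out March 2019 would then look like this (a sketch, reusing the hypothetical per-month table names above):

create or replace view ALL_MONTHS as
select * from tbl_April2019
union all
select * from tbl_May2019;  -- ...and so on for every month you keep

drop table tbl_March2019;  -- a metadata operation, no row-by-row delete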

Upvotes: 0
