Reputation: 118
My table size details are as follows :
Size : 8.5 TB,
total columns : 20
Rows : 520,712,026,254
Cluster by Date_Period(varchar 50) : different values are 'Mar 2019' , 'Apr 2019', 'May 2019' ..... 'Apr 2021' total 25 months of data.
clustering_information
total_partition_count : 725451
total_constant_partition_count : 725440
average_overlap : 0
average_depth : 1.0
I want to delete 25th month of data and to keep till 24th Month. I am testing delete operation with hardcoded value as of now. The process is monthly.
delete from table_name where date_period = 'Mar 2019'
.
But it is taking more then 4 hrs and at last getting timed out. Why the simple delete is taking time even the field is clustered. How can i improve it.
Note : I am testing query on Cloned data taken from QA to dev.
Upvotes: 0
Views: 626
Reputation: 59175
I think this might be an edge case, and a good opportunity to have support engage with the specifics of your case.
In the meantime I created a 10 billion row clustered table, while trying to reproduce the problem:
create or replace temp table ten_billionv2
cluster by (m)
as
select to_char(dateadd(minute, seq8(), '1990-01-01'), 'mon yyyy') m, seq8() seq, uniform(1, 10000, random()) value, uniform(1, 10000, random()) value2
from table(generator(rowcount=>100000)) a, table(generator(rowcount=>100000)) b;
select *
from ten_billionv2
limit 10;
This is smaller than the table you describe, but it gives us a good opportunity to see if any of the advice on the other answers and comments apply.
create temp table ten_billionv2_clone_2
clone ten_billionv2
--2s
;
delete from ten_billionv2_clone_2
where m='Apr 2021'
--18s
;
delete from ten_billionv2
where m='Apr 2021'
--17s
;
delete from ten_billionv2_clone_2
where m='Apr 2190'
--9s
;
delete from ten_billionv2
where m='Dec 11576'
--25s
;
delete from ten_billionv2_clone_2
where m='Dec 11576'
--28s
;
I'll stop my experiments here - but what I would try next is having way more data per month - maybe the time will scale with the number of partitions involved. What I learned so far is that Snowflake is smart about using the clustering algorithm to only scan certain partitions.
Upvotes: 1
Reputation: 332
Upvotes: 1
Reputation: 727
I recommend making each month a separate table, then union them together in a view. To drop a month, you just change the view SQL to not include that month. Another benefit is the data clustering by month will be automatic, and you can use a different column for clustering within each month table if needed.
create or replace view ALL_MONTHS as
select * from tbl_March2019
union
select * from tbl_April2019
...
Upvotes: 0