Reputation: 9
Lets say we're using this table:
create table table1 (
ingestion_time DateTime,
ingestion_day Date,
dim1 String,
met1 double, ...
)
engine=ReplicatedMergeTree(...),
partition=(ingestion_day)
order by = (...);
We have a usecase where we need to create new part every 15 mins and then for a running window (lets call it update window), possibly delete and recreate certain 15 minute parts.
For example if the update window is 15 days, i can on 2020-10-20 can choose to delete the part for 2020-10-15 23:45 and re ingest it.
If i partition by 15 minute interval, it would quickly lead to too many partition problem and so i am looking for a way to have fine grained control over parts merging in MergeTree where i can keep the 15 minute parts intact in the update window maybe manually call to merge the older parts into daily partitions.
Any suggestions on how to achieve this? I couldn't find any settings that let me control parts merging, though there are settings that i can use for retention of older parts that have merged.
Upvotes: 0
Views: 1291
Reputation: 15226
Merging is not manageable, it occurs in the background at an unknown time, so you can’t plan for it. There is OPTIMIZE-statement to initialize an unscheduled merge that finishes eventually.
Theoretically, it can be defined the arbitrary partition size, including 15 minutes. But it can be a bad way, it depends on the concrete case.
..
PARTITION BY toStartOfFifteenMinutes(ingestion_time)
..
to avoid the error "Too many partitions for single INSERT block" need to increase the limit of parameter max_partitions_per_insert_block.
And remove the outdated partition(s) by call DROP PARTITION.
As an alternate way can be considered using mutations to delete outdated rows, but it is usually an ineffective and not recommended way.
Look at ReplicatedReplacingMergeTree or ReplicatedCollapsingMergeTree to deduplicate/collapse data by CH means.
A retention policy can be defined on a column or table level by TTL-clause.
Upvotes: 1