ClickHouse deduplication using idempotency_id column

Question

I have a Clickhouse table like this. We ingest rows data where the unique field is idempotency_id.

This is a sample. We can't use other columns for uniquness (like max(date) o max(clicks).

I cannot use common deduplication strategies like ReplacingMergeTree(idempotency_id) or CollapsingMergeTree(idempotency_id) because idempotency_id is a string.

At the moment I remove duplicate at query-time using PARTITION BY ROW_NUMER but it consume to memory and query stopped.

sparkle · Accepted Answer

SOLVED.

CREATE TABLE my_table
(
    `idempotency_id` UUID,
    
     ...other fields    

    version UInt32 DEFAULT now() 
)
ENGINE = ReplacingMergeTree
PRIMARY KEY (idempotency_id)
ORDER BY (idempotency_id, ....)

Then when I want to be sure to remove duplicates I run

OPTIMIZE TABLE my_table

ClickHouse deduplication using idempotency_id column

Answers (1)

Related Questions