sparkle
sparkle

Reputation: 7398

ClickHouse deduplication using idempotency_id column

I have a Clickhouse table like this. We ingest rows data where the unique field is idempotency_id.

This is a sample. We can't use other columns for uniquness (like max(date) o max(clicks).

2023-12-13 20:00:00  97  5c77ed31-90ba-4b2b-b8a7-b5d22b8eb662
2023-12-18 21:00:00 92  83b13098-2afb-434c-89ff-1b5e0c58bdf0
2023-12-12 20:00:00 76  217f0432-081d-4b40-9a22-8cd6460943ab
2023-12-13 15:00:00 74  d7eaccd4-e489-42ea-ae22-2bbadc1785eb
2023-12-18 20:00:00 72  4fb98924-c5d1-4b07-9d1e-a371f866dc43

I cannot use common deduplication strategies like ReplacingMergeTree(idempotency_id) or CollapsingMergeTree(idempotency_id) because idempotency_id is a string.

At the moment I remove duplicate at query-time using PARTITION BY ROW_NUMER but it consume to memory and query stopped.

Upvotes: 0

Views: 365

Answers (1)

sparkle
sparkle

Reputation: 7398

SOLVED.

CREATE TABLE my_table
(
    `idempotency_id` UUID,
    
     ...other fields    

    version UInt32 DEFAULT now() 
)
ENGINE = ReplacingMergeTree
PRIMARY KEY (idempotency_id)
ORDER BY (idempotency_id, ....)

Then when I want to be sure to remove duplicates I run

OPTIMIZE TABLE my_table

Upvotes: 0

Related Questions