Reputation: 9765
A common use case in data processing is deduplication and aggregation/rollups. ClickHouse supports both (ReplacingMergeTree for deduplication, SummingMergeTree for aggregation). We are struggling to put the two together: Materialized Views cannot be used to move the data from the deduplicated table to the rollup table, because they trigger on insert, which happens before the deduplication (see the note here).
Is there a way to achieve deduplication first and then do a rollup in ClickHouse?
Approaches we have been thinking of:
1. Doing the deduplication on insert (e.g. a Materialized View which reads from Kafka). The already deduplicated data would be written to a SummingMergeTree table which then does the rollup. The deduplication could be done using standard SQL techniques such as `GROUP BY`, `DISTINCT`, or a window function with `row_number()` and filtering by `rownum = 1` afterwards. The downside of this approach is that deduplication is only applied within the blocks read from Kafka, but not across them. The deduplication window is not adjustable.
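A minimal sketch of this first approach, with `GROUP BY` as the deduplication technique; all table names, columns, and Kafka settings here are invented for illustration:

```sql
-- Hypothetical source: a Kafka engine table (broker/topic/group/format are placeholders).
CREATE TABLE events_kafka
(
    id String,
    user_id String,
    amount UInt64
)
ENGINE = Kafka('broker:9092', 'events', 'clickhouse-group', 'JSONEachRow');

-- Rollup target: SummingMergeTree collapses rows with the same key by summing.
CREATE TABLE events_rollup
(
    user_id String,
    amount UInt64
)
ENGINE = SummingMergeTree
ORDER BY user_id;

-- The MV deduplicates by id within each consumed block (any(...) picks one
-- row per id), then pre-aggregates per user_id before insertion. Duplicates
-- arriving in different Kafka blocks are NOT caught -- that is the downside
-- described above.
CREATE MATERIALIZED VIEW events_dedup_mv TO events_rollup AS
SELECT user_id, sum(amount) AS amount
FROM
(
    SELECT id, any(user_id) AS user_id, any(amount) AS amount
    FROM events_kafka
    GROUP BY id
)
GROUP BY user_id;
```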
2. Use a ReplacingMergeTree table, letting ClickHouse do the deduplication, but additionally run an external, periodic scheduler to move the data into a SummingMergeTree table. "Moving" would be an `INSERT INTO ... SELECT` statement using `FINAL` (I know, it shouldn't be used) or some other SQL deduplication technique as outlined above.
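A sketch of this second approach, using the same invented schema; the `INSERT INTO ... SELECT` would be fired periodically by an external scheduler (cron, Airflow, etc.):

```sql
-- Deduplication layer: ReplacingMergeTree keeps one row per ORDER BY key
-- after background merges.
CREATE TABLE events_raw
(
    id String,
    user_id String,
    amount UInt64,
    inserted_at DateTime DEFAULT now()
)
ENGINE = ReplacingMergeTree
ORDER BY id;

CREATE TABLE events_rollup
(
    user_id String,
    amount UInt64
)
ENGINE = SummingMergeTree
ORDER BY user_id;

-- Periodic "move" step. FINAL forces deduplication at read time even for
-- parts not yet merged. Restricting to a time window bounds the scan, but
-- the window must be wide enough to cover how far apart duplicates of the
-- same id can arrive, otherwise they are double-counted in the rollup.
INSERT INTO events_rollup
SELECT user_id, sum(amount) AS amount
FROM events_raw FINAL
WHERE inserted_at >= now() - INTERVAL 1 HOUR
GROUP BY user_id;
```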
In all the documentation, blog posts and YouTube videos I have read and watched so far, I haven't found a recommended (ideally ClickHouse-only) way to first deduplicate a Kafka stream by id and then perform an aggregation on the data.
Upvotes: 2
Views: 671
Reputation: 86
Additionally, for your option two, you will in the future be able to use Refreshable Materialized Views (not merged yet) for your scheduling: https://github.com/ClickHouse/ClickHouse/issues/33919
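If that feature lands, the external scheduler in option two could collapse into a single statement. The sketch below follows the syntax proposed in the linked issue and is speculative; table and column names are invented:

```sql
-- Speculative: a materialized view that re-runs its SELECT on a schedule
-- instead of triggering per insert, so FINAL sees the already-merged data.
CREATE MATERIALIZED VIEW events_rollup_refresh
REFRESH EVERY 1 HOUR
TO events_rollup AS
SELECT user_id, sum(amount) AS amount
FROM events_raw FINAL
GROUP BY user_id;
```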
Upvotes: 1
Reputation: 522
If the duplicates occur within a batch, you could try making your target table for the inserts a ReplacingMergeTree. An MV would then trigger on this table and use `FINAL` to insert into a SummingMergeTree. I believe this is your option 1 - your deduplication window is as large as the insert block size.
Upvotes: 0