JDev

Reputation: 1822

Druid: how to drop duplicates in the Kafka indexing service

I am using Druid with the Kafka indexing service. I am trying to understand how it handles duplicate messages.

Example

Consider that I have the following message in a Kafka topic [1 partition only]:

[Offset=100]

{
  "ID":4,
  "POINTS":1005,
  "CREATED_AT":1616258354000000,
  "UPDATED_AT":1616304119000000
}

Now consider that after 24 hours, the same message is somehow pushed to the topic again.

[Offset=101]

{
  "ID":4,
  "POINTS":1005,
  "CREATED_AT":1616258354000000,
  "UPDATED_AT":1616304119000000
}

Note: The payload has not changed.

Actual: In Druid, I now see the same message again.

Expected: Since the payload has not changed, I expect the message to be ignored.

My timestamp column is CREATED_AT.

Upvotes: 0

Views: 1004

Answers (1)

William Nelson

Reputation: 675

Can you be sure that there will never be two unique events with the same timestamp other than duplicates? If so, you can try using rollup to eliminate the duplicates.

You can set that in the granularitySpec. The queryGranularity truncates all timestamps to that granularity, and if ALL dimensions are then identical, the rows get combined using the aggregation functions you set in the spec.
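For example, a minimal granularitySpec sketch (the DAY and SECOND values here are placeholders, not recommendations; pick what fits your data):

{
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "SECOND",
  "rollup": true
}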

For the aggregation functions, you will want something like MAX or MIN, because SUM would add the duplicates together.
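For your example payload, the metricsSpec could look something like this (treating POINTS and UPDATED_AT as metrics rather than dimensions is an assumption on my part; with longMax, two identical rows roll up to the same value):

"metricsSpec": [
  { "type": "longMax", "name": "POINTS", "fieldName": "POINTS" },
  { "type": "longMax", "name": "UPDATED_AT", "fieldName": "UPDATED_AT" }
]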

This will fail if you have multiple Kafka partitions, since duplicates consumed by different tasks can land in different segments and are not combined at ingestion time, but that could be fixed later with reindexing.

Upvotes: 2
