Atul Bhatia

Reputation: 1795

Confused about Kafka exactly-once semantics

So I've been reading about Kafka's exactly-once semantics, and I'm a bit confused about how it works.

I understand how the producer avoids sending duplicate messages (in case the ack from the broker is lost), but what I don't understand is how exactly-once works in the scenario where the consumer processes the message but then crashes before committing the offset. Won't Kafka redeliver the message in that scenario?

Upvotes: 4

Views: 1979

Answers (2)

Yannick

Reputation: 1418

radai explained it well in their answer regarding exactly-once within an isolated Kafka cluster.

When dealing with an external database (a transactional one, at least), one easy way to achieve exactly-once is to UPDATE a single row, within one DBMS transaction, with your business value AND the partition/offset it came from. That way, if your consumer crashes before committing to Kafka, you'll be able to recover the last Kafka offset it processed (by using consumer.seek()).

This can add quite a bit of data overhead in your DBMS (keeping the offset/partition for all your rows), but you might be able to optimize it a bit.
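Here's a minimal sketch of that pattern in Java. It assumes PostgreSQL (for the ON CONFLICT upsert syntax), a single manually-assigned partition, and the kafka-clients consumer API; the results table, its columns, the topic name, and the connection details are all hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class DbBackedConsumer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "db-backed-demo");
            props.put("enable.auto.commit", "false"); // offsets live in the database, not in Kafka
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/app", "app", "secret");
            db.setAutoCommit(false);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("orders", 0);
                consumer.assign(Collections.singletonList(tp));

                // On (re)start, resume from the highest offset stored alongside the data.
                try (PreparedStatement ps = db.prepareStatement(
                        "SELECT MAX(kafka_offset) FROM results WHERE kafka_partition = ?")) {
                    ps.setInt(1, tp.partition());
                    ResultSet rs = ps.executeQuery();
                    if (rs.next() && rs.getObject(1) != null) {
                        consumer.seek(tp, rs.getLong(1) + 1); // first unprocessed offset
                    }
                }

                while (true) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                        // The business value and the partition/offset it came from land in the
                        // same row, in the same DB transaction: they commit or roll back together.
                        try (PreparedStatement ps = db.prepareStatement(
                                "INSERT INTO results (id, value, kafka_partition, kafka_offset) "
                                        + "VALUES (?, ?, ?, ?) ON CONFLICT (id) DO UPDATE SET "
                                        + "value = EXCLUDED.value, kafka_partition = EXCLUDED.kafka_partition, "
                                        + "kafka_offset = EXCLUDED.kafka_offset")) {
                            ps.setString(1, rec.key());
                            ps.setString(2, rec.value());
                            ps.setInt(3, rec.partition());
                            ps.setLong(4, rec.offset());
                            ps.executeUpdate();
                        }
                        db.commit();
                    }
                }
            }
        }
    }

The point is that a crash can land on either side of db.commit() but never in between: if it dies before, the row and the offset are both rolled back and the record is reprocessed; if it dies after, the startup seek() skips past it.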

Yannick

Upvotes: 1

radai

Reputation: 24202

Here's what I think you mean:

  1. consumer X sees record Y, and "acts" on it, yet does not commit its offset
  2. consumer X crashes (still without committing its offsets)
  3. consumer X boots back up, is re-assigned the same partition (not guaranteed) and eventually sees record Y again

This is totally possible. However, for Kafka exactly-once to "work", all of your side effects (state, output) must also go into the same Kafka cluster. So here's what's going to happen (there's a code sketch after the list):

  1. consumer X starts a transaction
  2. consumer X sees record Y, emits some output record Z (as part of the transaction started in 1)
  3. consumer X crashes. shortly after, the broker acting as the transaction coordinator "rolls back" (I'm simplifying) the transaction started in 1, meaning no other Kafka consumer reading with isolation.level=read_committed will ever see record Z
  4. consumer X boots back up, is assigned the same partition(s) as before, starts a new transaction
  5. consumer X sees record Y again, emits record Z2 (as part of the transaction started in 4)
  6. some time later consumer X commits its offsets (as part of the transaction from 4) and then commits that transaction
  7. record Z2 becomes visible to downstream consumers.
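To make those steps concrete, here's a minimal consume-transform-produce sketch using the Java client's transactional API (the groupMetadata() overload of sendOffsetsToTransaction needs Kafka 2.5+); the topic names, the transactional.id, and transform() are hypothetical stand-ins:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.TopicPartition;

    public class ExactlyOncePipeline {
        public static void main(String[] args) {
            Properties cp = new Properties();
            cp.put("bootstrap.servers", "localhost:9092");
            cp.put("group.id", "eos-demo");
            cp.put("enable.auto.commit", "false");       // offsets are committed inside the transaction
            cp.put("isolation.level", "read_committed"); // never read records from aborted transactions
            cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Properties pp = new Properties();
            pp.put("bootstrap.servers", "localhost:9092");
            pp.put("transactional.id", "eos-demo-1"); // stable id so a restart fences off "zombie" instances
            pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
                consumer.subscribe(Collections.singletonList("input"));
                producer.initTransactions(); // also aborts any transaction a crashed predecessor left open

                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    if (records.isEmpty()) continue;

                    producer.beginTransaction();
                    try {
                        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                        for (ConsumerRecord<String, String> rec : records) {
                            // see record Y, emit record Z as part of the open transaction
                            producer.send(new ProducerRecord<>("output", rec.key(), transform(rec.value())));
                            offsets.put(new TopicPartition(rec.topic(), rec.partition()),
                                        new OffsetAndMetadata(rec.offset() + 1));
                        }
                        // the consumed offsets ride in the same transaction as the output records
                        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                        producer.commitTransaction(); // only now does Z become visible downstream
                    } catch (KafkaException e) {
                        // nothing produced in this transaction is ever seen by read_committed consumers
                        // (fencing/fatal errors would require closing the producer instead; omitted here)
                        producer.abortTransaction();
                    }
                }
            }
        }

        private static String transform(String v) {
            return v.toUpperCase(); // stand-in for real processing
        }
    }

Note the read_committed isolation level on the downstream side: without it, consumers would still see records from aborted transactions.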

If you have side effects outside of that same Kafka cluster (say, instead of record Z you insert a row into MySQL), there's no general way to make Kafka exactly-once work for you. You'd need to rely on old-school dedup and idempotence, which is essentially the pattern sketched in Yannick's answer above.

Upvotes: 7
