Nishat
Nishat

Reputation: 921

Kafka: Is it good practice too keep topic offset in database?

I have started learning kafka. I don't have much idea of live project where kafka is used. Wanted to know if offset can be saved in database apart from committing in broker? I think it should always be saved otherwise some record will be missed or re-processed. Taking an example if offset is not saved in database, when application(consumer) is deployed or restarted during that time if some message is sent to broker at that time, that will be missed as when consumer will be up it will read next onward record or(from start)

Upvotes: 2

Views: 2831

Answers (1)

radai
radai

Reputation: 24192

the short answer to your question is "its complicated" :-)

the long answer to your question is something like:

  1. kafka (without extra configuration and/or careful design of your code) is an at-least-once system (see official documentation). this means that yes, your consumer may see a particular set of records more than once. this wont happen on a graceful shutdown/rebalance, but will definitely happen if your application crashes.
  2. newer versions of kafka support so called "exactly once". this involves configuring your clients differently (and a significant performance and latency hit), and the guarantees only ever hold if all your inputs and outputs are from/to the exact same kafka cluster. so if your consumer does anything like call an external HTTP API or insert into a database in response to seeing a kafka record we are back to at-least-once.
  3. if your outputs go to a transactional system (like a classic ACID database) a common pattern would be to start a transaction, and in that transaction record both your outputs and the consumer offsets (you would also need to change your code to restore from these DB offsets and not the kafka default). this has better guarantees (but still wont help if your code interacts with non-transactional systems, like making an HTTP call)
  4. another common design pattern to overcome at-least-once is to somehow "tag" every operation you do (record you produce, http call you make ...) with some UUID that derives from the original kafka records comsumed to produce this output. this means if your consumer sees the same record again, it will perform the same operations again, and repeat the same "tag" value. this shifts the burden to downstream systems that must now remember (at least for some period of time) all the "tags" they have seen so they could disregard a repeat operation, or somehow design all your operations to be idempotent

Upvotes: 6

Related Questions