How to ensure exactly once semantics while processing kafka messages in Apache Storm

Question

I needed exactly once delivery in my app. I explored kafka and realised that to have message produced exactly once, I have to set idempotence=true in producer config. This also sets acks=all, making producer resend messages till all replicas have committed it. To ensure that consumer does not do duplicate processing or leave any message unprocessed, it is advised to commit the processing output and offset to external database in same database transaction, so that either both of them will be persisted or none avoiding duplicate and no processing.

In consumer, message is left processed if consumer first commits it but fails before processing it and message is processed more than once if consumers first processes it but fails before committing it.

Q1. Now I was guessing how can I imitate the same with Apache Storm. I guess exactly once production of message can be ensured by setting idemptence=true in KafkaBolt. Am I right?

I was guessing how I can ensure missed and duplicate message processing in Storm. For example, this doc page says if I anchor a tuple (by passing it as first parameter to OutputCollector.emit()) and then pass the tuple to OutputCollector.ack() or OutputCollector.fail(), Storm will ensure data loss. This is what it exactly says:

Now that you understand the reliability algorithm, let's go over all the failure cases and see how in each case Storm avoids data loss:

A tuple isn't acked because the task died: In this case the spout tuple ids at the root of the trees for the failed tuple will time out and be replayed.

Acker task dies: In this case all the spout tuples the acker was tracking will time out and be replayed.

Spout task dies: In this case the source that the spout talks to is responsible for replaying the messages. For example, queues like Kestrel and RabbitMQ will place all pending messages back on the queue when a client disconnects.

Q2. I guess this ensures that message is not left unprocessed, but does not avoid duplicate processing of messages. Am I correct with this? Also is there anything else that Storm offers to ensure exactly once semantics like kafka that I am missing?

How to ensure exactly once semantics while processing kafka messages in Apache Storm

Answers (1)

Related Questions