Douglas

Reputation: 2734

Do I need Kafka to have a reliable Storm spout?

As I understand things, ZooKeeper will persist tuples emitted by bolts, so that if a bolt crashes (or the computer running the bolt crashes, or the entire cluster crashes), the tuples emitted by the bolt will not be lost. Once everything is restarted, the tuples will be fetched from ZooKeeper, and everything will continue on as if nothing bad ever happened.

What I don't yet understand is whether the same thing is true for spouts. If a spout emits a tuple (i.e., the emit() function within a spout is executed), and the computer the spout is running on crashes shortly thereafter, will that tuple be resurrected by ZooKeeper? Or do we need Kafka in order to guarantee this?

P.S. I understand that the tuple emitted by the spout must be assigned a unique ID in the call to emit().

P.P.S. I see sample code in books that uses something like ConcurrentHashMap<UUID, Values> to track which spouted tuples have not yet been acked. Is this somehow automatically persisted to ZooKeeper? If not, then I shouldn't really be doing that, should I? What should I be doing instead? Using Kafka?
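
For reference, the pattern in the sample code looks roughly like the sketch below (class and method names are mine; newer releases use the org.apache.storm packages, older ones backtype.storm):

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class TrackingSpout extends BaseRichSpout {
        // In-memory only: this map is lost if the worker running the spout crashes.
        private final Map<UUID, Values> pending = new ConcurrentHashMap<>();
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Values values = readNextValues(); // placeholder: however the source is actually read
            if (values == null) {
                return;
            }
            UUID msgId = UUID.randomUUID();
            pending.put(msgId, values);
            collector.emit(values, msgId);    // the message id makes the tuple trackable
        }

        @Override
        public void ack(Object msgId) {
            pending.remove(msgId);            // fully processed; stop tracking it
        }

        @Override
        public void fail(Object msgId) {
            Values values = pending.get(msgId);
            if (values != null) {
                collector.emit(values, msgId); // re-emit the failed tuple
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("value"));
        }

        private Values readNextValues() {
            return null; // placeholder for reading from the real source
        }
    }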

Upvotes: 0

Views: 521

Answers (1)

Douglas

Reputation: 2734

Florian Hussonnois answered my question thoroughly and clearly in this storm-user thread. This was his answer:

Actually, the tuples aren't persisted in ZooKeeper. If your spout emits a tuple with a unique id, it will automatically be tracked internally by Storm (i.e., by the ackers). Thus, if the emitted tuple fails because of a bolt failure, Storm invokes the fail method on the originating spout task with the unique id as its argument.

It's then up to you to re-emit the failed tuple.

In sample code, spouts use a Map to track which tuples have not yet been fully processed by the entire topology, so that they can be re-emitted in case of a bolt failure.

However, if the failure doesn't come from a bolt but from your spout, the in-memory Map will be lost and your topology will not be able to re-emit failed tuples.

For such a scenario you can rely on Kafka. In fact, the Kafka spout stores its read offset in ZooKeeper. That way, if a spout task goes down, it will be able to read its offset from ZooKeeper after it restarts.
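
For completeness, wiring up the Kafka spout from the storm-kafka module looks roughly like the sketch below (the ZooKeeper host, topic, zkRoot, and consumer id are placeholders; older releases use the storm.kafka / backtype.storm packages instead of org.apache.storm):

    import org.apache.storm.kafka.KafkaSpout;
    import org.apache.storm.kafka.SpoutConfig;
    import org.apache.storm.kafka.StringScheme;
    import org.apache.storm.kafka.ZkHosts;
    import org.apache.storm.spout.SchemeAsMultiScheme;
    import org.apache.storm.topology.TopologyBuilder;

    public class KafkaSpoutExample {
        public static void main(String[] args) {
            // ZooKeeper ensemble used by Kafka (placeholder host:port).
            ZkHosts zkHosts = new ZkHosts("zkhost:2181");

            // topic, zkRoot (the ZooKeeper path where the spout stores its offsets)
            // and consumer id are placeholders.
            SpoutConfig spoutConfig = new SpoutConfig(zkHosts, "my-topic", "/kafka-offsets", "my-consumer-id");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            // The KafkaSpout periodically writes the offset it has read up to into ZooKeeper,
            // so a restarted spout task resumes from the last recorded offset.
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
            // ... attach bolts and submit the topology as usual ...
        }
    }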

Upvotes: 1
