Reputation: 20090
In my stream, tuples are never split in any way: exactly one action is eventually performed for each incoming tuple.
I can still fail tuples that run into an exception they might get past when replayed by my KafkaSpout. However, I don't understand how the spout knows which tuple to replay when the tuples are not anchored, yet in testing it seems to replay the right one. Is this expected? Does the KafkaSpout implementation track tuples/messages in some way I'm not aware of? Am I possibly anchoring and not realizing it (my bolts extend BaseRichBolt)? Or am I just mistaken that it replays the correct one?
But if manually failing does work, then I believe the only benefit I get from anchoring is that my tuple will be replayed when it times out -- which I'm not sure is worth the overhead of anchoring.
Am I correct about this? Is there some other significant benefit to anchoring in this case?
Upvotes: 5
Views: 877
Reputation: 62310
BaseRichBolt does not do any anchoring automatically (BaseBasicBolt would do this). Thus, the behavior you describe should only work if you have a simple Spout -> Bolt topology. For deeper topologies, i.e., Spout -> Bolt1 -> Bolt2 with no anchoring in Bolt1, failing tuples in Bolt2 cannot work.
With KafkaSpout, each emitted tuple gets a MessageId assigned, so the fault-tolerance mechanism is activated. Each tuple must therefore be acked in the first Bolt that receives it; otherwise, the tuple eventually times out. Tuples emitted in Bolt1 should be anchored; otherwise, they are not tracked, cannot fail (neither manually in Bolt2 nor by time-out), and cannot be replayed in case of failure.
Thus, anchoring is a pure fault-tolerance mechanism. You should actually always anchor tuples, because anchoring itself does not enable fault tolerance; assigning MessageIds in the Spout does. If a Bolt processes a tuple that does not have an ID assigned, the anchoring call does "nothing", and the overhead of an additional method call is tiny. Therefore, adding anchoring code is usually a good choice, because the Bolt can then be used with or without fault tolerance enabled (depending on whether the Spout assigns message IDs or not). If you omit the anchoring code, fault tolerance breaks in this Bolt and downstream tuples cannot be recovered on failure.
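To make the difference concrete: in a BaseRichBolt, an anchored emit passes the input tuple as the first argument, e.g. collector.emit(input, new Values(...)), while an unanchored emit omits it. The following is only a toy model in plain Java of why that matters, not Storm's actual tracking code. All names in it (AnchoringSketch, ToyTuple, ToyTracker, spoutEmit, emitAnchored, emitUnanchored) are hypothetical, and the simple pending counter is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of tuple tracking. Hypothetical names; NOT Storm's implementation.
class AnchoringSketch {

    // A tuple remembers which spout message it descends from (null = untracked).
    static final class ToyTuple {
        final String payload;
        final String rootMessageId;

        ToyTuple(String payload, String rootMessageId) {
            this.payload = payload;
            this.rootMessageId = rootMessageId;
        }
    }

    // Tracks, per spout message ID, how many descendant tuples are still pending.
    static final class ToyTracker {
        final Map<String, Integer> pending = new HashMap<>();
        final List<String> replayed = new ArrayList<>();
        final List<String> completed = new ArrayList<>();

        // Spout emits with a message ID: tracking starts here.
        ToyTuple spoutEmit(String payload, String messageId) {
            pending.put(messageId, 1);
            return new ToyTuple(payload, messageId);
        }

        // Anchored emit: the new tuple joins the anchor's tree.
        // If the anchor is untracked, this is effectively a no-op.
        ToyTuple emitAnchored(ToyTuple anchor, String payload) {
            String id = anchor.rootMessageId;
            if (id != null && pending.containsKey(id)) {
                pending.merge(id, 1, Integer::sum);
            }
            return new ToyTuple(payload, id);
        }

        // Unanchored emit: the new tuple has no link back to the spout message.
        ToyTuple emitUnanchored(String payload) {
            return new ToyTuple(payload, null);
        }

        void ack(ToyTuple t) {
            String id = t.rootMessageId;
            if (id == null) {
                return; // untracked tuple: nothing to update
            }
            Integer left = pending.computeIfPresent(id, (k, v) -> v - 1);
            if (left != null && left == 0) {
                pending.remove(id);
                completed.add(id); // whole tree processed
            }
        }

        void fail(ToyTuple t) {
            String id = t.rootMessageId;
            if (id == null) {
                return; // untracked tuple: the spout never hears about it
            }
            if (pending.remove(id) != null) {
                replayed.add(id); // spout would replay this message
            }
        }
    }

    public static void main(String[] args) {
        // Bolt1 anchors its output, Bolt2 fails it: the spout message is replayed.
        ToyTracker anchored = new ToyTracker();
        ToyTuple in1 = anchored.spoutEmit("payload", "msg-1");
        ToyTuple out1 = anchored.emitAnchored(in1, "derived");
        anchored.ack(in1);
        anchored.fail(out1);
        System.out.println("anchored, replayed:   " + anchored.replayed);

        // Bolt1 does not anchor: the failure in Bolt2 is invisible to the spout.
        ToyTracker unanchored = new ToyTracker();
        ToyTuple in2 = unanchored.spoutEmit("payload", "msg-2");
        ToyTuple out2 = unanchored.emitUnanchored("derived");
        unanchored.ack(in2);
        unanchored.fail(out2);
        System.out.println("unanchored, replayed: " + unanchored.replayed);
    }
}
```

In the unanchored run, the spout message counts as fully processed as soon as Bolt1 acks its input, so the later failure in Bolt2 has nothing to attach to. This is the Spout -> Bolt1 -> Bolt2 case described above.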
Upvotes: 5