ab11

Reputation: 20090

Storm: When batching with tick tuples, why wait to ack tuples?

In my topology, I need to perform an insert statement for every tuple that passes through it. In order to be nice to my database, I am batching the inserts using the tick tuple pattern.

Posts I see online suggest implementing the pattern as follows (see the sketch after the list):

- collect tuples in a batch
- flush the batch when a tick occurs (or when the batch grows past a certain size)
- ack all the tuples in the batch
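
For reference, here is a minimal sketch of that pattern against the Storm 1.x API; insertBatch() is a hypothetical stand-in for the actual database code:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.storm.Config;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.utils.TupleUtils;

    public class BatchingInsertBolt extends BaseRichBolt {
        private OutputCollector collector;
        private List<Tuple> batch;

        @Override
        public Map<String, Object> getComponentConfiguration() {
            // ask Storm to deliver a tick tuple to this bolt every 10 seconds
            Map<String, Object> conf = new HashMap<>();
            conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);
            return conf;
        }

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.batch = new ArrayList<>();
        }

        @Override
        public void execute(Tuple tuple) {
            if (TupleUtils.isTick(tuple)) {
                insertBatch(batch);             // flush: hypothetical DB insert
                for (Tuple t : batch) {
                    collector.ack(t);           // ack only after the flush
                }
                batch.clear();
            } else {
                batch.add(tuple);               // collect; do NOT ack yet
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // nothing emitted downstream
        }

        private void insertBatch(List<Tuple> tuples) {
            // hypothetical: insert all tuples in one database transaction
        }
    }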

But why do I want to wait until I flush the batch to ack my tuples? If flushing the batch throws an exception (like a database timeout/error), won't all the tuples in the batch eventually time out and get replayed?

If I ack the tuples prior to batching, and instead batch some object built from the tuple contents, then the tuples will not be replayed. And if flushing my batch fails, the batch will not be cleared on the exception, so all the messages in it will be inserted again the next time a tick occurs, right?

Upvotes: 1

Views: 1043

Answers (2)

Matthias J. Sax

Reputation: 62285

I cannot follow your description completely. However, you should do the following:

  1. collect tuples in a batch
  2. flush the tuples (on tick, or when the batch reaches a certain size)

    • on a successful insert transaction, ack all tuples of the batch
    • on insert failure, do not ack (and try to insert again later, until the insert is successful)

As a retry pattern, you could, for example, use the next batch that fills up or the next tick tuple. In this case, you just allow a larger batch size, or you insert two batches one after another.
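
A minimal sketch of this flush-then-retry logic, reusing the bolt sketched in the question (tryInsert() and MAX_BATCH_SIZE are assumed stand-ins; tryInsert() is meant to return true only when the insert transaction committed):

    private static final int MAX_BATCH_SIZE = 100;  // assumed limit, tune as needed

    @Override
    public void execute(Tuple tuple) {
        if (!TupleUtils.isTick(tuple)) {
            batch.add(tuple);                   // collect; acking is deferred
            if (batch.size() < MAX_BATCH_SIZE) {
                return;                         // flush only on tick or full batch
            }
        }
        if (batch.isEmpty()) {
            return;                             // nothing to flush on this tick
        }
        if (tryInsert(batch)) {                 // one insert transaction (assumed helper)
            for (Tuple t : batch) {
                collector.ack(t);               // ack only after a successful insert
            }
            batch.clear();
        }
        // on failure: no ack, keep the batch, and retry on the next tick
        // or when the next batch fills up (the batch simply grows larger)
    }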

If you acked tuples before a successful insertion into the database, you might lose them if the bolt crashes. Once tuples are acked, Storm allows the spout to drop the source tuples that would be required to recompute the not-yet-inserted tuples; therefore, you could not recompute them.

As an alternative, you could also fail all tuples of the batch (if inserting is not possible) and trigger the spout to replay the source tuples. This has the advantage that you do not build up larger/multiple batches in your DB-insert bolt. The disadvantage, of course, is that Storm has to process those tuples twice.
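
The flush step of that alternative could look like this (again using the assumed tryInsert() helper):

    if (tryInsert(batch)) {
        for (Tuple t : batch) {
            collector.ack(t);       // success: done with these tuples
        }
    } else {
        for (Tuple t : batch) {
            collector.fail(t);      // failure: the spout replays the source tuples
        }
    }
    batch.clear();                  // either way, the bolt keeps no backlog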

Upvotes: 0

SQL.injection

Reputation: 2647

If I ack the tuples prior to batching, and instead batch some Object based on the tuple contents, then the tuples will not be replayed.

Yes, you are indeed correct; this is why you should only ack them after the batch succeeds. You do want all messages to be processed, right?

But, why do I want to wait until I flush the batch to ack my tuples? What if flushing the batch has an exception (like a database timeout/error), won't all the tuples in the batch eventually timeout and get replayed?

Yes, the tuples will be replayed after the timeout. However, you should fail them (or retry the batch) if the batch fails.


Now let me give you an additional piece of advice: you don't want tuples to be replayed, because replays cause a huge performance degradation on the data source. For example, Kafka is very fast because it performs sequential reads; replaying a tuple forces Kafka to seek back to that tuple. Therefore you should do the following (a sketch combining these points follows the list):

  1. If the batch fails, inspect whether the tuples can actually be inserted into the database. For example, you might have a NOT NULL constraint in the database while the tuple's field is null. In this case you should ack the tuple, because you will never be able to insert it into the database.
  2. Retry inserting the tuples before failing them.
  3. Fail the tuples instead of letting them time out. It is not good practice to wait for tuples to time out; fail them instead. In the Storm UI you can see on which bolt tuples are failing, but you cannot see on which bolt they are timing out.
  4. Log tuple failures, because if a tuple cannot be inserted (remember, for example, the NOT NULL constraint) you want to know about it and change your code to handle that situation (e.g. advice 1, but there are others).
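
One way to combine these four points in the flush step of the batching bolt; isUnrecoverable() and tryInsertWithRetries() are assumed helpers (not Storm API), and LOG is an SLF4J logger (imports: java.util.Iterator, org.slf4j.Logger, org.slf4j.LoggerFactory):

    private static final Logger LOG = LoggerFactory.getLogger(BatchingInsertBolt.class);

    private void flushBatch() {
        // (1) drop tuples that can never succeed, e.g. a null field hitting a
        //     NOT NULL column; ack them so they are not replayed forever
        Iterator<Tuple> it = batch.iterator();
        while (it.hasNext()) {
            Tuple t = it.next();
            if (isUnrecoverable(t)) {           // assumed validation helper
                LOG.warn("Dropping tuple that can never be inserted: {}", t);  // (4)
                collector.ack(t);
                it.remove();
            }
        }
        if (batch.isEmpty()) {
            return;
        }
        if (tryInsertWithRetries(batch, 3)) {   // (2) retry a few times before giving up
            for (Tuple t : batch) {
                collector.ack(t);
            }
        } else {
            for (Tuple t : batch) {
                LOG.error("Insert failed after retries, failing tuple: {}", t);  // (4)
                collector.fail(t);              // (3) fail explicitly instead of timing out
            }
        }
        batch.clear();
    }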

Upvotes: 1
