ab11

Reputation: 20090

Storm: When batching with tick tuples, why wait to ack tuples?

In my topology, I need to perform an insert statement for every tuple that passes through it. In order to be nice to my database, I am batching the inserts using the tick tuple pattern.

Posts I see online suggest implementing the pattern as follows (see the sketch after the list):

- collect tuples in a batch
- flush the batch when a tick occurs (or when the batch grows past a certain size)
- ack all the tuples in the batch
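
For reference, here is a minimal sketch of that pattern against the Storm 1.x API; insertBatch() is a hypothetical stand-in for the actual database code:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.storm.Config;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.utils.TupleUtils;

    public class BatchingInsertBolt extends BaseRichBolt {
        private OutputCollector collector;
        private List<Tuple> batch;

        @Override
        public Map<String, Object> getComponentConfiguration() {
            // ask Storm to deliver a tick tuple to this bolt every 10 seconds
            Map<String, Object> conf = new HashMap<>();
            conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);
            return conf;
        }

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.batch = new ArrayList<>();
        }

        @Override
        public void execute(Tuple tuple) {
            if (TupleUtils.isTick(tuple)) {
                insertBatch(batch);             // flush: hypothetical DB insert
                for (Tuple t : batch) {
                    collector.ack(t);           // ack only after the flush
                }
                batch.clear();
            } else {
                batch.add(tuple);               // collect; do NOT ack yet
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // nothing emitted downstream
        }

        private void insertBatch(List<Tuple> tuples) {
            // hypothetical: insert all tuples in one database transaction
        }
    }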

But why do I want to wait until I flush the batch to ack my tuples? If flushing the batch throws an exception (like a database timeout/error), won't all the tuples in the batch eventually time out and get replayed?

If I ack the tuples prior to batching, and instead batch some object built from the tuple contents, then the tuples will not be replayed. And if flushing my batch fails, the batch will not be cleared on the exception, so all the messages in it will be inserted again the next time a tick occurs, right?

Upvotes: 1

Views: 1043

Answers (2)

Matthias J. Sax

Reputation: 62285

I cannot follow your description completely. However, you should do the following:

  1. collect tuples in a batch
  2. flush the tuples (on tick, or when the batch reaches a certain size)

    • on a successful insert transaction, ack all tuples of the batch
    • on insert failure, do not ack (and try to insert again later, until the insert is successful)

As a retry pattern, you could, for example, use the next batch that fills up or the next tick tuple. In this case, you just allow a larger batch size, or you insert two batches one after another.
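
A minimal sketch of this flush-then-retry logic, reusing the bolt sketched in the question (tryInsert() and MAX_BATCH_SIZE are assumed stand-ins; tryInsert() is meant to return true only when the insert transaction committed):

    private static final int MAX_BATCH_SIZE = 100;  // assumed limit, tune as needed

    @Override
    public void execute(Tuple tuple) {
        if (!TupleUtils.isTick(tuple)) {
            batch.add(tuple);                   // collect; acking is deferred
            if (batch.size() < MAX_BATCH_SIZE) {
                return;                         // flush only on tick or full batch
            }
        }
        if (batch.isEmpty()) {
            return;                             // nothing to flush on this tick
        }
        if (tryInsert(batch)) {                 // one insert transaction (assumed helper)
            for (Tuple t : batch) {
                collector.ack(t);               // ack only after a successful insert
            }
            batch.clear();
        }
        // on failure: no ack, keep the batch, and retry on the next tick
        // or when the next batch fills up (the batch simply grows larger)
    }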

If you acked tuples before a successful insertion into the database, you might lose them if the bolt crashes. Once tuples are acked, Storm allows the spout to drop the source tuples that would be required to recompute the not-yet-inserted tuples; therefore, you could not recompute them.

As an alternative, you could also fail all tuples of the batch (if inserting is not possible) and trigger the spout to replay the source tuples. This has the advantage that you do not build up larger/multiple batches in your DB-insert bolt. The disadvantage, of course, is that Storm has to process those tuples twice.
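
The flush step of that alternative could look like this (again using the assumed tryInsert() helper):

    if (tryInsert(batch)) {
        for (Tuple t : batch) {
            collector.ack(t);       // success: done with these tuples
        }
    } else {
        for (Tuple t : batch) {
            collector.fail(t);      // failure: the spout replays the source tuples
        }
    }
    batch.clear();                  // either way, the bolt keeps no backlog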

Upvotes: 0

SQL.injection

Reputation: 2647

If I ack the tuples prior to batching, and instead batch some Object based on the tuple contents, then the tuples will not be replayed.

Yes, you are indeed correct; this is why you should only ack them after the batch succeeds. You do want all messages to be processed, right?

But, why do I want to wait until I flush the batch to ack my tuples? What if flushing the batch has an exception (like a database timeout/error), won't all the tuples in the batch eventually timeout and get replayed?

Yes, the tuples will be replayed after the timeout. However, you should fail them (or retry the batch) if the batch fails.


Now let me give you an additional piece of advice: you don't want tuples to be replayed, because replays cause a huge performance degradation on the data source. For example, Kafka is very fast because it performs sequential reads; replaying a tuple forces Kafka to seek back to that tuple. Therefore you should do the following (a sketch combining these points follows the list):

  1. If the batch fails, inspect whether the tuples can actually be inserted into the database. For example, you might have a NOT NULL constraint in the database while the tuple's field is null. In this case you should ack the tuple, because you will never be able to insert it into the database.
  2. Retry inserting the tuples before failing them.
  3. Fail the tuples instead of letting them time out. It is not good practice to wait for tuples to time out; fail them instead. In the Storm UI you can see on which bolt tuples are failing, but you cannot see on which bolt they are timing out.
  4. Log tuple failures, because if a tuple cannot be inserted (remember, for example, the NOT NULL constraint) you want to know about it and change your code to handle that situation (e.g. advice 1, but there are others).
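
One way to combine these four points in the flush step of the batching bolt; isUnrecoverable() and tryInsertWithRetries() are assumed helpers (not Storm API), and LOG is an SLF4J logger (imports: java.util.Iterator, org.slf4j.Logger, org.slf4j.LoggerFactory):

    private static final Logger LOG = LoggerFactory.getLogger(BatchingInsertBolt.class);

    private void flushBatch() {
        // (1) drop tuples that can never succeed, e.g. a null field hitting a
        //     NOT NULL column; ack them so they are not replayed forever
        Iterator<Tuple> it = batch.iterator();
        while (it.hasNext()) {
            Tuple t = it.next();
            if (isUnrecoverable(t)) {           // assumed validation helper
                LOG.warn("Dropping tuple that can never be inserted: {}", t);  // (4)
                collector.ack(t);
                it.remove();
            }
        }
        if (batch.isEmpty()) {
            return;
        }
        if (tryInsertWithRetries(batch, 3)) {   // (2) retry a few times before giving up
            for (Tuple t : batch) {
                collector.ack(t);
            }
        } else {
            for (Tuple t : batch) {
                LOG.error("Insert failed after retries, failing tuple: {}", t);  // (4)
                collector.fail(t);              // (3) fail explicitly instead of timing out
            }
        }
        batch.clear();
    }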

Upvotes: 1
