Reputation: 20090
In my topology, I need to perform an insert statement for every tuple that passes through it. In order to be nice to my database, I am batching the inserts using the tick tuple pattern.
Posts I see online instruct me to implement the pattern as follows:
- collect tuples in a batch
- flush the batch when a tick tuple occurs (or when the batch grows over a certain size)
- ack all the tuples in the batch
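For concreteness, here is roughly how I understand that pattern (using the org.apache.storm 1.x API; insertBatch() stands in for my actual database call):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.utils.TupleUtils;

public class BatchingInsertBolt extends BaseRichBolt {
    private static final int MAX_BATCH_SIZE = 100;
    private OutputCollector collector;
    private List<Tuple> batch;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.batch = new ArrayList<>();
    }

    @Override
    public void execute(Tuple tuple) {
        if (TupleUtils.isTick(tuple)) {
            flush();                       // flush on every tick tuple
            return;
        }
        batch.add(tuple);                  // collect; acking is deferred to the flush
        if (batch.size() >= MAX_BATCH_SIZE) {
            flush();                       // flush early when the batch grows too large
        }
    }

    private void flush() {
        if (batch.isEmpty()) {
            return;
        }
        insertBatch(batch);                // hypothetical multi-row INSERT
        for (Tuple t : batch) {
            collector.ack(t);              // ack only after the insert succeeded
        }
        batch.clear();
    }

    private void insertBatch(List<Tuple> tuples) {
        // issue one batched INSERT statement here
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 10);  // one tick every 10 seconds
        return conf;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing to declare
    }
}
```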
But why do I want to wait until I flush the batch to ack my tuples? If flushing the batch throws an exception (like a database timeout/error), won't all the tuples in the batch eventually time out and get replayed?
If I ack the tuples prior to batching, and instead batch some object based on the tuple contents, then the tuples will not be replayed. And if flushing my batch fails, the batch is simply not cleared on the exception, and all the messages in it will be inserted again the next time a tick occurs. Wouldn't that work?
Upvotes: 1
Views: 1043
Reputation: 62285
I cannot completely follow your description. However, you should do the following: flush the tuples when a tick tuple arrives or when the batch reaches a certain size.
As a retry pattern, you could, for example, use the next batch that fills up, or the next tick tuple. For this case, you just allow a larger batch size, or try to insert two batches one after the other.
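To illustrate, here is a rough sketch of that retry variant, as a drop-in replacement for the flush() method in the bolt the question sketches (the error handling is only illustrative):

```java
// Replacement flush(): on failure, keep the batch instead of clearing it,
// so the next tick (or the next size-triggered flush) retries these tuples
// together with the newly collected ones, i.e. a temporarily larger batch.
private void flush() {
    if (batch.isEmpty()) {
        return;
    }
    try {
        insertBatch(batch);           // hypothetical database call
        for (Tuple t : batch) {
            collector.ack(t);         // ack only after a successful insert
        }
        batch.clear();
    } catch (Exception e) {
        // Deliberately do NOT clear or fail the batch here; it is retried
        // on the next flush. Consider a retry limit so a poisoned batch
        // cannot grow and be retried forever.
    }
}
```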
If you ack tuples before a successful insertion into the database, you might lose them if the bolt crashes. Once the tuples are acked, Storm allows the spout to drop the source tuples that would be required to recompute the not-yet-inserted tuples, and therefore you cannot recompute them.
As an alternative, you could also fail all tuples from the batch (if inserting is not possible) and let the spout replay the source tuples. This has the advantage that you do not build up larger/multiple batches in your DB-insert bolt. The disadvantage, of course, is that Storm has to process those tuples twice.
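That fail-and-replay alternative would look roughly like this (again only a sketch, not tested code):

```java
// Alternative flush(): on failure, fail every tuple in the batch so the
// spout replays the source tuples; the bolt never holds on to old batches.
private void flush() {
    if (batch.isEmpty()) {
        return;
    }
    try {
        insertBatch(batch);           // hypothetical database call
        for (Tuple t : batch) {
            collector.ack(t);
        }
    } catch (Exception e) {
        for (Tuple t : batch) {
            collector.fail(t);        // triggers replay from the spout
        }
    }
    batch.clear();                    // the batch is done either way
}
```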
Upvotes: 0
Reputation: 2647
"If I ack the tuples prior to batching, and instead batch some object based on the tuple contents, then the tuples will not be replayed."
Yes, you are indeed correct; this is why you should only ack them after the batch succeeds. You do want all messages to be processed, right?
"But why do I want to wait until I flush the batch to ack my tuples? If flushing the batch throws an exception (like a database timeout/error), won't all the tuples in the batch eventually time out and get replayed?"
Yes, the tuples will be replayed after the timeout. However, you should explicitly fail them (or retry the batch) when the batch fails, rather than waiting for the timeout to expire.
Now let me give you an additional piece of advice: you don't want tuples to be replayed, because replays cause a huge performance degradation at the data source. For example, Kafka is very fast because it performs sequential reads; a tuple replay forces Kafka to seek back to the tuple being replayed. Therefore you should retry the batch yourself rather than fail the tuples. The one case where you should give up is a permanent error, for example when there is a not null constraint in the database and your tuple's field is null. In this case you should ack the tuple, because you will never be able to insert this tuple into the database.
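To make that concrete, here is a rough sketch assuming a JDBC-style insert, where insertBatch() is declared to throw SQLException and SQLIntegrityConstraintViolationException stands in for any permanent, non-retriable error:

```java
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

// flush() that separates permanent failures from transient ones.
private void flush() {
    if (batch.isEmpty()) {
        return;
    }
    try {
        insertBatch(batch);       // assumed to declare "throws SQLException"
        ackAndClear();            // normal case: the insert succeeded
    } catch (SQLIntegrityConstraintViolationException e) {
        // Permanent error (e.g. a NOT NULL constraint and a null field):
        // a replay can never succeed, so log the rows and ack them anyway
        // to keep them from being replayed forever.
        ackAndClear();
    } catch (SQLException e) {
        // Transient error (e.g. a timeout): keep the batch and retry on
        // the next tick, so the source (e.g. Kafka) never has to seek
        // back and replay.
    }
}

private void ackAndClear() {
    for (Tuple t : batch) {
        collector.ack(t);
    }
    batch.clear();
}
```

Upvotes: 1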