swartchris8
swartchris8

Reputation: 720

How to set processing timeouts in apache beam / Dataflow python batch jobs?

I am currently using the stopit library https://github.com/glenfant/stopit to set per element processing timeouts in batch jobs. These jobs work on the direct runner and I am able to timeout functions that take too long.

What is the beam way of setting a per element process timeout for a batch job?

Is there a way I could set a processing timeout with a trigger for a dataflow batch job?

My use case is extracting named entities from a text. The NER process sometimes takes too long if the document being processed is too long.

It would be nice to get rid of this dependency and move to a beam native solution.

Upvotes: 1

Views: 2310

Answers (1)

deepak sen
deepak sen

Reputation: 507

As per my understanding the answer to the question is timely processing which also maintain state. Lets say you have a function f and get the output for any batch that will use the function. So basically we have to mark the batch(update a state) after we receive a batch and we will set timers that will update the output as per the watermark/expiry timing which can be set, if there is any output we will receive it and if we don't any output from the function as per your query, we can surely re-route that batch.

This is not an exact solution, but can be worked out

For better understanding you can read it here: apache beam documentation enter image description here

Upvotes: 2

Related Questions