Arnold Galovics

Reputation: 3416

Spring Batch - Resume already started job when master fails

I haven't found an answer to this so far, so I'm giving it a try here.

Let's assume a Spring Batch application with remote partitioning. There's one master/manager application partitioning the dataset and sending it to Kafka (to multiple partitions) and worker nodes are consuming from different Kafka partitions to run in parallel. So far so good.

The question is: what happens if the manager application suddenly crashes while the workers are still processing the data?

The obvious answer is that the partitioning job execution will stay in the STARTED state even though the respective worker step executions have the state COMPLETED.

How can I restart the master node without doing the partitioning again and triggering the workers? The only thing I'd want in this case is to mark that particular job execution as COMPLETED since all the worker steps have completed.

I tried restarting the job through the JobOperator interface, but it fails because the job execution is in the STARTED state rather than FAILED:

Caused by: org.springframework.batch.core.UnexpectedJobExecutionException: Illegal state (only happens on a race condition): job execution already running with name=partitioningJob and parameters={}

Any suggestions are welcome. Thanks!

Upvotes: 2

Views: 593

Answers (1)

Mahmoud Ben Hassine

Reputation: 31600

You can change the status of the job execution from STARTED to FAILED and set its END_TIME to a non-null value before restarting the same job instance. You might need to do the same for the step execution. On restart, the manager should notice that all workers have completed and will then complete the execution.
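As a rough illustration of the metadata fix described above, here is a minimal sketch using plain JDBC. It assumes the default Spring Batch schema with the `BATCH_` table prefix and that the manager process is really dead, so nothing else is writing to these rows. `StaleExecutionFixer` and `markFailed` are hypothetical names, not part of Spring Batch. Only rows still in STARTED are touched, so worker step executions that are already COMPLETED stay as they are.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

// Hypothetical helper (not part of Spring Batch). Assumes the default
// metadata schema and the default "BATCH_" table prefix.
final class StaleExecutionFixer {

    // Marks the stuck job execution as FAILED and gives it a non-null END_TIME.
    static final String JOB_SQL =
        "UPDATE BATCH_JOB_EXECUTION "
      + "SET STATUS = 'FAILED', EXIT_CODE = 'FAILED', END_TIME = ?, LAST_UPDATED = ? "
      + "WHERE JOB_EXECUTION_ID = ? AND STATUS = 'STARTED'";

    // Same for step executions of that job execution that are still STARTED
    // (typically the manager step); COMPLETED worker steps are not matched.
    static final String STEP_SQL =
        "UPDATE BATCH_STEP_EXECUTION "
      + "SET STATUS = 'FAILED', EXIT_CODE = 'FAILED', END_TIME = ?, LAST_UPDATED = ? "
      + "WHERE JOB_EXECUTION_ID = ? AND STATUS = 'STARTED'";

    private StaleExecutionFixer() {
    }

    // Runs both updates in one transaction and returns the number of rows changed.
    static int markFailed(Connection connection, long jobExecutionId) throws SQLException {
        Timestamp now = Timestamp.from(Instant.now());
        boolean previousAutoCommit = connection.getAutoCommit();
        connection.setAutoCommit(false);
        try {
            int updated = update(connection, STEP_SQL, now, jobExecutionId)
                        + update(connection, JOB_SQL, now, jobExecutionId);
            connection.commit();
            return updated;
        } catch (SQLException e) {
            connection.rollback();
            throw e;
        } finally {
            connection.setAutoCommit(previousAutoCommit);
        }
    }

    private static int update(Connection connection, String sql, Timestamp now, long jobExecutionId)
            throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setTimestamp(1, now); // END_TIME
            statement.setTimestamp(2, now); // LAST_UPDATED
            statement.setLong(3, jobExecutionId);
            return statement.executeUpdate();
        }
    }
}
```

After calling `StaleExecutionFixer.markFailed(connection, executionId)` with the id of the stuck execution, the restart attempt from the question (`JobOperator.restart(executionId)`) should no longer be rejected as "already running". Whether the manager step then only aggregates the existing worker results depends on your partition handler setup, so try it against a copy of the metadata tables first.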

Upvotes: 0
