Reputation: 157
I would like to understand the integration between DeployerPartitionHandler and DeployerStepExecutionHandler during remote partitioning.
How are the start time, end time, and execution status of the parent task execution updated when there are multiple workers?
What happens if one of the worker processes becomes unresponsive for some external reason? Is there a way to handle this situation programmatically, i.e., to kill the unresponsive process and fail the step?
Thanks in advance for any input!
Upvotes: 0
Views: 612
Reputation: 21463
You have a number of questions here, so let me answer them one at a time.
How are the start time, end time, and execution status of the parent task execution updated when there are multiple workers?
All components within this architecture are tasks. The parent is a task and each worker is a task, so they all update the task repository independently. The parent application marks the start time at the beginning of the task (before any CommandLineRunner or ApplicationRunner implementations are called). It updates the end time and results once all the workers are done, since the remote partitioned step won't complete until all the workers have either completed or timed out.
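To make the "completed or timed out" behavior concrete, here is a minimal plain-Java sketch of a parent polling its workers' statuses. It is an illustration only, not the DeployerPartitionHandler source: the class PartitionPollingSketch, the Status enum, and the awaitWorkers method are hypothetical names, and returning FAILED on a timeout is a simplifying assumption about how the parent gives up on a worker that never reports back.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration; none of these types come from Spring Batch or Spring Cloud Task.
class PartitionPollingSketch {

    enum Status { STARTED, COMPLETED, FAILED }

    /**
     * Polls the worker statuses until no worker is STARTED or the timeout elapses.
     * Returns COMPLETED only if every worker completed; otherwise FAILED.
     */
    static Status awaitWorkers(Supplier<List<Status>> repository,
                               Duration pollInterval,
                               Duration timeout) {
        Instant deadline = Instant.now().plus(timeout);
        while (true) {
            List<Status> statuses = repository.get();
            if (!statuses.contains(Status.STARTED)) {
                // Every worker has reported a final status.
                boolean allCompleted = statuses.stream().allMatch(s -> s == Status.COMPLETED);
                return allCompleted ? Status.COMPLETED : Status.FAILED;
            }
            if (Instant.now().isAfter(deadline)) {
                // Assumption: a worker that never reports back causes the parent to fail.
                return Status.FAILED;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Status.FAILED;
            }
        }
    }

    public static void main(String[] args) {
        // One worker stays STARTED forever, simulating a hung process.
        Status result = awaitWorkers(
                () -> List.of(Status.COMPLETED, Status.STARTED),
                Duration.ofMillis(10),
                Duration.ofMillis(100));
        System.out.println("Parent status: " + result);
    }
}
```

The point of the sketch is that the parent's end time and final status can only be written after the loop exits, which is why they reflect the slowest (or hung) worker.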
What happens if one of the worker processes becomes unresponsive for some external reason?
The deployers used by the DeployerPartitionHandler depend on a platform (CloudFoundry, Kubernetes, etc.) for production use. Each of these platforms handles hung processes in its own way, so the answer to this question is really platform-specific. In most cases, if a process is identified as unhealthy (by whatever definition the platform uses), it will be shut down.
Is there a way to handle this situation programmatically, i.e., to kill the unresponsive process and fail the step?
If a partition fails during execution, the parent will also be marked as failed and can be restarted. On a restart (by default), only the failed partitions will be re-run. Any partitions that have already completed will not be re-executed.
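The restart rule can be sketched in plain Java as follows. This is an illustration of the behavior described above, not framework code: PartitionRestartSketch, parentStatus, and partitionsToRerun are hypothetical names, and the partition names in main are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these types come from Spring Batch or Spring Cloud Task.
class PartitionRestartSketch {

    enum Status { COMPLETED, FAILED }

    /** The parent is FAILED as soon as any partition is FAILED. */
    static Status parentStatus(Map<String, Status> partitions) {
        return partitions.containsValue(Status.FAILED) ? Status.FAILED : Status.COMPLETED;
    }

    /** On restart, only partitions that did not complete are selected again. */
    static List<String> partitionsToRerun(Map<String, Status> partitions) {
        List<String> rerun = new ArrayList<>();
        for (Map.Entry<String, Status> entry : partitions.entrySet()) {
            if (entry.getValue() != Status.COMPLETED) {
                rerun.add(entry.getKey());
            }
        }
        return rerun;
    }

    public static void main(String[] args) {
        Map<String, Status> partitions = new LinkedHashMap<>();
        partitions.put("partition0", Status.COMPLETED);
        partitions.put("partition1", Status.FAILED);
        partitions.put("partition2", Status.COMPLETED);

        System.out.println("Parent status: " + parentStatus(partitions));
        System.out.println("Re-run on restart: " + partitionsToRerun(partitions));
    }
}
```

In the real framework this bookkeeping lives in the job repository rather than in an in-memory map; the sketch only shows the selection rule.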
Upvotes: 1