Hazelcast best practices to failover in parallel processing

Question

I'm new to Hazelcast, so I have a question about best practices for handling failures during parallel processing:

Mastering Hazelcast, section 6.6, p. 96:

Work-queue has no high availability: Each member will create one or more local ThreadPoolExecutors with ordinary work-queues that do the real work. When a task is submitted, it will be put on the work-queue of that ThreadPoolExecutor and will not be backed up by Hazelcast. If something would happen with that member, all unprocessed work will be lost.

Task:

Suppose I have 1 master node and 2 slaves. I launch a time-consuming task with

executor.submitToAllMembers(new TimeConsumingTask())

So each node is processing something, and while they are all busy, one of the slaves fails.

Questions:

1. It's not possible to rerun the failed member's work on another node, right?
2. If TimeConsumingTask is a Runnable, is there any better approach than rerunning the whole job set across the whole cluster?
3. If TimeConsumingTask is a Callable and I want a Future as the result of the cluster computation, is there any better approach than rerunning the whole job set across the whole cluster?

Mike · Accepted Answer

I'm assuming that by 'failure handling' you mean the scenario where a node in the cluster goes down.

Question 1: Not automatically. You are right that Hazelcast's executor tasks are not fault tolerant. However, if you can detect that a task has failed, I can't see a reason why you couldn't resubmit the work to another member in the cluster.

Question 2: It's difficult to know what your TimeConsumingTask is actually doing. As with any distributed execution engine, it's generally better to compose a long-running task as a series of smaller tasks, so that only the pieces lost with a failed member need to be rerun. If you can't break your task into smaller pieces, then no, there's not really a better approach than resubmitting the whole job.
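To make the "smaller tasks" idea concrete, here is a minimal sketch that uses only the plain JDK ExecutorService, not the Hazelcast API. The class ChunkedJobSketch, the methods runAll and sumRange, and the simulated failure are all hypothetical names invented for illustration. The point is just that when a job is split into chunks, only the chunk whose Future fails has to be submitted again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-JDK sketch (assumption: a local thread pool stands in for the cluster).
class ChunkedJobSketch {

    // Hypothetical unit of work: the sum of the integers in [from, to).
    static long sumRange(int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += i;
        }
        return sum;
    }

    // Submits every chunk, then collects the results in order. A chunk whose
    // Future fails is resubmitted on its own, up to maxAttempts attempts.
    static <T> List<T> runAll(ExecutorService pool, List<Callable<T>> chunks, int maxAttempts) {
        List<Future<T>> futures = new ArrayList<>();
        for (Callable<T> chunk : chunks) {
            futures.add(pool.submit(chunk));
        }
        List<T> results = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            Future<T> future = futures.get(i);
            for (int attempt = 1; ; attempt++) {
                try {
                    results.add(future.get());
                    break;
                } catch (ExecutionException e) {
                    if (attempt >= maxAttempts) {
                        throw new IllegalStateException(
                                "chunk " + i + " failed after " + attempt + " attempts", e.getCause());
                    }
                    future = pool.submit(chunks.get(i)); // rerun this chunk only
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for chunk " + i, e);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        AtomicInteger secondChunkCalls = new AtomicInteger();
        List<Callable<Long>> chunks = new ArrayList<>();
        chunks.add(() -> sumRange(0, 10));
        chunks.add(() -> {
            // Simulated failure: the first run of this chunk throws.
            if (secondChunkCalls.incrementAndGet() == 1) {
                throw new IllegalStateException("simulated member failure");
            }
            return sumRange(10, 20);
        });
        chunks.add(() -> sumRange(20, 30));

        long total = 0;
        for (long part : runAll(pool, chunks, 3)) {
            total += part;
        }
        pool.shutdown();
        System.out.println("total = " + total);
    }
}
```

With Hazelcast you would submit the chunks to an IExecutorService instead of a local pool. How a lost member surfaces on the Future should be checked against the version you run.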

Question 3: The same applies here as for question 2. Getting a Future back from a task submission is not going to help you much if a node fails. A Future lets you wait on the result (optionally with a timeout) and cancel the task.
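For reference, this is roughly what waiting with a timeout and cancelling look like with a standard java.util.concurrent.Future. It is a plain-JDK sketch, and FutureTimeoutSketch, getOrCancel and the fallback value are hypothetical names chosen for illustration. Treating a failed task the same as a timed-out one is an assumption of the sketch, not something Hazelcast prescribes.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Plain-JDK sketch: bounded wait on a Future, with cancellation on timeout.
class FutureTimeoutSketch {

    // Waits up to timeoutMillis for the task. If the task is too slow it is
    // cancelled; if it is too slow or fails, the fallback value is returned.
    static <T> T getOrCancel(ExecutorService pool, Callable<T> task, long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread if it is still running
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Integer slow = FutureTimeoutSketch.<Integer>getOrCancel(pool, () -> {
            Thread.sleep(5_000); // stands in for a time-consuming task
            return 42;
        }, 100, -1);
        System.out.println("slow task result: " + slow);
        pool.shutdownNow();
    }
}
```

Note that the Future only tells you that something went wrong or took too long. Deciding what to rerun is still up to your code.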

More generally, for handling a node failure I would check whether an ExecutionCallback would help. With a callback you get notified when the task fails, and I am assuming a node failure is reported that way. When your callback is notified of the failure, you could resubmit the job.
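The resubmit-from-the-callback idea could look something like the sketch below. It deliberately avoids Hazelcast types so it runs with the JDK alone: the nested Callback interface is a local stand-in with the same two methods as Hazelcast's ExecutionCallback (onResponse and onFailure), and CallbackRetrySketch and submitWithRetry are hypothetical names. Whether a member failure actually arrives in onFailure is the same assumption made above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Plain-JDK sketch: resubmit a task when its completion handler sees a failure.
class CallbackRetrySketch {

    // Local stand-in shaped like Hazelcast's ExecutionCallback (assumption:
    // node failures would be delivered through onFailure).
    interface Callback<V> {
        void onResponse(V response);
        void onFailure(Throwable t);
    }

    // Runs the task on the executor. On failure the task is resubmitted until
    // attemptsLeft is used up; only then is the error passed to the callback.
    static <V> void submitWithRetry(Executor executor, Supplier<V> task,
                                    int attemptsLeft, Callback<V> callback) {
        CompletableFuture.supplyAsync(task, executor).whenComplete((value, error) -> {
            if (error == null) {
                callback.onResponse(value);
            } else if (attemptsLeft > 1) {
                submitWithRetry(executor, task, attemptsLeft - 1, callback);
            } else {
                callback.onFailure(error);
            }
        });
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        AtomicInteger calls = new AtomicInteger();
        CompletableFuture<String> done = new CompletableFuture<>();

        submitWithRetry(pool, () -> {
            // Simulated failure on the first attempt only.
            if (calls.incrementAndGet() == 1) {
                throw new IllegalStateException("simulated member failure");
            }
            return "finished on attempt " + calls.get();
        }, 3, new Callback<String>() {
            @Override public void onResponse(String response) { done.complete(response); }
            @Override public void onFailure(Throwable t) { done.completeExceptionally(t); }
        });

        System.out.println(done.join());
        pool.shutdown();
    }
}
```

In a real cluster you would probably also want to pick a different target member for the retry and cap the number of attempts, as the sketch does with attemptsLeft.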

You might also want to look at approaches outside the core Hazelcast API. Hazeltask is a project on GitHub that promises failover handling and task resubmission, so it might be worth a look.
