user3784318

Reputation: 31

ClusterSingletonManager not failing over

I have been testing a Master / Worker cluster with the following setup:

I was testing failover of the Masters by manually shutting down the "Active" Master node. When the Workers are not processing tasks, failover works fine: the "Non-active" Master node detects the other node as unreachable and eventually starts its Master actor.

But if the Workers are busy, failover does not complete. The "Non-active" Master node detects the other node as unreachable and quarantines it, as shown in the message below, but it never starts the Master actor.

2014-07-23 23:52:31,777 INFO [JobRunner-akka.actor.default-dispatcher-17] Quarantined address [akka.tcp://[email protected]:40000] is still unreachable or has not been restarted. Keeping it quarantined.

Does anybody have any idea why this is happening, and whether there is a solution?

Thanks. Regards.

Upvotes: 1

Views: 253

Answers (2)

user3784318

Reputation: 31

In the end, putting the Master nodes on their own servers (separate from the Workers) worked.

Upvotes: 1

Which version of Akka are you using? There have been improvements in heartbeat prioritization recently – please upgrade to 2.3.4 and check.
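
If upgrading alone does not help, one thing that may be worth trying is giving the cluster actors their own dispatcher, so that heartbeat handling is less likely to be starved by busy worker actors in the same JVM. This is only a sketch: it assumes the problem is thread starvation in the node's own JVM (which the thread does not confirm), the dispatcher name `cluster-dispatcher` is arbitrary, and the pool sizes and heartbeat pause are illustrative values, not recommendations.

```
# application.conf (sketch; values are illustrative assumptions)
akka.cluster {
  # Run the cluster extension's internal actors (including heartbeating)
  # on a dedicated dispatcher instead of the default one.
  use-dispatcher = cluster-dispatcher

  # Tolerate longer gaps between heartbeats before marking a node unreachable.
  failure-detector.acceptable-heartbeat-pause = 10 s
}

# Small dedicated thread pool for the cluster actors.
cluster-dispatcher {
  type = "Dispatcher"
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 2
    parallelism-max = 4
  }
}
```

A dedicated dispatcher only isolates threads inside one JVM. If the Workers saturate the whole machine's CPU, moving the Masters to separate servers, as in the other answer, is still the more robust fix.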

Upvotes: 0
