Reputation: 801
I'm building a library with Akka actors in Scala to do some large-scale data crunching.
I'm running my code on Amazon EC2 spot instances using StarCluster. The program is unstable because actor remoting sometimes drops: while the code is running, nodes disconnect one by one within a few minutes. The nodes log errors like:
[ERROR] [07/16/2014 17:40:06.837] [slave-akka.actor.default-dispatcher-4] [akka://slave/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fslave%40master%3A2552-0/endpointWriter] AssociationError [akka.tcp://slave@node005:2552] -> [akka.tcp://slave@master:2552]: Error [Association failed with [akka.tcp://slave@master:2552]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://slave@master:2552]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: master
and
[WARN] [07/16/2014 17:30:05.548] [slave-akka.actor.default-dispatcher-12] [Remoting] Tried to associate with unreachable remote address [akka.tcp://slave@master:2552]. Address is now quarantined, all messages to this address will be delivered to dead letters.
This happens even though I can ping between the nodes just fine.
I've been trying to fix this, and I suspect it's a configuration setting. The Akka remoting documentation even says:
However in cloud environments, such as Amazon EC2, the value could be increased to 12 in order to account for network issues that sometimes occur on such platforms.
However, I've set the threshold to that value and well beyond, and still had no luck fixing the issue. Here is my current remoting configuration:
akka {
  actor {
    provider = "akka.remote.RemoteActorRefProvider"
  }
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      port = 2552
      # for modelling
      #send-buffer-size = 50000000b
      #receive-buffer-size = 50000000b
      #maximum-frame-size = 25000000b
      send-buffer-size = 5000000b
      receive-buffer-size = 5000000b
      maximum-frame-size = 2500000b
    }
    watch-failure-detector.threshold = 100
    acceptable-heartbeat-pause = 20s
    transport-failure-detector {
      heartbeat-interval = 4 s
      acceptable-heartbeat-pause = 20 s
    }
  }
  log-dead-letters = off
}
I deploy all of my actors from the master node, like so:
val o2m = system.actorOf(Props(classOf[IntOneToMany], p), name = "o2m")
val remote = Deploy(scope = RemoteScope(Address("akka.tcp", "slave", args(i), 2552)))
val b = system.actorOf(Props(classOf[IntBoss], o2m).withDeploy(remote), name = "boss_" + i)
etc.
Can anyone point out the mistake I'm making, or how I can fix this problem and stop nodes from disconnecting? Alternatively, a solution that simply relaunches the actors when they are disconnected would also work; I don't care much about dropped messages. I thought this was supposed to be easily configurable behavior, but I'm having trouble finding the right place to look for it.
Thank you
Upvotes: 0
Views: 1357
Reputation: 652
At the very least, the configuration syntax is wrong: acceptable-heartbeat-pause should be nested under watch-failure-detector (in your config it sits one level up, directly under remote). The blocks should look like this:
watch-failure-detector {
  threshold = 100
  acceptable-heartbeat-pause = 20 s
}
transport-failure-detector {
  heartbeat-interval = 4 s
  acceptable-heartbeat-pause = 20 s
}
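For context, here is a sketch of how the full configuration from the question might look with that nesting applied. All values are the asker's own and only the placement of acceptable-heartbeat-pause changes; the commented-out buffer sizes are left out for brevity. Whether this alone stops the disconnects is not confirmed here.

```
akka {
  actor {
    provider = "akka.remote.RemoteActorRefProvider"
  }
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      port = 2552
      send-buffer-size = 5000000b
      receive-buffer-size = 5000000b
      maximum-frame-size = 2500000b
    }
    # Both settings now sit inside the watch-failure-detector block, so the
    # pause resolves to akka.remote.watch-failure-detector.acceptable-heartbeat-pause
    # instead of akka.remote.acceptable-heartbeat-pause.
    watch-failure-detector {
      threshold = 100
      acceptable-heartbeat-pause = 20 s
    }
    transport-failure-detector {
      heartbeat-interval = 4 s
      acceptable-heartbeat-pause = 20 s
    }
  }
  log-dead-letters = off
}
```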
Upvotes: 2