nair.ashvin

Reputation: 801

Akka Remoting Failures with Amazon EC2

I'm building a library with Akka actors in Scala to do some large-scale data crunching.

I'm running my code on Amazon EC2 spot instances using StarCluster. The program is unstable because the actor remoting sometimes drops:

While the code is running, nodes disconnect one by one within a few minutes. The disconnecting nodes log something like:

[ERROR] [07/16/2014 17:40:06.837] [slave-akka.actor.default-dispatcher-4] [akka://slave/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fslave%40master%3A2552-0/endpointWriter] AssociationError [akka.tcp://slave@node005:2552] -> [akka.tcp://slave@master:2552]: Error [Association failed with [akka.tcp://slave@master:2552]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://slave@master:2552]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: master

and

[WARN] [07/16/2014 17:30:05.548] [slave-akka.actor.default-dispatcher-12] [Remoting] Tried to associate with unreachable remote address [akka.tcp://slave@master:2552]. Address is now quarantined, all messages to this address will be delivered to dead letters.

This happens even though I can ping between the nodes just fine.

I've been trying to fix this, and I figure it's some configuration setting. The Akka remoting documentation even says,

However in cloud environments, such as Amazon EC2, the value could be increased to 12 in order to account for network issues that sometimes occur on such platforms.
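If I'm reading the docs correctly, the setting they're referring to is the failure detector threshold, i.e. something like:

akka.remote.watch-failure-detector.threshold = 12.0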

However, I've set that value and well beyond it, and still had no luck fixing the issue. Here is my current remoting configuration:

akka {
  actor {
    provider = "akka.remote.RemoteActorRefProvider"
  }
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      port = 2552
      # for modelling
      #send-buffer-size = 50000000b
      #receive-buffer-size = 50000000b
      #maximum-frame-size = 25000000b
      send-buffer-size = 5000000b
      receive-buffer-size = 5000000b
      maximum-frame-size = 2500000b
    }
    watch-failure-detector.threshold = 100
    acceptable-heartbeat-pause = 20s
    transport-failure-detector {
      heartbeat-interval = 4 s
      acceptable-heartbeat-pause = 20 s
    }
  }
  log-dead-letters = off
}

I deploy my actors like so, all from the master node:

val o2m = system.actorOf(Props(classOf[IntOneToMany], p), name = "o2m")
val remote = Deploy(scope = RemoteScope(Address("akka.tcp", "slave", args(i), 2552)))
val b = system.actorOf(Props(classOf[IntBoss], o2m).withDeploy(remote), name = "boss_" + i)
etc.

Can anyone point out a mistake I'm making, or how I can fix this problem and stop the nodes from disconnecting? Alternatively, a solution that simply re-launches the actors when they disconnect would also work; I don't care much about dropped messages. In fact, I thought this was supposed to be easily configurable behavior, but I'm having trouble finding the right place to look for it.
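The best workaround I can think of for the re-launching part is to watch the remotely deployed actor from the master and re-deploy it when a Terminated message arrives. A rough sketch of what I mean (RedeploySupervisor and deployBoss are names I made up, not Akka APIs):

import akka.actor._
import akka.remote.RemoteScope

// Sketch only: watches a remotely deployed boss and re-deploys it when the node drops.
class RedeploySupervisor(o2m: ActorRef, host: String) extends Actor {

  // deploy (or re-deploy) the boss actor onto the remote node
  def deployBoss(): ActorRef = {
    val remote = Deploy(scope = RemoteScope(Address("akka.tcp", "slave", host, 2552)))
    val boss = context.actorOf(Props(classOf[IntBoss], o2m).withDeploy(remote))
    context.watch(boss) // DeathWatch: a Terminated message arrives if the node is lost
    boss
  }

  var boss: ActorRef = deployBoss()

  def receive = {
    case Terminated(dead) if dead == boss =>
      boss = deployBoss()  // remote node dropped or quarantined: launch the boss again
    case msg =>
      boss forward msg     // pass everything else through to the current boss
  }
}

which I would then start with something like system.actorOf(Props(classOf[RedeploySupervisor], o2m, args(i)), name = "sup_" + i), but I'm not sure this is the intended approach.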

Thank you

Upvotes: 0

Views: 1357

Answers (1)

Haiying Wang

Reputation: 652

At least the property nesting is wrong: acceptable-heartbeat-pause should be inside the watch-failure-detector block (in your config they are at the same level). They should look like this:

watch-failure-detector {
  threshold = 100
  acceptable-heartbeat-pause = 20 s
}
transport-failure-detector {
  heartbeat-interval = 4 s
  acceptable-heartbeat-pause = 20 s
}
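Nested under akka.remote with the rest of your settings, that part of the config would then look roughly like this (keeping your other values unchanged):

akka {
  remote {
    watch-failure-detector {
      threshold = 100
      acceptable-heartbeat-pause = 20 s
    }
    transport-failure-detector {
      heartbeat-interval = 4 s
      acceptable-heartbeat-pause = 20 s
    }
  }
}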

Upvotes: 2
