DailRowe

Reputation: 31

ipcluster - can't start more than about 110 ipengines - or maybe some of them die

I'm having difficulty getting ipcluster to start all of the ipengines I ask for. It appears to be some sort of timeout issue. I'm using IPython 2.0 on a Linux cluster with 192 processors: I run a local ipcontroller and start ipengines on my 12 nodes over SSH. It doesn't seem to be a configuration problem (at least I don't think it is), because roughly 110 ipengines start without any trouble. When I ask for a larger number, some of them seem to die during startup - and only some of them; the final count varies a little from run to run. ipcluster reports that all engines have started. The only sign of trouble I can find (other than not having all of the requested engines) is the following in some of the ipengine logs:

2014-06-20 16:42:13.302 [IPEngineApp] Loading url_file u'.ipython/profile_ssh/security/ipcontroller-engine.json'
2014-06-20 16:42:13.335 [IPEngineApp] Registering with controller at tcp://10.1.0.253:55576
2014-06-20 16:42:13.429 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
2014-06-20 16:42:13.434 [IPEngineApp] Using existing profile dir: u'.ipython/profile_ssh'
2014-06-20 16:42:13.436 [IPEngineApp] Completed registration with id 49
2014-06-20 16:42:25.472 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 18:09:12.782 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 19:14:22.760 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
2014-06-20 20:00:34.969 [IPEngineApp] WARNING | No heartbeat in the last 3010 ms (1 time(s) in a row).
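
For reference, the SSH launcher setup described above is configured in the profile's ipcluster_config.py along these lines; the hostnames and per-node engine counts below are placeholders, not my actual values:

c = get_config()

# Launch engines over SSH; the controller stays on the local machine.
c.IPClusterEngines.engine_launcher_class = 'SSH'

# One entry per node: hostname -> number of engines to start there.
# (Placeholder names and counts - 12 nodes x 16 engines = 192 total.)
c.SSHEngineSetLauncher.engines = {
    'node01': 16,
    'node02': 16,
    # ... node03 through node12 ...
}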

Some googling turned up only one relevant discussion, http://permalink.gmane.org/gmane.comp.python.ipython.devel/12228, whose author also suspects a timeout of some kind.

I also tried tripling the IPClusterStart.early_shutdown and IPClusterEngines.early_shutdown timeouts (90 seconds instead of the default 30), without any luck.
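
For reference, those timeouts can be set in the profile's ipcluster_config.py, roughly along these lines:

c = get_config()

# Give engines more time to register before ipcluster gives up on them
# (the default is 30 seconds).
c.IPClusterStart.early_shutdown = 90
c.IPClusterEngines.early_shutdown = 90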

Thanks in advance for any pointers on getting full use of my cluster.

Upvotes: 3

Views: 516

Answers (1)

Ivelin

Reputation: 13309

When I try to execute ipcluster start --n=200 I get: OSError: [Errno 24] Too many open files
This could be what is happening in your case too. Try raising the operating system's open file limit.
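
A quick way to see whether a process is hitting that limit is to check the soft/hard limits from Python; this is a minimal sketch using the standard resource module, and it only affects the process it runs in (and children started afterwards):

import resource

# Per-process open file limit: the soft limit is what actually triggers
# "Too many open files"; the hard limit is the ceiling a non-root user
# can raise the soft limit to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft=%d hard=%d" % (soft, hard))

# Raise the soft limit to the hard limit for this process.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

To raise the limit for the shell that launches ipcontroller/ipcluster, the usual route is ulimit -n <value> in that shell, or an entry in /etc/security/limits.conf for a permanent, system-wide change.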

Upvotes: 1
