Reputation: 808
Could someone explain what the "InfiniBand stacks" are? They were recently changed on our machine, and I started running into MPI communication failures. I would like to understand how this change might be affecting the stability of my parallel jobs.
The actual error message I got was:
A process failed to create a queue pair. This usually means either the device has run out of queue pairs (too many connections) or there are insufficient resources available to allocate a queue pair (out of memory). The latter can happen if either 1) insufficient memory is available, or 2) no more physical memory can be registered with the device.
[connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
Upvotes: 0
Views: 285
Reputation: 1755
The "openib" in that message suggests that it's your OpenFabrics OFED that changed and might be causing problems: https://www.openfabrics.org/index.php. See if you can swap it out, or isolate the other parts of the software stack, such as the Open MPI version and the application code.
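To isolate the pieces, it helps to record which versions are actually installed before and after a change. Here is a minimal sketch that collects that information. It assumes the usual tools (`ofed_info`, `mpirun`, `ibv_devinfo`) are on your PATH; flags and output can differ between releases, and the helper name `collect_stack_info` is just for illustration.

```python
import shutil
import subprocess

# Commands assumed to exist on a typical OFED + Open MPI install.
# Any that are missing are reported as unavailable instead of failing.
COMMANDS = {
    "ofed": ["ofed_info", "-s"],          # OFED release string
    "open_mpi": ["mpirun", "--version"],  # MPI launcher version
    "hca": ["ibv_devinfo"],               # adapter and firmware details
}


def collect_stack_info(commands=COMMANDS, timeout=10):
    """Run each command if it is on PATH; map name -> output text or None."""
    report = {}
    for name, argv in commands.items():
        if shutil.which(argv[0]) is None:
            report[name] = None  # tool not installed or not on PATH
            continue
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            # Some tools print their version on stderr, so fall back to it.
            report[name] = (result.stdout or result.stderr).strip()
        except (OSError, subprocess.SubprocessError):
            report[name] = None  # could not run or timed out
    return report


if __name__ == "__main__":
    for name, output in collect_stack_info().items():
        print(f"== {name} ==")
        print(output if output is not None else "(not available)")
```

Saving this output from a compute node gives you something concrete to compare against after the next driver update, and to attach when you ask for support.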
Also, if you're using IMPI, contact Intel for support. The recommendation to check with OpenMPI was a good one just based on how many users are out there, but they can't help much with Intel products.
Upvotes: 0
Reputation: 9062
Usually, when someone talks about a "stack" in a software context, they mean the drivers, libraries, etc. that control a particular piece of hardware. For instance, the network "stack" may mean all of the layers of network software between your application and the physical network interface card (NIC). That's probably what is meant in this instance.
Of course, there's the other kind of software stack relating to memory allocation, but that's not what this is about.
Anyway, if you didn't change anything in your application (including the environment in which you run it) and your system administrators recently updated the InfiniBand drivers, it's possible that there's some sort of incompatibility between Open MPI and your InfiniBand library. That's not usually the case, but you can probably find out by asking the Open MPI guys directly. A few of them hang out here on SO, but for the most part, you'll need to contact them directly by emailing users [at] open-mpi [dot] org.
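On the "environment" point: the error text mentions memory that can no longer be registered with the device. Registered memory is pinned, so the per-process locked-memory limit is one setting worth ruling out. This is only a sketch of how you might check it on Linux, not a diagnosis of your problem; the function name is made up for illustration.

```python
import resource


def describe_memlock_limit():
    """Return the (soft, hard) locked-memory limits as readable strings."""

    def fmt(value):
        # getrlimit reports bytes; RLIM_INFINITY means no limit is set.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} KiB"

    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = describe_memlock_limit()
    print(f"locked memory limit: soft={soft}, hard={hard}")
```

Run it the same way your MPI job is launched (for example, inside a batch job), since the limits there can differ from those in your login shell.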
Upvotes: 1