Reputation: 103
I have a small cluster of 4 nodes, each with 4 cores. I can happily run HP Linpack on one node, but I'm struggling to get it to run on multiple nodes.
I compiled HPL-2.3 from source with OpenMPI and OpenBLAS. All seems to work well with single node tests.
My 'nodes' file is:
192.168.0.1 slots=4
192.168.0.2 slots=4
192.168.0.3 slots=4
192.168.0.4 slots=4
If I run mpirun -np 16 -hostfile nodes uptime
I get the following:
19:10:49 up 8:46, 1 user, load average: 0.05, 0.53, 0.34
19:10:49 up 8:46, 1 user, load average: 0.05, 0.53, 0.34
19:10:49 up 8:46, 1 user, load average: 0.05, 0.53, 0.34
19:10:49 up 9 min, 0 users, load average: 0.08, 0.06, 0.03
19:10:49 up 9 min, 0 users, load average: 0.08, 0.06, 0.03
19:10:49 up 9 min, 0 users, load average: 0.08, 0.06, 0.03
19:10:49 up 8:46, 1 user, load average: 0.05, 0.53, 0.34
19:10:49 up 37 min, 0 users, load average: 0.08, 0.02, 0.01
19:10:49 up 37 min, 0 users, load average: 0.08, 0.02, 0.01
19:10:49 up 37 min, 0 users, load average: 0.08, 0.02, 0.01
19:10:49 up 20 min, 0 users, load average: 0.00, 0.02, 0.00
19:10:49 up 9 min, 0 users, load average: 0.08, 0.06, 0.03
19:10:49 up 20 min, 0 users, load average: 0.00, 0.02, 0.00
19:10:49 up 20 min, 0 users, load average: 0.00, 0.02, 0.00
19:10:49 up 37 min, 0 users, load average: 0.08, 0.02, 0.01
19:10:49 up 20 min, 0 users, load average: 0.00, 0.02, 0.00
which suggests to me that OpenMPI is working and distributing uptime
across the 4 nodes, 16 cores in total.
However, when I run mpirun -np 16 -hostfile nodes xhpl
I get the following:
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 8; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).
Node: 192.168.0.3
Executable: /home/ucapjbj/phas0077/projects/hpl-2.3/bin/arch/xhpl
This suggests to me that xhpl cannot be found on node 192.168.0.3, which seems reasonable, since it is only present on 192.168.0.1, my development node. Conceptually, though, I was under the impression that I could develop on one node and have OpenMPI distribute the executable to the other nodes for execution, without having to copy it to each node beforehand. Have I fundamentally misunderstood this?
Any guidance would be much appreciated.
Kind regards
John
Upvotes: 0
Views: 249
Reputation: 103
It appears I have to copy the 'xhpl' executable to the same location on each node.
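A sketch of the copy step, shown as a dry run that only prints the commands (swap the echo for real ssh/scp calls; this assumes passwordless SSH between the nodes and that the same path is usable on each):

```shell
#!/bin/sh
# Dry-run sketch: print the commands that would copy xhpl to the same
# absolute path on each worker node. Assumes passwordless SSH is set up.
BIN=/home/ucapjbj/phas0077/projects/hpl-2.3/bin/arch/xhpl
DIR=$(dirname "$BIN")
for host in 192.168.0.2 192.168.0.3 192.168.0.4; do
    echo "ssh $host mkdir -p $DIR"
    echo "scp $BIN $host:$BIN"
done
```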
I've looked at the mpirun --preload-binary option, which would appear to be exactly what I want, but I can't get it to work. Any advice would be very welcome.
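For reference, this is the kind of invocation I was attempting. I'm not certain of the exact requirements, but I believe --preload-binary expects the executable's directory to exist and be writable on each remote node:

```shell
# Attempted invocation (not working for me yet): ask mpirun to stage the
# binary on each remote node before launching. Path as in the error above.
mpirun --preload-binary -np 16 -hostfile nodes \
    /home/ucapjbj/phas0077/projects/hpl-2.3/bin/arch/xhpl
```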
Best wishes
John
Upvotes: 0