Reputation: 137
I have set up two nodes for MPI, aml1 (master) and aml2 (worker). I am trying to use mpirun to run R scripts that use the Rmpi and doMPI libraries. Both machines have the same specs:
On RHEL 7.3
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
Stepping: 7
CPU MHz: 2900.000
BogoMIPS: 5790.14
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
I can also provide the hwloc lstopo output if it helps.
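For reference, the lscpu output above already implies 2 sockets x 8 cores = 16 physical cores and 32 hardware threads. A quick way to double-check that (my own sanity check, not part of the original setup):
$ lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l   # unique (core,socket) pairs = physical cores; should print 16 on this box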
I am using Open MPI 1.10.5 and I can see processes running on aml1 and aml2. However, my test script doesn't run any faster when I increase the number of workers spawned from mpirun, so I see no decrease in computation time. This makes me think that mpirun isn't properly detecting how many cores are available, or that I am assigning them incorrectly in the hostfile or rankfile.
If I change my hostfile or rankfile to different values of slots:
$ cat hosts
aml1 slots=4 max_slots=8 #I can change this to 10 slots
aml2 slots=4
$ cat rankfile
rank 0=aml1 slot=0:0
rank 1=aml1 slot=0:1
rank 2=aml1 slot=0:2
rank 3=aml1 slot=0:3
rank 4=aml2 slot=0:6
rank 5=aml2 slot=0:7 #I can add more ranks
And then I run:
$ mpirun -np 1 --hostfile hosts --rankfile rankfile R --slave -f example7.R
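As a sanity check before involving R at all (my own suggestion; --report-bindings is a standard Open MPI option and hostname is just a stand-in payload), something like this shows which node and core each rank actually lands on:
$ mpirun -np 6 --hostfile hosts --rankfile rankfile --report-bindings hostname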
$ cat example7.R
library(doMPI)
cl <- startMPIcluster(verbose=TRUE)
registerDoMPI(cl)
system.time(x <- foreach(seed=c(7, 11, 13), .combine="cbind") %dopar% {
set.seed(seed)
rnorm(90000000)
})
closeCluster(cl)
mpi.quit(save="no")
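(If it helps, here is a small check I could drop into the script just before closeCluster(cl) — a sketch of my own, not part of the original example — to confirm that tasks really run on both nodes:)
hosts <- foreach(i=1:8, .combine="c") %dopar% Sys.info()[["nodename"]]
print(table(hosts))   # expect a mix of aml1 and aml2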
I still get similar elapsed times:
Spawning 5 workers using the command:
5 slaves are spawned successfully. 0 failed.
user system elapsed
9.023 7.396 16.420
Spawning 25 workers using the command:
25 slaves are spawned successfully. 0 failed.
user system elapsed
4.752 8.755 13.508
I've also tried setting up Torque and building Open MPI with the tm configure option, but I'm having separate issues with that. I believe I don't necessarily need Torque to accomplish what I want, but please correct me if I'm wrong.
What I want to do is run an R script with Rmpi and doMPI. The R script itself should only be run once, with a section of code spawned out to the cluster. I want to maximize the cores available on both nodes (aml1, aml2).
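In doMPI terms, I think the shape of what I want looks roughly like this (a sketch only; count is the documented first argument of startMPIcluster, and 30 is just my guess at "all the cores minus a couple for the master"):
library(doMPI)
cl <- startMPIcluster(count=30, verbose=TRUE)   # ask for 30 workers explicitly instead of the default worker count
registerDoMPI(cl)
# ... the %dopar% section goes here ...
closeCluster(cl)
mpi.quit(save="no")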
Appreciate any help from the community!
Here's a bit more detail: I run the following, changing the hostfile for each run:
$ mpirun -np 1 --hostfile hosts [using --map-by slot or node] R --slave -f example7.R
+----------------+-------------------+-------------------+
| slots per host | --map-by node (s) | --map-by slot (s) |
+----------------+-------------------+-------------------+
| 2              | 24.1              | 24.109            |
| 4              | 18                | 12.605            |
| 4              | 18.131            | 12.051            |
| 6              | 18.809            | 12.682            |
| 6              | 19.027            | 12.69             |
| 8              | 18.982            | 12.82             |
| 8              | 18.627            | 12.76             |
+----------------+-------------------+-------------------+
Should I be getting reduced times? Or is this as good as it gets? I feel like I should be able to increase my slots per host to 30 for peak performance, but it peaks around 4 slots per host.
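For completeness, the exact variant I could still try (whether explicit core binding changes anything is precisely what I'm unsure about; --bind-to core and --report-bindings are standard Open MPI options, the rest matches the command above):
$ mpirun -np 1 --hostfile hosts --map-by slot --bind-to core --report-bindings R --slave -f example7.R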
Upvotes: 2
Views: 1398
Reputation: 137
I think I found an answer to my own question.
Since I am new to this, I was under the assumption that Torque would automatically use all the "cores" available on a machine/node. Since lscpu reports 32 CPUs, I was expecting 32 workers to be spawned per node. But the machine actually has 16 physical cores (2 sockets x 8 cores), and hyperthreading makes each of them appear as two hardware threads, which is where the 32 logical CPUs come from. From my understanding, Torque launches one process per processor (i.e., per physical core in this case), so I shouldn't expect 32 workers to be spawned per node.
I also read up on NUMA support: per the Open MPI FAQ, RHEL typically requires the numactl-devel package to be installed before building Open MPI in order to support memory affinity. After installing it on each node, I am able to run an R script through Torque requesting 8 or 16 cores per node, and the computation times are quite similar. If I increase to 18 or 20 cores per node, performance drops, as expected.
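For anyone following along, this is roughly what "16 cores per node" looks like as a Torque job (a sketch; the job name and ppn value are just illustrative, and the mpirun line relies on the tm-enabled build picking up the node list from Torque, so no hostfile is needed):
#!/bin/bash
#PBS -N rmpi_test
#PBS -l nodes=2:ppn=16
cd $PBS_O_WORKDIR
mpirun -np 1 R --slave -f example7.R
(submitted with qsub)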
Below are my configure options for Torque and Open MPI, respectively:
./configure --enable-cgroups --with-hwloc-path=/usr/local --enable-autorun --prefix=/var/spool/torque
./configure --prefix=/var/nfsshare/openmpi1.10.5-tm-3 --with-tm=/var/spool/torque/
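To confirm that tm support actually made it into the Open MPI build (ompi_info ships with Open MPI; grepping for tm should list the tm plm and ras components):
$ /var/nfsshare/openmpi1.10.5-tm-3/bin/ompi_info | grep tm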
Upvotes: 0