Reputation: 137
I have set up two nodes for MPI, aml1 (master) and aml2 (worker). I am trying to use mpirun to run R scripts that use the Rmpi and doMPI libraries. Both machines have the same specs:
On RHEL 7.3
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
Stepping: 7
CPU MHz: 2900.000
BogoMIPS: 5790.14
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
I can also provide the hwloc lstopo output if it helps.
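For reference, the lscpu output above already implies 2 sockets x 8 cores = 16 physical cores and 32 hardware threads. A quick way to double-check that (my own sanity check, not part of the original setup):
$ lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l   # unique (core,socket) pairs = physical cores; should print 16 on this box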
I am using Open MPI 1.10.5 and I can see processes running on aml1 and aml2. However, my test script doesn't run any faster when I increase the number of workers spawned from mpirun, so I see no decrease in computation time. This makes me think that mpirun isn't properly detecting how many cores are available, or that I am assigning them incorrectly in the hostfile or rankfile.
If I change my hostfile or rankfile to different values of slots:
$ cat hosts
aml1 slots=4 max_slots=8 #I can change this to 10 slots
aml2 slots=4
$ cat rankfile
rank 0=aml1 slot=0:0
rank 1=aml1 slot=0:1
rank 2=aml1 slot=0:2
rank 3=aml1 slot=0:3
rank 4=aml2 slot=0:6
rank 5=aml2 slot=0:7 #I can add more ranks
And then I run:
$ mpirun -np 1 --hostfile hosts --rankfile rankfile R --slave -f example7.R
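As a sanity check before involving R at all (my own suggestion; --report-bindings is a standard Open MPI option and hostname is just a stand-in payload), something like this shows which node and core each rank actually lands on:
$ mpirun -np 6 --hostfile hosts --rankfile rankfile --report-bindings hostname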
$ cat example7.R
library(doMPI)
cl <- startMPIcluster(verbose=TRUE)
registerDoMPI(cl)
system.time(x <- foreach(seed=c(7, 11, 13), .combine="cbind") %dopar% {
set.seed(seed)
rnorm(90000000)
})
closeCluster(cl)
mpi.quit(save="no")
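(If it helps, here is a small check I could drop into the script just before closeCluster(cl) — a sketch of my own, not part of the original example — to confirm that tasks really run on both nodes:)
hosts <- foreach(i=1:8, .combine="c") %dopar% Sys.info()[["nodename"]]
print(table(hosts))   # expect a mix of aml1 and aml2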
I still get similar elapsed times:
Spawning 5 workers using the command:
5 slaves are spawned successfully. 0 failed.
user system elapsed
9.023 7.396 16.420
Spawning 25 workers using the command:
25 slaves are spawned successfully. 0 failed.
user system elapsed
4.752 8.755 13.508
I've also tried setting up Torque and building Open MPI with the tm configure option, but I'm having separate issues with that. I believe I don't necessarily need Torque to accomplish what I want, but please correct me if I'm wrong.
What I want to do is run an R script with Rmpi and doMPI. The R script itself should only be run once, with a section of code spawned out to the cluster. I want to maximize the cores available on both nodes (aml1, aml2).
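In doMPI terms, I think the shape of what I want looks roughly like this (a sketch only; count is the documented first argument of startMPIcluster, and 30 is just my guess at "all the cores minus a couple for the master"):
library(doMPI)
cl <- startMPIcluster(count=30, verbose=TRUE)   # ask for 30 workers explicitly instead of the default worker count
registerDoMPI(cl)
# ... the %dopar% section goes here ...
closeCluster(cl)
mpi.quit(save="no")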
Appreciate any help from the community!
Here's a bit more detail: I run the following, changing the hostfile for each run:
$ mpirun -np 1 --hostfile hosts [using --map-by slot or node] R --slave -f example7.R
+----------------+-------------------+-------------------+
| slots per host | --map-by node (s) | --map-by slot (s) |
+----------------+-------------------+-------------------+
| 2              | 24.1              | 24.109            |
| 4              | 18                | 12.605            |
| 4              | 18.131            | 12.051            |
| 6              | 18.809            | 12.682            |
| 6              | 19.027            | 12.69             |
| 8              | 18.982            | 12.82             |
| 8              | 18.627            | 12.76             |
+----------------+-------------------+-------------------+
Should I be getting reduced times? Or is this as good as it gets? I feel like I should be able to increase my slots per host to 30 for peak performance, but it peaks around 4 slots per host.
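For completeness, the exact variant I could still try (whether explicit core binding changes anything is precisely what I'm unsure about; --bind-to core and --report-bindings are standard Open MPI options, the rest matches the command above):
$ mpirun -np 1 --hostfile hosts --map-by slot --bind-to core --report-bindings R --slave -f example7.R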
Upvotes: 2
Views: 1398
Reputation: 137
I think I found an answer to my own question.
Since I am new to this, I was under the assumption that Torque would automatically use all the "cores" available on a machine/node. Since lscpu reports 32 CPUs, I was expecting 32 workers to be spawned per node. But the machine actually has 16 physical cores (2 sockets x 8 cores), and hyperthreading makes each of them appear as two hardware threads, which is where the 32 logical CPUs come from. From my understanding, Torque launches one process per processor (i.e., per physical core in this case), so I shouldn't expect 32 workers to be spawned per node.
I also read up on NUMA support: per the Open MPI FAQ, RHEL typically requires the numactl-devel package to be installed before building Open MPI in order to support memory affinity. After installing it on each node, I am able to run an R script through Torque requesting 8 or 16 cores per node, and the computation times are quite similar. If I increase to 18 or 20 cores per node, performance drops, as expected.
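For anyone following along, this is roughly what "16 cores per node" looks like as a Torque job (a sketch; the job name and ppn value are just illustrative, and the mpirun line relies on the tm-enabled build picking up the node list from Torque, so no hostfile is needed):
#!/bin/bash
#PBS -N rmpi_test
#PBS -l nodes=2:ppn=16
cd $PBS_O_WORKDIR
mpirun -np 1 R --slave -f example7.R
(submitted with qsub)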
Below are my configure options for Torque and Open MPI, respectively:
./configure --enable-cgroups --with-hwloc-path=/usr/local --enable-autorun --prefix=/var/spool/torque
./configure --prefix=/var/nfsshare/openmpi1.10.5-tm-3 --with-tm=/var/spool/torque/
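To confirm that tm support actually made it into the Open MPI build (ompi_info ships with Open MPI; grepping for tm should list the tm plm and ras components):
$ /var/nfsshare/openmpi1.10.5-tm-3/bin/ompi_info | grep tm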
Upvotes: 0