Reputation: 406
I have a program written by someone else that uses OpenMP. I am running it on a cluster that uses Slurm as its job manager. Despite setting OMP_NUM_THREADS=72 and properly requesting 72 cores for the job, the job is only using four cores.
I have already used scontrol show job <job_id> --details
to verify that there are 72 cores assigned to the job. I have also remoted into the node that the job is running on and used htop to inspect it. It was running 72 threads, all on four cores. It is worth noting that this is on an SMT4 power9 cpu, meaning that each physical core executes 4 simultaneous threads. Ultimately, it looks like openMP is putting all threads on one physical core. This is further complicated by the fact that this is an IBM system. I can't seem to find any useful documentation on more fine control of the openMP environment. Everything I find is for Intel.
I have also tried using taskset to manually change the affinity. This worked as intended and moved one of the threads to an unused core. The program continued to work as intended after this.
I could theoretically write a script to find all of the threads and call taskset to assign them to cores in a logical way, but I am afraid to do this. It seems like a bad idea to me. It would also take a while.
I guess my main question would be, is this a Slurm problem, an openMP problem, an IBM problem or a user error? Is there some environment variable I don't know about that I need to set? Will it break Slurm if I manually call taskset using a script? I would use scontrol to figure out which cpus are assigned to the job if I did that. I don't want to anger the people who run the cluster by messing things up though.
Here is the submission script. I can't include any of the actual running code due to license issues though. I'm hoping this will just be a simple matter of fixing an environment variable. The MPI_OPTIONS variables were recommended by the guy who administers the system. If by some chance someone here has worked with the ENKI cluster before, that's where this is running.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module load openmpi/3.1.3/2019
module load pgi/2019
export OMP_NUM_THREADS=72
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh
Upvotes: 1
Views: 734
Reputation: 406
Edit: Fix resulted in a 7x speedup when using 72 cores, vs. just running on 4 cores. Considering the nature of the calculations being run, this is pretty good.
Edit 2: Fix resulted in a 17x speedup when using 160 vs. just running on 4 cores.
This might not work for everyone, but I have a really hacky solution. I wrote a python script that uses psutil to find all threads that are children of the running process and set their affinity manually. This script uses scontrol to figure out which cpus are assigned to the job and uses taskset to force the threads to distribute across those cpus.
So far the process is running a lot faster. I'm sure that forcing CPU affinity isn't the best way to do it, but its a lot better than not using the available resources at all.
Here is the basic idea behind the code. The program I am running is called pgmc, hence the variable names. You will need to create an anaconda environment with psutil installed if you are running on a system like mine.
import psutil
import subprocess
import os
import sys
import time
# Gets the id for the current job.
def get_job_id():
return os.environ["SLURM_JOB_ID"]
# Returns a list of processors assigned to the job and the total number of cpus
# assigned to the job.
def get_proc_info():
run_str = 'scontrol show job %s --details'%get_job_id()
stdout = subprocess.getoutput(run_str)
id_spec = None
num_cpus = None
chunks = stdout.split(' ')
for chunk in chunks:
if chunk.lower().startswith("cpu_ids"):
id_spec = chunk.split('=')[1]
start, stop = id_spec.split('-')
id_spec = list(range(int(start), int(stop) + 1))
if chunk.lower().startswith('numcpus'):
num_cpus = int(chunk.split('=')[1])
if id_spec is not None and num_cpus is not None:
return id_spec, num_cpus
raise Exception("Couldn't find information about the allocated cpus.")
if __name__ == '__main__':
# Before we do anything, make sure that we can get the list of cpus
# assigned to the job. Once we have that, run the command line supplied.
cpus, cpu_count = get_proc_info()
if len(cpus) != cpu_count:
raise Exception("CPU list didn't match CPU count.")
# If we successefully got to here, run the command line.
program_name = ' '.join(sys.argv[1:])
pgmc = subprocess.Popen(sys.argv[1:])
time.sleep(10)
pid = [proc for proc in psutil.process_iter() if proc.name() == "your program name here"][0].pid
# Now that we have the pid of the pgmc process, we need to get all
# child threads of the process.
pgmc_proc = psutil.Process(pid)
pgmc_threads = list(pgmc_proc.threads())
# Now that we have a list of threads, we loop over available cores and
# assign threads to them. Once this is done, we wait for the process
# to complete.
while len(pgmc_threads) != 0:
for core_id in cpus:
if len(pgmc_threads) != 0:
thread_id = pgmc_threads[-1].id
pgmc_threads.remove(pgmc_threads[-1])
taskset_string = 'taskset -cp %i %i'%(core_id, thread_id)
print(taskset_string)
subprocess.getoutput(taskset_string)
else:
break
# All of the threads should now be assigned to a core.
# Wait for the process to exit.
pgmc.wait()
print("program terminated, exiting . . . ")
Here is the submission script used.
wrk_path=${PWD}
cat >slurm.sh <<!
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time 2:00:00
#SBATCH -o log
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=72
#SBATCH -c 72
#SBATCH -J Cu-A-E
#SBATCH -D $wrk_path
cd $wrk_path
module purge
module load openmpi/3.1.3/2019
module load pgi/2019
module load anaconda3
# This is the anaconda environment I created with psutil installed.
conda activate psutil-node
export OMP_NUM_THREADS=72
# The two MPI_OPTIONS lines are specific to this cluster if I'm not mistaken.
# You probably won't need them.
MPI_OPTIONS="--mca btl_openib_if_include mlx5_0"
MPI_OPTIONS="$MPI_OPTIONS --bind-to socket --map-by socket --report-bindings"
time python3 affinity_set.py mpirun $MPI_OPTIONS ~/bin/pgmc-enki > out.dat
!
sbatch slurm.sh
My main reason for including the submission script is to demonstrate how the python script is used. More specifically, you call it, with your real job as an argument.
Upvotes: 1