Reputation: 5
I've been using a cluster of 200 nodes with 32 cores each for simulating stochastic processes.
I have to run around 10 000 simulations of the same system, so I run the same simulation (with different RNG seeds) on the 32 cores of one node until all 10 000 simulations are done. (Each simulation is completely independent of the others.)
Depending on the seed, some of the simulations take much more time than others, so after a while I usually still have the full node allocated to me but with only one core running (i.e. I am unnecessarily occupying 31 cores).
In my sbatch script I have this:
# Specify the number of nodes (--nodes=/-N) and the number of cores per node (--ntasks-per-node=) to be used
#SBATCH -N 1
#SBATCH --ntasks-per-node=32
...
cat list.dat | parallel --colsep '\t' -j 32 ./main{} > "Results/A.out"
which runs 32 instances of ./main at a time on the same node until all lines of list.dat (10 000 lines) have been used.
Is there a way to free these unused cores for other jobs? And is there a way to send these 32 tasks to arbitrary nodes, i.e. one job submission using a maximum of 32 cores on (potentially) different nodes (whatever is free at the moment)?
Thank you!
Upvotes: 1
Views: 219
Reputation: 59090
If the cluster is configured to share compute nodes between jobs, one option is to submit a job array of 10 000 jobs. The submission script would look like this (untested):
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --array=1-10000
cat list.dat | sed -n "${SLURM_ARRAY_TASK_ID} p" | xargs -I{} ./main{} > "Results/A.out_${SLURM_ARRAY_TASK_ID}"
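Assuming the script is saved as, say, job_array.sh (the name is a placeholder), a single sbatch call submits all 10 000 tasks, and each task writes its own result file:
sbatch job_array.sh    # submits the whole array in one go
ls Results/            # A.out_1, A.out_2, ... appear as tasks complete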
Every simulation would then be scheduled independently of the others and could use any free core on the cluster, without leaving cores allocated but unused.
Compared with submitting 10 000 independent jobs, the job array lets you manage all the jobs with a single command. Job arrays also put much less load on the scheduler than the same number of individual jobs.
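For instance, assuming the array was submitted under job ID 123456 (a placeholder), the standard SLURM commands operate on the whole array at once:
squeue -j 123456             # show the state of every array task
scancel 123456               # cancel the entire array
scancel 123456_[5000-10000]  # cancel only a range of array tasks
sacct -j 123456              # accounting information for all tasks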
If there is a limit on the number of jobs allowed in a job array, you can simply pack multiple simulations into the same job, either sequentially or in parallel as you are doing at the moment, but with maybe 8 or 12 cores:
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=12
#SBATCH --array=1-10000:100
cat list.dat | sed -n "${SLURM_ARRAY_TASK_ID},$((SLURM_ARRAY_TASK_ID+99)) p" | parallel --colsep '\t' -j 12 ./main{} > "Results/A.out_${SLURM_ARRAY_TASK_ID}"
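To make the indexing concrete: with --array=1-10000:100 the array task IDs are 1, 101, 201, ..., 9901, and the task with SLURM_ARRAY_TASK_ID=201, for example, effectively runs
cat list.dat | sed -n "201,300 p" | parallel --colsep '\t' -j 12 ./main{} > "Results/A.out_201"
i.e. it processes lines 201 to 300 of list.dat, 12 at a time, and writes that chunk's output to its own file.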
Upvotes: 1