Reputation: 477
Because of limitations in how Matlab utilizes resources on a computing cluster, I want to create several jobs, each of which uses all of the cores on a given node. I can use the --array option in conjunction with other parameters to make sure that I get each job on a separate node. However, for some reason the Slurm scheduler on our cluster is putting my jobs on nodes which are already in use, even though I'm trying to max out the cores on a given node with the -c option:
#SBATCH --array=1-2
#SBATCH -t 24:00:00
#SBATCH -n 1
#SBATCH -c 20
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH --mem-per-cpu 4000
module add ~/matlab/2014a
srun matlab -nodisplay -r "myfun($SLURM_ARRAY_TASK_ID);quit"
Using the --exclusive option doesn't seem to change anything. I've been having the same problem with single tasks as well, and my workaround has been to check which nodes aren't in use and request those specifically with the --nodelist option. Is there a way to use --array in conjunction with --nodelist so that the jobs and the nodes in the list are matched one-to-one? Right now Slurm is trying to use all the nodes for each job.
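For reference, this is roughly what my current workaround looks like on the submit side, just to illustrate; the partition name batch and the script name submit_matlab.sh are placeholders, and the script takes the index as $1 instead of $SLURM_ARRAY_TASK_ID:
i=1
# request one whole idle node per job, explicitly by name
for node in $(sinfo -h -N -p batch -t idle -o "%N"); do
    sbatch --nodelist="$node" --exclusive submit_matlab.sh "$i"
    i=$((i+1))
done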
Upvotes: 1
Views: 2503
Reputation: 59110
Three possibilities:
Either the nodes have ghost jobs running outside of Slurm's control, whether because of ill-terminated previous jobs or because of unfair cluster usage by other users. Since Slurm does not check the load of a node before allocating it, you can end up in the situation you describe.
Or, the Shared parameter of slurm.conf could be set to Force, which denies you the use of --exclusive, and hyperthreading could be enabled, leading Slurm to consider that each node has 40 CPUs.
Or the Shared parameter of slurm.conf could be set to something other than Exclusive while the nodes are in two distinct partitions, a configuration that leads to node over-subscription.
Use the scontrol show config command to get more information about the configuration.
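For instance, something along these lines will show the relevant settings and the current state of a node (the node name cn01 is just a placeholder):
scontrol show config | grep -Ei 'oversubscribe|shared|selecttype'
scontrol show partition        # Shared/OverSubscribe setting of each partition
scontrol show node cn01        # CPUTot, CPUAlloc, CPULoad, ThreadsPerCore
If CPULoad is much higher than CPUAlloc on a node, that hints at processes running outside of Slurm's control.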
Upvotes: 1