Reputation: 31
I am attempting to create my own computer cluster (perhaps a Beowulf, though throwing that term around willy-nilly apparently isn't cool) and have installed Slurm as my scheduler. Everything appears fine when I run sinfo:
danny@danny5:~/Cluster/test$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 5 idle danny[1-5]
danny@danny5:~/Cluster/test$
However, if I try to submit a job using the following script:
danny@danny5:~/Cluster/test$ cat script.sh
#!/bin/bash -l
#SBATCH --job-name=JOBNUMBA0NE
#SBATCH --time=00-00:01:00
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=100
#SBATCH -o stdout
#SBATCH -e stderr
#SBATCH --mail-type=END
#SBATCH --mail-user=dkweiss@wesleyan.edu
gfortran -O3 -i8 0-hc1.f
./a.out
I receive a lovely Submitted batch job 6, but nothing appears in squeue, and none of the expected output files materialize (the a.out executable doesn't even appear). Here is the associated output of scontrol show partition:
danny@danny5:~/Cluster/test$ scontrol show partition
PartitionName=debug
AllocNodes=ALL AllowGroups=ALL Default=YES
DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=1 MaxCPUsPerNode=UNLIMITED
Nodes=danny[1-5]
Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
State=UP TotalCPUs=8 TotalNodes=5 SelectTypeParameters=N/A
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
Any ideas?
Upvotes: 3
Views: 10980
Reputation: 61
This happened to me when the log folder did not exist (it had not been created beforehand). Slurm does not handle directory creation for you.
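For example, a minimal sketch (the logs/ directory name here is hypothetical; the point is that any directory named in -o or -e must already exist before submission):

# In the shell, before submitting (Slurm will not create this for you):
mkdir -p logs
# And in the batch script, point the output options at it:
#SBATCH -o logs/stdout
#SBATCH -e logs/stderr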
Upvotes: 5
Reputation: 109
I had the same problem. I suppose there could be more reasons why jobs just disappear without any feedback, but in my case Slurm was simply missing privileges. Therefore: try running sbatch with sudo; if it succeeds, this is probably the same issue.
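That is, as a quick test (using the script from the question):

sudo sbatch script.sh    # submit with elevated privileges
squeue                   # if the job now shows up here, it was a permissions problem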
Upvotes: 3
Reputation: 59330
I have seen that behaviour when the user submitting the job (here danny) does not exist with the same UID on the compute nodes. Make sure id danny reports the same output on all Slurm-related nodes. You should look for confirmation in the compute node's Slurm log file.
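For example, a quick check over the nodes from the sinfo output above (this sketch assumes passwordless ssh access to each node):

for host in danny1 danny2 danny3 danny4 danny5; do
    ssh "$host" id danny    # the UID/GID must match on every node
done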
Upvotes: 2