Reputation: 7819
All my slurm jobs fail with exit code 0:53 within two seconds of starting.
When I look at job details with scontrol show jobid <JOBID>, it doesn't show anything suspicious. When I look at the files that stdout and stderr are redirected to, there is nothing there. I couldn't find anything on the listed signal 53.
Upvotes: 7
Views: 4686
Reputation: 56
I can add that this error is always tied to SLURM lacking write access to the directory it should write the output files to. I had the same problem and found that sbatch doesn't report a proper error. You can replace sbatch with srun to see the exact error every time.
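For example (a minimal sketch; the script name and output path are illustrative, not from the answer):
# sbatch accepts the job, which then dies silently with 0:53
sbatch submit.sh
# srun with the same output path prints the open failure directly to
# your terminal (the exact wording depends on the Slurm version)
srun --output=log/%j.out hostname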
Upvotes: 2
Reputation: 67
I've had the same issue as the OP, and in my case the log directory existed; however, it was on a filesystem that was mounted read-only. To cite the entry from the ZIH HPC Compendium:
When redirecting stdout and stderr into a file using --output= and --stderr=, make sure the target path is writeable on the compute nodes, i.e., it may not point to a read-only mounted filesystem like /projects.
https://compendium.hpc.tu-dresden.de/jobs_and_resources/slurm/
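A quick check (a sketch; the path is illustrative): try to create a file in the target directory from a compute node rather than from the login node:
# fails with a "Read-only file system" style error if the mount is read-only there
srun touch /projects/mylogs/.write_test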
Upvotes: 2
Reputation: 31
I wanted to add that while this error has happened to me when the output directory does not exist, the same thing happens if you exceed your disk quota.
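How you check your quota is site-specific, but on many clusters something like the following works (a sketch, not tied to any particular cluster):
# generic per-user quota report on filesystems with quotas enabled
quota -s
# on Lustre filesystems, e.g. for your home directory
lfs quota -h $HOME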
Upvotes: 3
Reputation: 7819
It turns out that the directory containing the files that slurm was supposed to write stdout and stderr to didn't exist.
In my submit.sh
script, the relevant lines were:
#SBATCH --output=log/%j.out # where to store the output ( %j is the JOBID )
#SBATCH --error=log/%j.err # where to store error messages
The log directory in the current working directory from which I was submitting the job didn't exist. Once I created the directory, slurm jobs no longer failed with 0:53.
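One way to guard against this (a sketch, not part of my original workflow): create the directory before submitting, since the output files are opened before the batch script starts executing, so a mkdir inside the script itself comes too late.
mkdir -p log && sbatch submit.sh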
My slurm version is 22.05.2. Per this answer, slurm no longer fails silently when the output directory doesn't exist from version 23.02 onwards. This seems to have been reported in this issue.
Upvotes: 5