Cornelius Roemer

Reputation: 7819

All slurm jobs fail silently with exit code 0:53

All my slurm jobs fail with exit code 0:53 within two seconds of starting.

When I look at the job details with scontrol show jobid <JOBID>, nothing looks suspicious.

When I look at the files that stdout and stderr write to, there is nothing there.

I couldn't find anything on the listed signal 53.

Upvotes: 7

Views: 4686

Answers (4)

holegar

Reputation: 56

I can add that, in my experience, this error is always tied to SLURM being unable to write its output files to the target directory. I had the same problem and found that sbatch doesn't report a proper error. If you replace sbatch with srun, you can see the exact error each time.

Upvotes: 2

Brain Damage

Reputation: 67

I had the same issue as the OP. In my case the log directory existed but was on a read-only filesystem. To cite the entry from the ZIH HPC Compendium:

When redirecting stderr and stderr into a file using --output= and --stderr=, make sure the target path is writeable on the compute nodes, i.e., it may not point to a read-only mounted filesystem like /projects.

https://compendium.hpc.tu-dresden.de/jobs_and_resources/slurm/
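One way to rule this out before submitting is to probe the log directory by actually creating a file in it. This is only a sketch: `can_write_logs` is a hypothetical helper name, not a Slurm command, and it tests the node it runs on, so it says something about the compute nodes only if they mount the same filesystem in the same way.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): succeed only if a file can
# actually be created in the given directory.
can_write_logs() {
    local dir="$1" probe
    if [ ! -d "$dir" ]; then
        echo "log directory '$dir' does not exist" >&2
        return 1
    fi
    # Creating a real file is a stricter check than `[ -w "$dir" ]` alone.
    if ! probe=$(mktemp "$dir/.write_probe.XXXXXX" 2>/dev/null); then
        echo "cannot create files in '$dir'" >&2
        return 1
    fi
    rm -f "$probe"   # clean up the probe file
    return 0
}
```

It could be used as `can_write_logs log && sbatch submit.sh`, so the job is only submitted when the probe succeeds.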

Upvotes: 2

I wanted to add that while this error has happened to me if the directory does not exist, the same thing happens if you exceed your quota.

Upvotes: 3

Cornelius Roemer

Reputation: 7819

It turns out that the directory containing the files that slurm was supposed to write stdout and stderr to didn't exist.

In my submit.sh script, the relevant lines were:

#SBATCH --output=log/%j.out                 # where to store the output ( %j is the JOBID )
#SBATCH --error=log/%j.err                  # where to store error messages

The log directory didn't exist in the working directory from which I was submitting the job. Once I created the directory, slurm jobs no longer failed with 0:53.
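In practice the fix is to create the directory before calling sbatch. A minimal sketch, assuming the job is submitted from the directory that should contain log/; `ensure_log_dir` is a hypothetical helper name, and the sbatch call is shown only as a comment:

```shell
#!/bin/bash
# Hypothetical helper: create the directory that the #SBATCH --output and
# --error paths point into. mkdir -p does nothing if it already exists.
ensure_log_dir() {
    local dir="${1:-log}"   # defaults to "log", as in log/%j.out
    mkdir -p "$dir"
}

# Typical use, run from the submission directory:
#   ensure_log_dir log && sbatch submit.sh
```

Because mkdir -p is idempotent, the helper is safe to call before every submission.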

My slurm version is 22.05.2. Per this answer, slurm no longer errors silently when the output directory doesn't exist from version 23.02 upwards. Seems to have been reported in this issue.

Upvotes: 5
