Reputation: 3014
As part of a SLURM + Snakemake pipeline, I have a script that launches a database server as a SLURM job and another one that stops it through scancel.
Probably an HPC batch system shouldn't be used this way, but at the moment SLURM nodes are the only resources I have to give this database the quantity of RAM and CPUs I need. The corresponding job is temporary anyway (it runs as long as the pipeline needs it and then stops with everything else).
My question is whether this use case should be managed in a more appropriate/idiomatic way. In particular, a user of my scripts just told me that they see failed jobs in the sacct output. These are the server jobs, which apparently are reported as failed because of the scancel interruption. Clearly, that's a false positive and I'd like to avoid it.
Upvotes: 0
Views: 259
Reputation: 59260
Jobs that are cancelled do not necessarily appear as FAILED in the accounting. They will appear as FAILED if the process that is interrupted by the scancel command exits with a non-zero return code.
You should confirm this in the database server's documentation, but it seems that, upon receiving the SIGINT signal, the server treats it as an abnormal termination and exits with a non-zero code.
In that case, you should catch the SIGINT signal with the Bash trap builtin and run the proper command to shut down the server in the handler.
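As a rough illustration of the trap approach, here is a minimal sketch of what the job script could look like. The names start_server, stop_server, on_cancel and run_server_job are hypothetical, and sleep stands in for the real database server, so replace them with the server's actual start and shutdown commands. The handler is registered for both TERM and INT, on the assumption that scancel delivers SIGTERM unless another signal is requested.

```shell
#!/bin/bash
# Sketch only: start_server and stop_server are hypothetical placeholders.
# "sleep" stands in for the database server, "kill" for its real shutdown command.

start_server() {
    sleep 300 &
    server_pid=$!
}

stop_server() {
    # Replace with the server's documented shutdown command.
    kill -TERM "$server_pid" 2>/dev/null
    # Reap the server; its own exit status is deliberately ignored here.
    wait "$server_pid" 2>/dev/null || true
}

on_cancel() {
    stop_server
    # Exit with 0 so the script itself does not report a failure.
    exit 0
}

run_server_job() {
    # Register the handler for both signals (assumption: either may be delivered).
    trap on_cancel TERM INT
    start_server
    # "wait" is interruptible, so the handler runs as soon as a signal arrives.
    wait "$server_pid"
}
```

Calling run_server_job as the last line of the batch script would keep the job alive until it is cancelled. Depending on how the job is launched, scancel may need an option such as --batch or --full for the signal to reach the batch shell rather than only the job steps, so check that against your site's setup.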
Upvotes: 2