Reputation: 524
I'm running simulations on a machine that uses SLURM. The maximum wall time I can set is 24 hours, but my simulations will take much longer (approx. 1 week or so). I know that, in principle, I could put on hold a job that restarts my simulation just after the previous one has ended by simply running `sbatch --dependency=afterok:xxxxxxxx batch_file`. My problem is that if my simulation is killed because of the wall time, the `afterok` dependency will give me a `DependencyNeverSatisfied` error, and the reason why this happens is explicitly stated in the SLURM documentation:
afterok:job_id[:jobid...]
This job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
If the machine kills my job because the simulation exceeds the wall time, the job won't end with exit code zero (at least, this has been my experience so far). Unfortunately, I cannot tune the length of my simulations so that they end exactly within 24 hours. So here's my question: is there a way to tell SLURM "Start job `xxx` only after the job it depends on, `yyy`, was killed solely because its execution time exceeded the wall time"? Something like an `afterwalltime` flag, if you see what I mean. Note that `afterany` is not an option because it could lead to dangerous behavior (simulations might try to restart even after an error occurred and corrupt the output files).
Upvotes: 1
Views: 1173
Reputation: 5762
One possibility is the `afternotok`/`afterany` dependencies, but you already discarded them because they would also trigger after other kinds of job termination. You could, however, add a check at the beginning of the script to verify that the files are OK and continue only in that case.
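A minimal sketch of such a check, to be placed at the top of the restart script. The file names `checkpoint.dat` and `FAILED` are hypothetical: the sketch assumes the simulation writes its restart data to a checkpoint file and that the job script creates an empty marker file whenever the solver reports an error.

```shell
#!/bin/bash
# Guard for the top of a restart job script.
# Assumptions (hypothetical names, not part of SLURM):
#   checkpoint.dat  restart data written by the simulation
#   FAILED          empty marker created by the job script on a solver error

checkpoint_ok() {
    local dir
    dir="$1"
    # Refuse to restart if an earlier job flagged an error.
    [ ! -e "$dir/FAILED" ] || return 1
    # Require a non-empty checkpoint file to restart from.
    [ -s "$dir/checkpoint.dat" ] || return 1
    return 0
}

# Typical use inside the batch script:
#   if ! checkpoint_ok "$SLURM_SUBMIT_DIR"; then
#       echo "checkpoint missing or run flagged as failed, not restarting" >&2
#       exit 1
#   fi
```

With a guard like this, a job that is started after a genuine failure exits immediately instead of overwriting the output, while a job started after a wall-time kill finds a usable checkpoint and continues.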
The most common way to deal with these situations is to prepare and run a simulation that you expect to last 23 hours, request a wall time of 24 hours, and launch as many jobs as needed (linked by `afterok` dependencies) to get your final results.
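The chained submission could be scripted roughly as follows. This is only a sketch: `submit_chain` is a hypothetical helper, and it assumes that `batch_file` resumes from the latest checkpoint and stops on its own before the 24-hour limit so that each segment exits with code zero.

```shell
#!/bin/bash
# Submit a chain of jobs in which each job starts only after the previous
# one has finished with exit code zero (afterok).
# Assumption: batch_file restarts from the most recent checkpoint.

submit_chain() {
    local batch_file n_jobs jobid i
    batch_file="$1"
    n_jobs="$2"
    # --parsable makes sbatch print only the job ID (plus ";cluster" if present).
    jobid=$(sbatch --parsable "$batch_file") || return 1
    jobid=${jobid%%;*}
    echo "$jobid"
    i=2
    while [ "$i" -le "$n_jobs" ]; do
        jobid=$(sbatch --parsable --dependency=afterok:"$jobid" "$batch_file") || return 1
        jobid=${jobid%%;*}
        echo "$jobid"
        i=$((i + 1))
    done
}

# Example: a run of about one week split into 8 segments of at most 23 h each.
#   submit_chain batch_file 8
```

If one segment fails, the remaining jobs stay pending with reason `DependencyNeverSatisfied` and can be removed with `scancel`.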
Upvotes: 2