Reputation: 55
I am a non-paying user on a computing cluster that uses SLURM.
Occasionally, I've had multiple long-running jobs that clogged up the queue for paying users, and because of this the admin has cancelled some of my jobs. I now have a cap on the number of nodes available to me. While I don't dispute the fairness of this arrangement, it is a problem for me in terms of getting work done, especially because I see free nodes that are not running any jobs while my own jobs sit waiting to get through the node cap.
With that as background info, here are my two questions:
Isn't it possible for the admin to suspend and then resume jobs: a single job, all jobs of a user, or a set of jobs? Is this suspend / resume onerous from the admin's perspective?
I suppose it should be possible to create a list of paying vs. non-paying users. Then, when a paying user submits with sbatch, SLURM could automatically suspend the non-paying user's job or jobs and resume them once the paying user's jobs have completed. Is this even possible? If yes, is it outside the skill set of regular SLURM / farm admins?
Could someone please suggest any other solutions (if what I have asked above is unreasonable or absurd)?
Thank you!
Upvotes: 1
Views: 261
Reputation: 59360
The admin can suspend a job with scontrol suspend jobid
and later resume it with scontrol resume jobid
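To cover the "all jobs of a user" case from the question, something like the following might work. It is only a sketch: the function names suspend_user_jobs and resume_user_jobs and the SQUEUE / SCONTROL override variables are made up for illustration, and it assumes the caller has operator or admin rights on the cluster.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Slurm): suspend or resume every job
# that belongs to one user. SQUEUE and SCONTROL can be overridden, e.g.
# to point at wrappers or to do a dry run; by default the real tools are used.
SQUEUE=${SQUEUE:-squeue}
SCONTROL=${SCONTROL:-scontrol}

suspend_user_jobs() {
    user=$1
    # -h: no header, -u: only this user, -t RUNNING: only running jobs,
    # -o %i: print just the job id
    for jobid in $("$SQUEUE" -h -u "$user" -t RUNNING -o %i); do
        "$SCONTROL" suspend "$jobid"
    done
}

resume_user_jobs() {
    user=$1
    # Same listing, but restricted to jobs that are currently suspended
    for jobid in $("$SQUEUE" -h -u "$user" -t SUSPENDED -o %i); do
        "$SCONTROL" resume "$jobid"
    done
}
```

After sourcing the file, the admin would call suspend_user_jobs someuser and later resume_user_jobs someuser. Note that a suspended job keeps its memory on the node, so the freed cores are only useful to other jobs if there is enough memory left.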
For the automatic version, the keywords are 'QOS' and 'preemption'. Typically a QOS is created for the paying users that has preemptive rights over the normal QOS. When a paying user's job needs the resources, jobs of the non-paying users can then be cancelled, checkpointed, requeued, or suspended.
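As a rough idea of what such a setup could look like on the admin side, here is a sketch. The QOS name paying and the user name alice are placeholders, the requeue mode is just one of the possible choices, and the exact options depend on the Slurm version and on how accounting is configured on the cluster, so treat it as a starting point for reading the documentation and not as a tested recipe.

```shell
# Assumed slurm.conf settings (edited by the admin, then reconfigured):
#   PreemptType=preempt/qos
#   PreemptMode=REQUEUE
#   AccountingStorageEnforce=qos   # assumption: QOS limits are enforced

# Create a QOS for paying users (name "paying" is a placeholder)
sacctmgr -i add qos paying

# Let jobs in the "paying" QOS preempt jobs running in the "normal" QOS
sacctmgr -i modify qos paying set Preempt=normal

# Give a paying user (placeholder name "alice") access to that QOS
sacctmgr -i modify user alice set qos+=paying
```

Paying users would then submit with sbatch --qos=paying, while everyone else stays in the normal QOS and gets preempted only when a paying job actually needs the nodes.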
Upvotes: 0