Reputation: 1166
We set all our SLURM nodes to "drain" in preparation for maintenance windows, after which all new jobs stay pending until the nodes resume. We do this well before the maintenance window, though, so that all running jobs can finish. That wastes quite a bit of cluster time. Is there a way to specify that nodes will only accept batch jobs with a --time=x argument such that the job start time + x would be earlier than a given time? For example, if a maintenance outage is scheduled for Friday night, jobs reaching the top of the queue on Wednesday with --time=2-0 would run, but jobs submitted on Thursday with --time=2-0 would not.
Upvotes: 1
Views: 623
Reputation: 5357
You should probably create a reservation covering all the nodes. The following command (untested) should do the trick:
scontrol create reservation reservationname="maintenance1" start=03/31-08:00 Duration=10-00 Nodes=ALL Users=root
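This will create a reservation on all the nodes, usable only by root, starting on March 31st at 08:00 and lasting 10 days. Because Slurm will not start a job whose time limit would overlap the reservation, jobs short enough to finish before it begins keep being scheduled, while longer ones stay pending until it ends. This is also good practice because, once the maintenance is finished, you can submit some jobs to the reservation to test that the cluster is working as expected.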
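To make the connection to the question explicit: with a reservation in place, a job is only started if its start time plus its --time limit ends before the reservation begins. Below is a rough sketch of that check in bash. The helper names (limit_to_seconds, fits_before) are hypothetical and not part of Slurm, the epochs are made up, and only the days-hours form of --time (such as 2-0) is handled.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, NOT part of Slurm: they only illustrate the rule the
# scheduler applies around a reservation. Only "days-hours" limits are parsed.

# Convert a days-hours time limit such as "2-0" to seconds.
limit_to_seconds() {
    local days=${1%%-*} hours=${1##*-}
    echo $(( (10#$days * 24 + 10#$hours) * 3600 ))
}

# Succeed if a job starting at epoch $1 with limit $2 ends by epoch $3.
fits_before() {
    local start=$1 limit=$2 resv_start=$3
    (( start + $(limit_to_seconds "$limit") <= resv_start ))
}

# Example with made-up times: the outage begins 60 hours after t=0.
resv_start=$(( 60 * 3600 ))
fits_before 0 2-0 "$resv_start" \
    && echo "48h job starting now: runs"
fits_before $(( 24 * 3600 )) 2-0 "$resv_start" \
    || echo "48h job starting a day later: stays pending"
```

In practice you do not need to run anything like this yourself; the scheduler does the comparison using each job's time limit, which is why accurate --time values matter.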
You can remove a reservation with:
scontrol delete reservationname="maintenance1"
Upvotes: 3