Reputation: 20415
Using sinfo, I see that 3 nodes are in the drain state:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 3 drain node[10,11,12]
Which command should I use to undrain these nodes?
Upvotes: 44
Views: 119649
Reputation: 734
While there is already an accepted answer, I would like to mention that running:
scontrol: update NodeName=nodename State=DOWN Reason="undraining"
scontrol: update NodeName=nodename State=RESUME
returns "slurm_update error: Invalid node state specified" with SLURM 21.08.03 on EndeavourOS 2021.08.27. The solution that worked for me is:
scontrol: update NodeName=nodename State=UNDRAIN
without needing to set the node DOWN first.
Upvotes: 5
Reputation: 41
Another reason a node ends up in the DRAIN state is that the facts about the system do not match those declared in the /etc/slurm/slurm.conf file. For example, if slurm.conf declares that a node has 4 GPUs but the slurm daemon only finds 3 of them, it will mark the node as "drain" because of the mismatch. Likewise, if the node is declared in slurm.conf to have 128G of memory and the slurm daemon only finds 96G, it will also set the state to "drain".
The reason for the mismatch is displayed by the 'scontrol show node nodename' command, in the Reason= field on the last line of output.
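To make the reason easier to spot, here is a minimal sketch that filters the Reason= field out of the scontrol output. The helper name node_reason is hypothetical (it is not part of Slurm), and the sketch assumes the field appears on its own indented line, as it does in the output described above.

```shell
# Hypothetical helper (not part of Slurm): print only the value of the
# Reason= field from `scontrol show node` output read on stdin.
node_reason() {
    sed -n 's/^[[:space:]]*Reason=//p'
}

# Intended usage on a cluster (node10 is the node name from the question):
#   scontrol show node node10 | node_reason
# Related commands that may help when diagnosing a mismatch:
#   sinfo -R     # list the reason for every down or drained node
#   slurmd -C    # run on the node itself: print the hardware slurmd detects
```

Comparing the slurmd -C output with the node's line in slurm.conf should show which declared value (memory, GPUs, CPUs) does not match.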
Upvotes: 4
Reputation: 5077
If no jobs are currently running on the node:
scontrol update nodename=node10 state=idle
If jobs are running on the node:
scontrol update nodename=node10 state=resume
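With several drained nodes, as in the question, the resume command can be applied to all of them in one pass. This is a sketch only: the function name resume_drained is made up for illustration, and it assumes every drained node is safe to resume, so check the Reason of each node first.

```shell
# Hypothetical helper: resume every node that sinfo reports as drained.
resume_drained() {
    # -h: no header line, -t drain: drained/draining nodes only,
    # -o '%N': print just the node list (e.g. node[10-12]).
    # sort -u removes duplicates when a node belongs to several partitions.
    sinfo -h -t drain -o '%N' | sort -u | while read -r hostlist; do
        if [ -n "$hostlist" ]; then
            # scontrol accepts a hostlist expression as the node name.
            scontrol update nodename="$hostlist" state=resume
        fi
    done
}

# Usage (needs Slurm administrator rights):
#   resume_drained
```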
Upvotes: 34
Reputation: 316
If you set the node to DOWN, all jobs running on it will be killed.
Set the node to RESUME instead.
Upvotes: 16
Reputation: 20415
Found an approach: enter the scontrol interpreter (type scontrol on the command line) and then run
scontrol: update NodeName=node10 State=DOWN Reason="undraining"
scontrol: update NodeName=node10 State=RESUME
Then
scontrol: show node node10
displays amongst other info
State=IDLE
Update: some of these nodes went back to the DRAIN state. I noticed their root partition was full after running e.g. show node a10, which showed Reason=SlurmdSpoolDir is full. On Ubuntu, I therefore ran sudo apt-get clean to remove the contents of /var/cache/apt, and also gzipped some /var/log files.
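To confirm which filesystem is actually full, it can help to look up where SlurmdSpoolDir points and check the free space there. A minimal sketch, assuming the usual "Key = value" layout of scontrol show config; the helper name spool_dir is hypothetical.

```shell
# Hypothetical helper: print the SlurmdSpoolDir path from the running
# Slurm configuration (lines look like "SlurmdSpoolDir = /some/path").
spool_dir() {
    scontrol show config | awk '$1 == "SlurmdSpoolDir" { print $3 }'
}

# Usage on the affected node:
#   df -h "$(spool_dir)"    # free space on the spool directory's filesystem
#   df -h /                 # free space on the root partition
```

Once space has been freed, the node still has to be resumed with one of the scontrol commands above.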
Upvotes: 44