Reputation: 300
I don't see why my config is being ignored, even when I specify it directly with -f.
Google yields no results; is there any relevant documentation I can look at for this?
Hopefully I have just missed some critical piece of information.
After starting the slurmctld daemon on one machine, running sudo slurmd -f /usr/local/etc/slurm.conf -D -vvvvvvv
(for testing) gives the following output (relevant excerpt; note RealMemory = 3907):
slurmd: debug3: Confile = `/usr/local/etc/slurm.conf'
slurmd: debug3: Debug = 3
slurmd: debug3: CPUs = 2 (CF: 2, HW: 2)
slurmd: debug3: Boards = 1 (CF: 1, HW: 1)
slurmd: debug3: Sockets = 2 (CF: 1, HW: 2)
slurmd: debug3: Cores = 1 (CF: 2, HW: 1)
slurmd: debug3: Threads = 1 (CF: 1, HW: 1)
slurmd: debug3: UpTime = 8838 = 02:27:18
slurmd: debug3: Block Map = 0,1
slurmd: debug3: Inverse Map = 0,1
slurmd: debug3: RealMemory = 3907
slurmd: debug3: TmpDisk = 19018
slurmd: debug3: Epilog = `(null)'
slurmd: debug3: Logfile = `/var/log/slurmd.log'
slurmd: debug3: HealthCheck = `(null)'
slurmd: debug3: NodeName = node1
slurmd: debug3: Port = 6818
slurmd: debug3: Prolog = `(null)'
slurmd: debug3: TmpFS = `/tmp'
slurmd: debug3: Public Cert = `(null)'
slurmd: debug3: Slurmstepd = `/usr/local/sbin/slurmstepd'
slurmd: debug3: Spool Dir = `/var/spool/slurmd'
slurmd: debug3: Syslog Debug = 10
slurmd: debug3: Pid File = `/var/run/slurm/slurmd.pid'
slurmd: debug3: Slurm UID = 64030
slurmd: debug3: TaskProlog = `(null)'
slurmd: debug3: TaskEpilog = `(null)'
slurmd: debug3: TaskPluginParam = 0
slurmd: debug3: UsePAM = 0
Meanwhile, slurmctld repeatedly logs:
slurmctld: debug2: Processing RPC: MESSAGE_NODE_REGISTRATION_STATUS from UID=0
slurmctld: debug: Node node1 has low real_memory size (3907 < 2000000)
slurm.conf
Output from cat /usr/local/etc/slurm.conf | grep -v "#" (note RealMemory=2000000, among other ignored configuration details):
ClusterName=scluster_0
SlurmctldHost=controller
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=0
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageType=accounting_storage/none
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
NodeName=node[1-2] CPUs=2 RealMemory=2000000 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=pdefault Nodes=ALL Default=YES MaxTime=INFINITE State=UP
The configuration on both systems (the one running slurmctld and the one running slurmd) is identical.
I also have cgroup_allowed_devices.conf and cgroup.conf, if those are relevant.
Upvotes: 0
Views: 822
Reputation: 59072
@Marcus Boden is correct.
The RealMemory = 3907 in the slurmd output is what Slurm discovers on the server, not what it reads from the configuration file.
It finds 3907 MB of RAM there, compares that with the 2000000 it finds in the configuration file, and complains:
slurmctld: debug: Node node1 has low real_memory size (3907 < 2000000)
In other words, it found about 4 GB of RAM while the configuration told it to expect about 2 TB.
Check the exact amount of memory Linux sees on the server with the free command, and make sure it matches the specification you believe the machine to have.
See more information here for instance.
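As a rough cross-check alongside free, the total memory can also be read from /proc/meminfo and converted to megabytes, the unit RealMemory uses. This is only a sketch: the helper name mem_total_mb and the sample MemTotal value are made up for illustration (the value was chosen so that it works out to the 3907 from the question), and the number Slurm itself detects may differ slightly. Running slurmd -C on the node prints the hardware configuration Slurm detects, which is the more authoritative source.

```python
import re


def mem_total_mb(meminfo_text: str) -> int:
    """Return the MemTotal value from /proc/meminfo-style text, in whole MiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    # /proc/meminfo reports kB; RealMemory in slurm.conf is in megabytes.
    return int(match.group(1)) // 1024


# Hypothetical sample text, not taken from the asker's machine.
sample = "MemTotal:        4001024 kB\nMemFree:          123456 kB\n"
print(mem_total_mb(sample))
```

On a real node, pass the contents of /proc/meminfo instead of the sample string, and set RealMemory to a value no larger than the result.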
Upvotes: 1
Reputation: 1685
My guess is the following: slurmd is reading the config file correctly. What happens is that Slurm cross-checks the configuration against the hardware it actually detects. According to the config the node should have 2000000 RealMemory, but Slurm only finds 3907 when it looks at the hardware. This mismatch is reported and the node is drained.
This behaviour makes sure you don't have a faulty DIMM in your server without noticing.
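The cross-check described above can be mimicked in a few lines to test a NodeName line before restarting the daemons. This is an illustrative sketch, not Slurm's actual code: the function names parse_node_line and check_real_memory are hypothetical, and the fallback of 1 MB for a missing RealMemory is an assumption about Slurm's default.

```python
def parse_node_line(line: str) -> dict:
    """Split a slurm.conf NodeName line into a dict with lowercase keys."""
    pairs = (field.split("=", 1) for field in line.split() if "=" in field)
    return {key.lower(): value for key, value in pairs}


def check_real_memory(node_line: str, detected_mb: int) -> str:
    """Compare the configured RealMemory (MB) with the detected amount (MB)."""
    node = parse_node_line(node_line)
    # Assumption: treat a missing RealMemory as 1 MB.
    configured_mb = int(node.get("realmemory", 1))
    if detected_mb < configured_mb:
        return f"low real_memory size ({detected_mb} < {configured_mb})"
    return "ok"


# The NodeName line and the detected value are the ones from the question.
node_line = ("NodeName=node[1-2] CPUs=2 RealMemory=2000000 Sockets=1 "
             "CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN")
print(check_real_memory(node_line, 3907))
```

With the values from the question, the check reports the same mismatch slurmctld logs; lowering RealMemory to at most the detected 3907 makes it pass.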
Upvotes: 2