Reputation: 730
I am trying to set up a Slurm configuration file on Ubuntu 20.04. I have tried several things and searched for the errors on other websites (link1, link2, link3), on the Slurm website, and in another similar question on SO.
Given the following information about my computer, what is the minimum information that must be provided in the slurm.conf file?
General information about my computer:
RAM: 125.5 GB
CPU: 1-20 (Intel® Xeon(R) CPU E5-2687W v3 @ 3.10GHz × 20 )
Graphics: NVIDIA Corporation GP104 [GeForce GTX 1080] / NVIDIA Corporation
OS: Ubuntu 20.04.2 LTS 64 bit
I want to have 2 nodes with 10 CPUs each and 1 node for the GPU.
I have tried the following. After writing the configuration, I ran:
> sudo systemctl restart slurmctld
which gave no error. But I got an error with slurmd:
> sudo systemctl restart slurmd
The error is as below:
Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.
If I run "systemctl status slurmd.service", I get:
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2021-06-06 21:47:26 CEST; 1min 14s ago
Docs: man:slurmd(8)
Process: 52710 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
Here is my configuration file, slurm.conf, generated by configurator_easy.html and saved as /etc/slurm-llnl/slurm.conf:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
SlurmctldHost=myhostname
#
AuthType=auth/menge
Epilog=/usr/local/slurm/epilog
Prolog=/usr/local/slurm/prolog
FirstJobId=0
InactiveLimit=120
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp
KillWait=30
MinJobAge=300
MaxJobCount=10000
#PluginDir=/usr/local/lib
ReturnToService=0
SlurmdPort=6818
SlurmctldPort=6817
SlurmdSpoolDir=/var/spool/slurmd.spool
StateSaveLocation=/var/spool/slurm-llnl/slurm.state
SwitchType=switch/none
TmpFS=/tmp
WaitTime=30
SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmUser=slurm
SlurmdUser=root
TaskPlugin=task/affinity
#
# TIMERS
SlurmctldTimeout=120
SlurmdTimeout=300
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
# LOGGING AND ACCOUNTING
#AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
#JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/SlurmctldLogFile
#SlurmdDebug=info
#SlurmdLogFile=
#
# COMPUTE NODES
NodeName=Linux[1-32] State=UP
NodeName=DEFAULT State=UNKNOWN
PartitionName=Linux[1-32] Default=YES
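One thing that can be ruled out mechanically for the COMPUTE NODES section above: slurmd looks up its own entry in slurm.conf by host name, and `Linux[1-32]` is the configurator's placeholder rather than the name of this machine. The sketch below is a hypothetical helper (the function name `node_is_defined` is made up for illustration) that checks whether a given name appears literally in a `NodeName=` line. It does not expand bracket ranges.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if NAME appears literally as a NodeName in the config.
# Bracket ranges such as Linux[1-32] are NOT expanded here.
node_is_defined() {
    local name="$1" conf="${2:-/etc/slurm-llnl/slurm.conf}"
    awk -v n="$name" '
        tolower($1) ~ /^nodename=/ {
            # Split the comma-separated value of NodeName= and compare as strings.
            split(substr($1, index($1, "=") + 1), names, ",")
            for (i in names) if ((names[i] "") == (n "")) found = 1
        }
        END { exit found ? 0 : 1 }' "$conf"
}

# Usage:
#   node_is_defined "$(hostname -s)" && echo "node entry found" || echo "no matching NodeName"
```

If no entry matches, `slurmd -C` should print the detected hardware (CPUs, memory) as a ready-made `NodeName=...` line that can be used as a starting point, and `scontrol show hostnames 'Linux[1-32]'` expands a bracket range if you need to compare against one. Separately, `AuthType=auth/menge` in the file above looks like a typo for `auth/munge`, which is worth double-checking.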
Upvotes: 1
Views: 3858
Reputation: 36
I have Ubuntu 20.04 running on WSL and was also struggling with setting up Slurm. Everything seems to be running fine now, though I am still a beginner.
I recommend checking the logs carefully:
cat /var/log/slurmctld.log
cat /var/log/slurmd.log
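The log paths above are the ones from my setup; yours depend on `SlurmctldLogFile` and `SlurmdLogFile` in slurm.conf. A minimal sketch for looking them up, assuming the config lives at /etc/slurm-llnl/slurm.conf as in the question (the function name `slurm_log_paths` is hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print where slurmctld and slurmd are configured to log.
# The config path defaults to the one used in the question; pass another as $1.
slurm_log_paths() {
    local conf="${1:-/etc/slurm-llnl/slurm.conf}" key value
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    for key in SlurmctldLogFile SlurmdLogFile; do
        # Take the last uncommented "Key=value" line for this key.
        value=$(awk -F= -v k="$key" '
            tolower($1) == tolower(k) { v = substr($0, index($0, "=") + 1) }
            END { print v }' "$conf")
        if [ -n "$value" ]; then
            echo "$key=$value"
        else
            echo "$key is not set (the daemon logs to syslog instead)"
        fi
    done
}

# Usage:
#   slurm_log_paths                    # then: sudo tail -n 50 <path printed above>
#   sudo journalctl -u slurmd -n 50    # when SlurmdLogFile is not set
```

In the config from the question, `SlurmdLogFile` is commented out, so there is probably no slurmd log file to `cat`; on a systemd machine the messages should show up in `journalctl -u slurmd` instead.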
In my case I had some permission issues, so I had to make sure the Slurm-related directories were owned by the SlurmUser defined in the config.
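To see the ownership at a glance, something like the following could be used. It is only a sketch: the function name `slurm_dir_owners` is made up, the default config path is the one from the question, and `stat -c` assumes GNU coreutils (as on Ubuntu).

```shell
#!/usr/bin/env bash
# Hypothetical helper: show owner, group and mode of paths that slurm.conf points at.
slurm_dir_owners() {
    local conf="${1:-/etc/slurm-llnl/slurm.conf}" key path
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    for key in StateSaveLocation SlurmctldLogFile SlurmdSpoolDir; do
        # Take the last uncommented "Key=value" line for this key.
        path=$(awk -F= -v k="$key" '
            tolower($1) == tolower(k) { v = substr($0, index($0, "=") + 1) }
            END { print v }' "$conf")
        [ -n "$path" ] || continue
        if [ -e "$path" ]; then
            # Prints e.g. "StateSaveLocation: slurm:slurm 755 /path"
            stat -c "$key: %U:%G %a %n" "$path"
        else
            echo "$key: $path does not exist"
        fi
    done
}

# Usage:
#   slurm_dir_owners
#   sudo chown slurm /var/spool/slurm-llnl/slurm.state   # path and user taken from the question's config
```

StateSaveLocation and the slurmctld log are written by slurmctld, which runs as SlurmUser, so those are the ones where the owner matters most. SlurmdSpoolDir is used by slurmd, which runs as root in the question's config, so for it the check is mainly that the directory exists.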
At first glance, I see the following lines in your config that could cause the problem (compared with my settings):
I hope some of the above helps.
Regards
Edit: I would also refer to the following post, which could be similar to your case if you run your command with sudo.
Upvotes: 2