I am inside a Kubernetes pod that has the Slurm client libraries, MUNGE, etc. installed, and I have mounted the MUNGE socket from the host into the pod, so authentication to the controller succeeds.
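A minimal sketch of the kind of mount I mean (the image name is a placeholder and the socket path assumes the usual MUNGE default under /run/munge, not my exact manifest):

apiVersion: v1
kind: Pod
metadata:
  name: slurm-client
spec:
  containers:
  - name: slurm-client
    image: my-slurm-client:latest     # placeholder image with Slurm client tools and MUNGE
    volumeMounts:
    - name: munge-socket
      mountPath: /run/munge           # assumed default MUNGE socket directory on the host
  volumes:
  - name: munge-socket
    hostPath:
      path: /run/munge
      type: Directory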
When I run sinfo, communication with the controller works fine and I get the expected output from the Slurm cluster, which runs outside the Kubernetes cluster on bare-metal nodes. I believe that is because sinfo does not dispatch anything to the compute/worker nodes; it is only the controller that sends back the response.
However, when I try "srun" or "sbatch", the command times out with the error shown below. Unlike "sinfo", I believe these commands are dispatched to a worker/compute node:
srun: error: Timed out waiting for job step to complete
When I run the pod with hostNetwork enabled, everything works fine, so I suspect the CNI within Kubernetes is causing the timeout, perhaps because the container's overlay IP is not reachable from (or conflicts with) the Slurm cluster's network(?).
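For comparison, the working variant just sets hostNetwork on the pod spec, roughly like this (again an illustrative sketch with the same placeholder image, not the exact manifest):

apiVersion: v1
kind: Pod
metadata:
  name: slurm-client-hostnet
spec:
  hostNetwork: true                   # pod shares the node's network namespace, so srun listens on a routable host IP
  containers:
  - name: slurm-client
    image: my-slurm-client:latest     # same placeholder image as above
    volumeMounts:
    - name: munge-socket
      mountPath: /run/munge
  volumes:
  - name: munge-socket
    hostPath:
      path: /run/munge
      type: Directory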
The slurmd on the worker/compute node that is executing the job is not able to send the response back to the srun command running in the Kubernetes pod.
I do see existing solutions like SUNK from CoreWeave that can run a login node and submit Slurm commands, but in those cases the Slurm cluster itself is also in Kubernetes, which is not my situation; my cluster is hosted separately on bare-metal nodes.
I am not a Slurm expert; do you happen to know whether Slurm supports this kind of setup in general?
Thank you for your time!