satishkumar432

Reputation: 397

Triggering a Slurm srun command inside a Kubernetes pod

I am inside a Kubernetes pod that has the Slurm client libraries, munge, etc. installed, and I have mounted the munge socket from the host into the pod, so authentication to the controller succeeds.
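For context, the munge socket mount looks roughly like this (a sketch; the image name is illustrative, and I am assuming munged on the host exposes its socket under /run/munge):

```yaml
# Hypothetical pod spec fragment: mount the host's munge socket
# directory into the pod so the Slurm client commands can authenticate.
apiVersion: v1
kind: Pod
metadata:
  name: slurm-client            # illustrative name
spec:
  containers:
  - name: slurm-client
    image: my-slurm-client:latest   # assumed image with Slurm client + munge libs
    volumeMounts:
    - name: munge-socket
      mountPath: /run/munge         # where the client libs expect the socket
  volumes:
  - name: munge-socket
    hostPath:
      path: /run/munge              # host path where munged keeps its socket
      type: Directory
```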

When I run sinfo, the communication with the controller works fine and I get output from the Slurm cluster, which runs outside the Kubernetes cluster on a bare-metal node. I believe that is because sinfo is not dispatched to the compute/worker nodes for execution; it is presumably just the controller that sends back the response.

However, when I try srun or sbatch, the command times out with the error shown below. Unlike sinfo, I believe these commands do involve the worker/compute nodes:

srun: error: Timed out waiting for job step to complete

When I run the pod with hostNetwork enabled, everything works fine, so I suspect the CNI within Kubernetes is causing the timeout, or that the container's overlay IP address is not routable (or is conflicting) from outside the cluster.
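For comparison, the working variant is the same pod with host networking turned on (again a sketch; dnsPolicy: ClusterFirstWithHostNet is what I am assuming keeps in-cluster DNS resolution working alongside hostNetwork):

```yaml
# Hypothetical workaround: run the pod on the node's network namespace
# so slurmd on the compute nodes can reach srun's listening port directly.
spec:
  hostNetwork: true                  # pod uses the host's IP and ports
  dnsPolicy: ClusterFirstWithHostNet # keep cluster DNS usable with hostNetwork
  containers:
  - name: slurm-client
    image: my-slurm-client:latest    # assumed image, as above
```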

The slurmd on the worker/compute node that is executing the job is not able to send its response back to the srun command running in the Kubernetes pod.

I do see some existing solutions, like SUNK from CW, that can run a login node and submit Slurm commands, but in those cases the Slurm cluster itself also runs in Kubernetes, which is not my case; mine is hosted separately on bare-metal nodes.

I am not a Slurm expert; do you happen to know whether this is something Slurm supports in general?

Thank you for your time!

Upvotes: 0

Views: 41

Answers (0)
