Slurm step failure capture via trap

Question

I am trying to set up my Ray cluster with an sbatch script.

I start the head and worker nodes as job steps in my script.

The worker nodes are expected to keep running for as long as the job is alive.

mysbatch.bash

#!/usr/bin/env bash

#SBATCH -A account --partition pp --nodes 3 --ntasks-per-node 1 --cpus-per-task 64

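# Collect the hostnames allocated to this job, one array element per node.
mapfile -t node_names < <(scontrol show hostnames "$SLURM_JOB_NODELIST")
# Resolve each hostname to its IP address with a one-task step on that node.
mapfile -t node_ips < <(
     printf '%s\n' "${node_names[@]}" |
          xargs -I {} srun -J "get-ip" --nodes=1 --ntasks=1 -w {} hostname --ip-address
)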

# start head-node
srun -J 'ray-head-node' -n1 -N1 -c1 -w "${node_names[0]}" \
    ray_helper.bash --start-head --ip "${node_ips[0]}" &
 
# start worker-nodes
for nname in "${node_names[@]:1}"; do
     srun -J 'ray-worker-node' -n1 -N1 -c1 -w "$nname" \
         ray_helper.bash --start-worker --ip "${node_ips[0]}" &
     sleep 60
done
sleep 60

# Continue with executing compute on ray cluster.
...

ray_helper.bash

## ray_helper.bash
start_cluster() { ... }
start_cluster "$@"

When any of the steps launched via srun fails, the rest of the script becomes pointless.
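
To make the failure mode concrete, here is a minimal sketch of checking whether a background step is still running before the script continues. It assumes each step is started with `&` from the batch script itself, so its PID is available in `$!`. The names `check_step` and `step_state` are hypothetical, and `sleep` and `bash -c 'exit 3'` are stand-ins for the real `srun` lines.

```bash
#!/usr/bin/env bash
# Sketch only: check_step and step_state are hypothetical names.

# Set step_state to "running" or "exited:<rc>" for the given background PID.
# Must be called from the shell that started the step (not from $(...)),
# because `wait` can only collect the status of that shell's own children.
check_step() {
    local pid=$1 rc=0
    if kill -0 "$pid" 2>/dev/null; then
        step_state=running
    else
        wait "$pid" || rc=$?
        step_state="exited:$rc"
    fi
}

sleep 30 &                 # stand-in for a long-running step (e.g. the head node)
head_pid=$!
bash -c 'exit 3' &         # stand-in for a step that fails right away
worker_pid=$!

sleep 1                    # give the steps time to start or fail

check_step "$head_pid";   head_state=$step_state
check_step "$worker_pid"; worker_state=$step_state
echo "head: $head_state, worker: $worker_state"

# Clean up the stand-in step so the demo exits promptly.
kill "$head_pid"
wait "$head_pid" 2>/dev/null || true
```

A check like this only detects failures at the points where it is called, for example after the `sleep 60` that follows each launch.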

I am aware that a trap can be set up to catch job-level signals via --signal, but:

How can I trap a step failure in general?
How can I implement different traps for groups of steps, e.g. different handlers for ray-head-node and ray-worker-node?
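
One possible direction, offered as a sketch rather than a confirmed Slurm recipe: wrap each background step in a subshell that signals the batch script when the step exits non-zero, and use a different signal (and so a different trap) per group. It assumes `srun` exits non-zero when its step fails. The names `watch_step`, `head_failed` and `worker_failed` are hypothetical, and the `bash -c 'exit N'` commands are stand-ins for the `srun` lines.

```bash
#!/usr/bin/env bash
# Sketch only: watch_step, head_failed and worker_failed are hypothetical names.

head_failed=0
worker_failed=0
trap 'head_failed=1'   USR1   # handler for the head-node group
trap 'worker_failed=1' USR2   # handler for the worker-node group

# Run "$@" in a background subshell; if it exits non-zero, send signal $1
# to the main script. Inside the subshell, $$ still expands to the PID of
# the main script, so the signal reaches the shell that owns the traps.
watch_step() {
    local sig=$1; shift
    ( "$@" || kill -s "$sig" "$$" ) &
}

watch_step USR1 bash -c 'exit 0'   # stand-in for the head-node srun
watch_step USR2 bash -c 'exit 0'   # stand-in for a healthy worker srun
watch_step USR2 bash -c 'exit 7'   # stand-in for a failing worker srun

# `wait` returns a status above 128 whenever a trapped signal interrupts it,
# so repeat until every background step has finished.
until wait; do :; done

echo "head_failed=$head_failed worker_failed=$worker_failed"
```

In mysbatch.bash the stand-ins would be replaced by prefixing the existing `srun` commands with `watch_step USR1` or `watch_step USR2` and dropping the trailing `&`. Two caveats: Bash runs a trap only after the current foreground command returns, and standard signals do not queue, so the handlers here set flags instead of counting failures.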

Answers (0)
