Reputation: 1244
I am trying to set up my Ray cluster with an sbatch script.
I am starting the head and worker nodes as steps in my script.
The worker nodes are expected to keep running as long as the job is alive.
#!/usr/bin/env bash
#SBATCH -A account --partition pp --nodes 3 --ntasks-per-node 1 --cpus-per-task 64
# Collect the allocated node names and resolve each node's IP address.
readarray -t node_names < <(scontrol show hostnames "$SLURM_JOB_NODELIST")
readarray -t node_ips < <(
    printf '%s\n' "${node_names[@]}" |
        xargs -I {} srun -J "get-ip" --nodes=1 --ntasks=1 -w {} hostname --ip-address
)
# start head node on the first allocated node
srun -J 'ray-head-node' -n1 -N1 -c1 -w "${node_names[0]}" \
    ray_helper.bash --start-head --ip "${node_ips[0]}" &
# start worker nodes on the remaining nodes, pointing them at the head's IP
for nname in "${node_names[@]:1}"; do
    srun -J 'ray-worker-node' -n1 -N1 -c1 -w "$nname" \
        ray_helper.bash --start-worker --ip "${node_ips[0]}" &
    sleep 60
done
sleep 60
# Continue with executing compute on the Ray cluster.
...
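(For context, the compute part elided above would be another job step run against the head node. A purely hypothetical sketch, with my_ray_driver.py as a made-up name and port 6379 assumed for the head; Ray picks up the cluster address from the RAY_ADDRESS environment variable:)

# hypothetical driver step, not my actual script
RAY_ADDRESS="${node_ips[0]}:6379" \
    srun -J 'ray-driver' -n1 -N1 -w "${node_names[0]}" python my_ray_driver.py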
## ray_helper.bash
start_cluster() { ... }
start_cluster "$@"
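The helper is only stubbed out above. A minimal sketch, assuming it simply maps --start-head / --start-worker and --ip onto ray start --block (the port 6379 and the argument parsing here are illustrative, not my exact script):

#!/usr/bin/env bash
# Illustrative only: start the Ray head or a Ray worker and block so the
# srun step stays alive for the lifetime of the job.
start_cluster() {
    local mode="$1"; shift
    local ip=""
    while [[ $# -gt 0 ]]; do
        case "$1" in
            --ip) ip="$2"; shift 2 ;;
            *) shift ;;
        esac
    done
    if [[ "$mode" == "--start-head" ]]; then
        ray start --head --node-ip-address="$ip" --port=6379 --block
    else
        ray start --address="$ip:6379" --block
    fi
}
start_cluster "$@"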
When one of the subcommands (run via srun) fails, the rest of the script becomes void.
I am aware of setting up a trap to capture job failures via --signal (sketched below), but:

How can I trap a step failure in general?

How can I implement different traps for different groups of steps, e.g. separate handlers for the ray-head-node and ray-worker-node steps?
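For reference, this is roughly what I mean by the --signal approach; the choice of signal (USR1), the lead time, and the handler body are placeholders:

#SBATCH --signal=B:USR1@120   # ask Slurm to signal only the batch shell 120s before the job ends

cleanup() {
    echo "job is ending, tearing down the ray cluster" >&2
    # placeholder: stop ray, cancel remaining steps, flush logs, ...
}
trap cleanup USR1

This catches the job as a whole approaching its end, but it does not tell me which step failed, and it gives me no way to react differently for the head and the worker steps.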
Upvotes: 0
Views: 28