SHB11

Reputation: 375

Batch script for multi-partition job?

I'm working on a project that runs programs on two different partitions of a large compute cluster. I'd like to drive this from a batch script, but after searching, it's still unclear whether (and how) I can allocate and run programs on two different partitions from within a single batch script. Here's the sort of thing I'd like to do:

#!/bin/bash
#SBATCH --partition=<WHAT GOES HERE? I want to perform 100 processes on partition "batch" and 1 process on partition "gpu". I will alternate between the 2 during my jobs execution>
#SBATCH --ntasks=<100 on batch, 1 on gpu>
#SBATCH --mem-per-cpu=2G
#SBATCH --time=4-00:00:00
#SBATCH --exclude=nodeynode[003,016,019,020-023,026-030,004-015,017-018,020,024,031]
#SBATCH --job-name="lorem_ipsum"

filenames=("name1" "name2" "name3")

srun -p gpu python gpu_init.py

for i in {1..100}
do
    for name in "${filenames[@]}"
    do
        srun -p batch pythonexecutable &
    done
    wait    # all batch tasks must finish before the gpu step
    srun -p gpu python gpu_iter.py
done

Apologies for any bash errors; I usually script in Python, but I can't here because I'm switching between Python modules (different versions) within my bash script (not shown). I saw that you can put a list of partitions in the header of a batch script, but from what I read, that just tells the scheduler to allocate any one available partition from the list, not multiple partitions at once.

Thanks!

Upvotes: 4

Views: 5179

Answers (1)

damienfrancois

Reputation: 59180

Slurm jobs are restricted to a single partition, so in your case there are several possible courses of action:

  • submit two job arrays with --array=1-100, splitting your submission script into one part for the batch partition and another for the gpu partition, and link the two arrays with --dependency=aftercorr:<job_id of the 'batch' job array>;

  • use salloc to create an allocation on the gpu partition, and then SSH explicitly to that node to run python gpu_iter.py from the submission script (if the cluster configuration permits);

  • modify gpu_iter.py so that it can be told (with UNIX signals) that it has to run and then sleep until the next signal, and use scancel --signal to signal the gpu job from within the batch job at each iteration.
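The third option can be tried out with plain shell tools before wiring it into Slurm. The sketch below is an illustration only: a background shell function stands in for the long-running gpu job, plain kill -USR1 stands in for scancel --signal=USR1 <gpu_jobid>, and the file name gpu.log is made up for the demo.

```shell
#!/bin/bash
rm -f gpu.log

# Stand-in for the gpu job: sleeps until it receives SIGUSR1,
# performs one "iteration", then goes back to sleep.
gpu_worker() {
    trap 'echo "gpu iteration" >> gpu.log' USR1
    while true; do
        sleep 1 & sleep_pid=$!
        wait "$sleep_pid"          # interrupted early when a signal arrives
        kill "$sleep_pid" 2>/dev/null
    done
}

gpu_worker &                       # in Slurm this would be the submitted gpu job
worker_pid=$!
sleep 0.5                          # let the worker install its trap first

for i in 1 2 3; do
    # ... batch-partition work for this iteration would run here ...
    kill -USR1 "$worker_pid"       # scancel --signal=USR1 <gpu_jobid> in Slurm
    sleep 0.5                      # give the worker time to handle the signal
done

kill "$worker_pid"
cat gpu.log
```

Each signal wakes the worker's wait, the trap handler runs one iteration, and the loop goes back to sleeping until the next signal.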

Update: according to this ticket, this can now be done with heterogeneous jobs.
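With heterogeneous jobs, the two partitions become two components of a single submission, roughly along these lines. This is a sketch built from the question's script, using the hetjob/--het-group syntax of recent Slurm releases (older releases spelled these packjob and --pack-group; check your site's version and documentation).

```shell
#!/bin/bash
# Component 0: the 100 tasks on the "batch" partition
#SBATCH --partition=batch
#SBATCH --ntasks=100
#SBATCH --mem-per-cpu=2G
#SBATCH --time=4-00:00:00
#SBATCH --job-name="lorem_ipsum"
#SBATCH hetjob
# Component 1: the single task on the "gpu" partition
#SBATCH --partition=gpu
#SBATCH --ntasks=1

srun --het-group=1 python gpu_init.py

for i in {1..100}
do
    srun --het-group=0 pythonexecutable &
    wait    # batch tasks finish before the gpu step
    srun --het-group=1 python gpu_iter.py
done
```

Each srun then targets one component of the allocation with --het-group, so the script can alternate between the partitions without separate jobs or signalling.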

Upvotes: 4

Related Questions