Reputation: 21
Batch script to run many serial jobs in parallel on an HPC with Slurm
I want to run a large number of independent serial jobs in parallel using Slurm. However, I run into the limit of 100 jobs that a user can have submitted at once, so my script only ever processes 100 jobs simultaneously.
Is there a better way, so that I can submit the complete simulation as one big job?
#!/bin/bash
max_jobs=100
# Set the directory where the simulation folders are located
dir="/work/parameter_study/"
# Loop over the parameter cases
for param_case in {0001..0216}_sim; do
    cd "$dir/$param_case"
    # Loop over the Monte Carlo simulations
    for mcs_case in {0001..1500}_MCS; do
        cd "$dir/$param_case/$mcs_case"
        #sed -i -e 's/\r$//' a.out
        chmod 777 a.out
        # Wait until the number of queued jobs drops below max_jobs
        while true
        do
            # Count this user's pending and running jobs (header suppressed by -h)
            job_count=$(squeue -h -u "$USER" -t PD,R | wc -l)
            if [ "$job_count" -lt "$max_jobs" ]
            then
                break
            fi
            sleep 0.5
        done
        # Submit a job for each simulation using the a.out file
        jobID=$(sbatch -p single -J "${param_case}_${mcs_case}" --wrap="./a.out")
        echo "${jobID} ${param_case} ${mcs_case} - $(date '+%H:%M:%S')"
    done
done
# Wait for all jobs to finish
wait
Upvotes: 1
Views: 1000
Reputation: 3530
So you have 216 x 1500 = 324,000 individual jobs. Assuming you can run 40 tasks per node at the same time, you would need about 8,100 nodes to run all of your tasks simultaneously. It is unrealistic to get such a huge allocation on your cluster.
So I would recommend the following: based on your cluster's scheduling and typical queue waiting times, work out the ideal number of nodes you can request per job submission while still getting a decent waiting time. If that number is N, then run as many jobs (sbatch submissions) as:
total_job_submissions = number_of_tasks / (max_tasks_per_node * N)
The number of tasks and the maximum number of tasks per node vary depending on your job queue.
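For example, with the numbers above and an assumed N = 5 nodes per submission:
total_job_submissions = 324000 / (40 * 5) = 1620
so instead of 324,000 single-job submissions you would be down to 1,620 sbatch submissions.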
I would also recommend looking into job arrays.
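As a quick sketch of the syntax (the range and throttle value here are only placeholders, one array task per parameter case), a single array directive submits many identical tasks, and the % suffix limits how many of them run at the same time:
#SBATCH --array=1-216%20    # at most 20 array tasks running simultaneously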
To run your jobs in parallel within the batch script you provided, you just need to do the following:
for (( run=$START_NUM; run<=END_NUM; run++ )); do
    echo "This is SLURM task $SLURM_ARRAY_TASK_ID, run number $run"
    mcs_number=$(( run - (SLURM_ARRAY_TASK_ID - 1) * PER_TASK ))
    param_case=$(printf "%04d" $SLURM_ARRAY_TASK_ID)_sim
    mcs_case=$(printf "%04d" $mcs_number)_MCS
    mcs_dir=$dir$param_case/$mcs_case
    cd "$mcs_dir"
    chmod 777 a.out
    # Launch this run as a job step in the background
    srun -n 1 ./a.out &
done
wait
Adding & will make srun run a.out in the background, and multiple instances of srun will then run in parallel. The wait command at the end ensures that all jobs have finished before the script exits.
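Note that starting all runs of an array task in the background at once can overwhelm the allocation, so you may want to cap how many job steps are in flight. A minimal sketch of one way to do this with plain bash (the MAX_PARALLEL value is an assumption you would tune to your allocation, and wait -n requires bash 4.3+ rather than plain sh):
MAX_PARALLEL=40   # assumed: number of runs kept in flight at once
for (( run=$START_NUM; run<=END_NUM; run++ )); do
    # ... compute mcs_dir and cd into it exactly as in the loop above ...
    # If MAX_PARALLEL steps are already running, wait for any one of them to finish
    while (( $(jobs -rp | wc -l) >= MAX_PARALLEL )); do
        wait -n
    done
    srun -n 1 ./a.out &
done
wait   # wait for the remaining background steps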
Upvotes: 1
Reputation: 21
That's my batch script to run a job array. I can call it with:
sbatch -p single array.sh
The array starts 100 jobs, in each of which 1500 calculations are executed one after the other. Is there a way to execute these 1500 individual a.out runs not serially but in parallel?
#!/bin/bash
#SBATCH --job-name=mega_array    # Job name
#SBATCH --nodes=1                # Use one node
#SBATCH --ntasks=1               # Run a single task
#SBATCH --mem-per-cpu=1gb        # Memory per processor
#SBATCH --time=14:00:00          # Time limit hrs:min:sec
#SBATCH --array=1-100            # Array range
pwd; hostname; date

PER_TASK=1500
START_NUM=$(( (SLURM_ARRAY_TASK_ID - 1) * PER_TASK + 1 ))
END_NUM=$(( SLURM_ARRAY_TASK_ID * PER_TASK ))
echo "This is task $SLURM_ARRAY_TASK_ID, which will do runs $START_NUM to $END_NUM"

dir="/work/"
for (( run=$START_NUM; run<=END_NUM; run++ )); do
    echo "This is SLURM task $SLURM_ARRAY_TASK_ID, run number $run"
    mcs_number=$(( run - (SLURM_ARRAY_TASK_ID - 1) * PER_TASK ))
    param_case=$(printf "%04d" $SLURM_ARRAY_TASK_ID)_sim
    mcs_case=$(printf "%04d" $mcs_number)_MCS
    mcs_dir=$dir$param_case/$mcs_case
    cd "$mcs_dir"
    chmod 777 a.out
    ./a.out
done
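If the 1500 runs inside each array task should execute in parallel, as asked above, you would combine this script with the srun ... & / wait approach from the answer, and the #SBATCH header also has to request matching resources. A rough sketch of possible header changes, with assumed values (40 concurrent runs per array task, at most 20 array tasks running at once):
#SBATCH --ntasks=40        # assumed: resources for 40 concurrent job steps per array task
#SBATCH --array=1-100%20   # assumed: throttle to 20 simultaneously running array tasks
With --ntasks=1 as in the current header, Slurm only allocates resources for one task, so depending on the Slurm version additional background job steps may simply wait for that single task's resources instead of truly running in parallel.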
Upvotes: 1