Reputation: 21
Batch script to run many serial jobs in parallel on an HPC with Slurm
I want to run a large number of independent serial jobs in parallel using Slurm. However, I run into the limit of 100 jobs that a user can have submitted at once, so my script only ever processes 100 jobs simultaneously.
Is there a better way, so that I can submit the complete simulation as one big job?
#!/bin/bash
max_jobs=100
# Set the directory where the simulation folders are located
dir="/work/parameter_study/"
# Loop over the parameter cases
for param_case in {0001..0216}_sim; do
    cd "$dir/$param_case"
    # Loop over the Monte Carlo simulations
    for mcs_case in {0001..1500}_MCS; do
        cd "$dir/$param_case/$mcs_case"
        #sed -i -e 's/\r$//' a.out
        chmod 777 a.out
        # Wait until the number of queued jobs drops below max_jobs
        while true
        do
            # Count this user's pending and running jobs (header suppressed by -h)
            job_count=$(squeue -h -u "$USER" -t PD,R | wc -l)
            if [ "$job_count" -lt "$max_jobs" ]
            then
                break
            fi
            sleep 0.5
        done
        # Submit a job for each simulation using the a.out file
        jobID=$(sbatch -p single -J "${param_case}_${mcs_case}" --wrap="./a.out")
        echo "${jobID} ${param_case} ${mcs_case} - $(date '+%H:%M:%S')"
    done
done
# Wait for all jobs to finish
wait
Upvotes: 1
Views: 1000
Reputation: 3530
So you have 216 x 1500 = 324,000 individual jobs. Assuming you can run 40 tasks per node at the same time, you would need about 8,100 nodes to run all of your tasks simultaneously. It is unrealistic to get such a huge allocation on your cluster.
So I would recommend the following: based on your cluster's scheduling and typical queue waiting times, work out the ideal number of nodes you can request per job submission while still getting a decent waiting time. If that number is N, then run as many jobs (sbatch submissions) as:
total_job_submissions = number_of_tasks / (max_tasks_per_node * N)
The number of tasks and the maximum number of tasks per node vary depending on your job queue.
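For example, with the numbers above and an assumed N = 5 nodes per submission:
total_job_submissions = 324000 / (40 * 5) = 1620
so instead of 324,000 single-job submissions you would be down to 1,620 sbatch submissions.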
I would also recommend looking into job arrays.
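As a quick sketch of the syntax (the range and throttle value here are only placeholders, one array task per parameter case), a single array directive submits many identical tasks, and the % suffix limits how many of them run at the same time:
#SBATCH --array=1-216%20    # at most 20 array tasks running simultaneously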
To run your jobs in parallel within the batch script you provided, you just need to do the following:
for (( run=$START_NUM; run<=END_NUM; run++ )); do
    echo "This is SLURM task $SLURM_ARRAY_TASK_ID, run number $run"
    mcs_number=$(( run - (SLURM_ARRAY_TASK_ID - 1) * PER_TASK ))
    param_case=$(printf "%04d" $SLURM_ARRAY_TASK_ID)_sim
    mcs_case=$(printf "%04d" $mcs_number)_MCS
    mcs_dir=$dir$param_case/$mcs_case
    cd "$mcs_dir"
    chmod 777 a.out
    # Launch this run as a job step in the background
    srun -n 1 ./a.out &
done
wait
Adding & will make srun run a.out in the background, and multiple instances of srun will then run in parallel. The wait command at the end ensures that all jobs have finished before the script exits.
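Note that starting all runs of an array task in the background at once can overwhelm the allocation, so you may want to cap how many job steps are in flight. A minimal sketch of one way to do this with plain bash (the MAX_PARALLEL value is an assumption you would tune to your allocation, and wait -n requires bash 4.3+ rather than plain sh):
MAX_PARALLEL=40   # assumed: number of runs kept in flight at once
for (( run=$START_NUM; run<=END_NUM; run++ )); do
    # ... compute mcs_dir and cd into it exactly as in the loop above ...
    # If MAX_PARALLEL steps are already running, wait for any one of them to finish
    while (( $(jobs -rp | wc -l) >= MAX_PARALLEL )); do
        wait -n
    done
    srun -n 1 ./a.out &
done
wait   # wait for the remaining background steps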
Upvotes: 1
Reputation: 21
That's my batch script to run a job array. I can call it with:
sbatch -p single array.sh
The array starts 100 jobs, in each of which 1500 calculations are executed one after the other. Is there a way to execute these 1500 individual a.out runs not serially but in parallel?
#!/bin/bash
#SBATCH --job-name=mega_array    # Job name
#SBATCH --nodes=1                # Use one node
#SBATCH --ntasks=1               # Run a single task
#SBATCH --mem-per-cpu=1gb        # Memory per processor
#SBATCH --time=14:00:00          # Time limit hrs:min:sec
#SBATCH --array=1-100            # Array range
pwd; hostname; date

PER_TASK=1500
START_NUM=$(( (SLURM_ARRAY_TASK_ID - 1) * PER_TASK + 1 ))
END_NUM=$(( SLURM_ARRAY_TASK_ID * PER_TASK ))
echo "This is task $SLURM_ARRAY_TASK_ID, which will do runs $START_NUM to $END_NUM"

dir="/work/"
for (( run=$START_NUM; run<=END_NUM; run++ )); do
    echo "This is SLURM task $SLURM_ARRAY_TASK_ID, run number $run"
    mcs_number=$(( run - (SLURM_ARRAY_TASK_ID - 1) * PER_TASK ))
    param_case=$(printf "%04d" $SLURM_ARRAY_TASK_ID)_sim
    mcs_case=$(printf "%04d" $mcs_number)_MCS
    mcs_dir=$dir$param_case/$mcs_case
    cd "$mcs_dir"
    chmod 777 a.out
    ./a.out
done
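If the 1500 runs inside each array task should execute in parallel, as asked above, you would combine this script with the srun ... & / wait approach from the answer, and the #SBATCH header also has to request matching resources. A rough sketch of possible header changes, with assumed values (40 concurrent runs per array task, at most 20 array tasks running at once):
#SBATCH --ntasks=40        # assumed: resources for 40 concurrent job steps per array task
#SBATCH --array=1-100%20   # assumed: throttle to 20 simultaneously running array tasks
With --ntasks=1 as in the current header, Slurm only allocates resources for one task, so depending on the Slurm version additional background job steps may simply wait for that single task's resources instead of truly running in parallel.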
Upvotes: 1