khituras

Reputation: 1091

How to submit parallel job steps with SLURM?

I have the following SLURM job script named gzip2zipslurm.sh:

#!/bin/bash
#SBATCH --mem 70G
#SBATCH --ntasks 4
echo "Task 1"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.A-B.xml.tar.gz  &
echo "Task 2"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.C-H.xml.tar.gz  &
echo "Task 3"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.I-N.xml.tar.gz  &
echo "Task 4"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.O-Z.xml.tar.gz  &
echo "Waiting for job steps to end"
wait
echo "Script complete"

I submit it to SLURM with sbatch gzip2zipslurm.sh. When I do, the SLURM log file contains:

Task 1
Task 2
Task 3
Task 4
Waiting for job steps to end

The tar2zip program reads the given tar.gz file and re-packages it as a ZIP file.

The Problem: Only one CPU (out of 16 available on an idle node) is doing any work. With top I can see that, in total, 5 srun commands have been started (4 for my tasks and, I guess, 1 implicit one for the sbatch job), but there is only one Java process. I can also see it in the files being worked on: only one is being written.

How do I manage that all 4 tasks are actually executed in parallel?

Thanks for any hints!

Upvotes: 1

Views: 2606

Answers (1)

damienfrancois

Reputation: 59110

The issue might be with the memory reservation. In the submission script, you set --mem 70G; that is the memory requested for the whole job.

When srun is used within a submission script, it inherits parameters from sbatch, including --mem=70G. So each of your job steps is actually implicitly run as:

srun --mem 70G -n1 java -Xmx10g -jar ...

Because the first step claims the full 70GB, the remaining steps have no memory left to allocate and are deferred. Try explicitly setting the per-step memory to roughly 70GB/4:

srun --mem 17G -n1 java -Xmx10g -jar ...

Also, as per the documentation, you should use --exclusive with srun in such a context.

srun --exclusive --mem 17G -n1 java -Xmx10g -jar ...

This option can also be used when initiating more than one job step within an existing resource allocation, where you want separate processors to be dedicated to each job step. If sufficient processors are not available to initiate the job step, it will be deferred. This can be thought of as providing a mechanism for resource management to the job within its allocation.
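Putting both changes together, the submission script from the question would look like this (a sketch; jar and archive names are taken from the original script, and 17G assumes the 70GB/4 split suggested above):

```shell
#!/bin/bash
#SBATCH --mem 70G
#SBATCH --ntasks 4

# Each step requests 1 task and a quarter of the job's memory, so the
# four steps fit inside the allocation simultaneously. --exclusive asks
# SLURM to dedicate separate processors to each step.
srun --exclusive --mem 17G -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.A-B.xml.tar.gz &
srun --exclusive --mem 17G -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.C-H.xml.tar.gz &
srun --exclusive --mem 17G -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.I-N.xml.tar.gz &
srun --exclusive --mem 17G -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.O-Z.xml.tar.gz &

# wait blocks until all backgrounded job steps have finished.
wait
```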

Upvotes: 3
