Reputation: 1091
I have the following SLURM job script named gzip2zipslurm.sh:
#!/bin/bash
#SBATCH --mem 70G
#SBATCH --ntasks 4
echo "Task 1"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.A-B.xml.tar.gz &
echo "Task 2"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.C-H.xml.tar.gz &
echo "Task 3"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.I-N.xml.tar.gz &
echo "Task 4"
srun -n1 java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar articles.O-Z.xml.tar.gz &
echo "Waiting for job steps to end"
wait
echo "Script complete"
I submit it to SLURM with sbatch gzip2zipslurm.sh.
When I do, the output of the SLURM log file is
Task 1
Task 2
Task 3
Task 4
Waiting for job steps to end
The tar2zip program reads the given tar.gz file and re-packages it as a ZIP file.
The Problem: Only one CPU (out of 16 available on an otherwise idle node) is doing any work. With top I can see that five srun commands have been started in total (four for my tasks and, I guess, one implicit one for the sbatch job), but there is only one Java process. The output files confirm this: only one of them is being written.
How can I get all four tasks to actually run in parallel?
Thanks for any hints!
Upvotes: 1
Views: 2606
Reputation: 59110
The issue is likely the memory reservation. In the submission script you set --mem=70G; that is the memory request for the whole job.
When srun is used within a submission script, it inherits parameters from sbatch, including --mem=70G. So you actually, implicitly, run the following:
srun --mem 70G -n1 java -Xmx10g -jar ...
Consequently, the first job step reserves all 70G, and the other three steps wait until that memory is free again. Try explicitly setting each step's memory to a quarter of the total (70G/4 ≈ 17G):
srun --mem 17G -n1 java -Xmx10g -jar ...
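The per-step figure is just the job's total memory divided by the number of concurrent steps; a quick sketch of that arithmetic (the variable names are illustrative, not SLURM options):

```shell
# Split the job's total memory evenly across the concurrent job steps.
# Integer division rounds down, which keeps the sum within the allocation.
TOTAL_MEM_G=70
NTASKS=4
PER_STEP_MEM_G=$((TOTAL_MEM_G / NTASKS))
echo "${PER_STEP_MEM_G}G"   # prints 17G
```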
Also, as per the documentation, you should use --exclusive
with srun
in such a context.
srun --exclusive --mem 17G -n1 java -Xmx10g -jar ...
This option can also be used when initiating more than one job step within an existing resource allocation, where you want separate processors to be dedicated to each job step. If sufficient processors are not available to initiate the job step, it will be deferred. This can be thought of as providing a mechanism for resource management to the job within its allocation.
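Putting both suggestions together, the submission script would look roughly like the following sketch (the four identical srun lines are folded into a loop for brevity; the 17G per-step figure assumes the four archives need about the same memory, so adjust it if one is much larger):

```shell
#!/bin/bash
#SBATCH --mem 70G
#SBATCH --ntasks 4

# Each step requests exclusive use of its processors and only a quarter
# of the job's memory, so all four steps can start at once.
for range in A-B C-H I-N O-Z; do
    srun --exclusive --mem 17G -n1 \
        java -Xmx10g -jar tar2zip-1.0.0-jar-with-dependencies.jar \
        "articles.${range}.xml.tar.gz" &
done

echo "Waiting for job steps to end"
wait
echo "Script complete"
```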
Upvotes: 3