Reputation: 21
I have submitted my job on a Linux cluster that uses SLURM to schedule jobs, but the time limit of each partition is only 24 hours (this limit is set by the admin), and it seems that my code needs to run for more than a week (as per my guess). I am new to SLURM scripts and understand very little about the interplay between the following:
#SBATCH --nodes=
#SBATCH --ntasks-per-node=
#SBATCH --ntasks=
#SBATCH --ntasks-per-core=
I am looking for a way to work around the time limit when submitting the job so that my complete job can run.
Suggestions are appreciated.
Upvotes: 2
Views: 4456
Reputation: 525
For anyone getting here, I would suggest looking at the "singleton" dependency. I found a good example at the following link, which I am pasting below.
Example taken from https://researchcomputing.princeton.edu/support/knowledge-base/slurm
#!/bin/bash
#SBATCH --job-name=LongJob # create a short name for your job
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G # memory per node (4G per cpu-core is default)
#SBATCH --time=00:01:00 # total run time limit (HH:MM:SS)
#SBATCH --dependency=singleton # job dependency
#SBATCH --mail-type=begin # send email when job begins
#SBATCH --mail-type=end # send email when job ends
#SBATCH --mail-user=<YourNetID>@princeton.edu
module purge
module load anaconda3/2020.11
conda activate galaxy-env
python myscript.py
Notice the line #SBATCH --dependency=singleton: it tells SLURM that only one job with this job name (for this user) may run at a time, so each submission waits for the previous one to finish.
Then submit the script multiple times, like so:
$ sbatch job.slurm # step 1
$ sbatch job.slurm # step 2
$ sbatch job.slurm # step 3
$ sbatch job.slurm # step 4
$ sbatch job.slurm # step 5
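If you want to queue all the restarts at once, a short loop does the same thing as typing sbatch repeatedly (assuming the script above is saved as job.slurm):
#!/bin/bash
# Submit the same script five times; because of --dependency=singleton
# (same job name, same user), SLURM runs them one after another.
for i in {1..5}; do
    sbatch job.slurm
done
Note that this only helps if whatever myscript.py does can pick up where the previous run stopped, for example by writing and re-reading a checkpoint file.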
Upvotes: 0
Reputation: 528
The time limit is set by the admin and is defined in slurm.conf at /etc/slurm/slurm.conf; the partition definitions there carry the limit, and I am afraid you cannot bypass it.
So the only thing you can do is checkpointing: modify the program so that it periodically saves its state and can resume from that state on the next run. Most programs that are meant to run for long durations should provide something like this.
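As a rough sketch of that checkpoint-and-resubmit pattern (the program ./my_solver, its --resume flag, the files checkpoint.dat and finished.flag, and the script name job_checkpoint.slurm are all placeholders for whatever your own program provides):
#!/bin/bash
#SBATCH --job-name=LongJob
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=24:00:00            # the partition's hard limit

# Placeholder names throughout: my_solver, --resume, checkpoint.dat, finished.flag.
# Stop the program after 23.5 hours so there is time left to resubmit.
if [ -f checkpoint.dat ]; then
    timeout 23.5h ./my_solver --resume checkpoint.dat
else
    timeout 23.5h ./my_solver
fi

# If the program has not written its "finished" marker yet, queue another run.
if [ ! -f finished.flag ]; then
    sbatch job_checkpoint.slurm
fi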
It also seems you are from Nepal; if you happen to be running this on the Kathmandu University HPC, you can ask the administrators there, and they should be able to help you.
Regarding your second question:
#SBATCH --nodes=
#SBATCH --ntasks-per-node=
#SBATCH --ntasks=
#SBATCH --ntasks-per-core=
--nodes means the number of physical nodes.
For the ntasks-related options, I recommend you look at this question: What does the --ntasks or -n tasks does in SLURM?
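As a small, hypothetical example of how these directives combine (./my_program is just a placeholder):
#!/bin/bash
#SBATCH --nodes=2              # request 2 physical nodes
#SBATCH --ntasks-per-node=4    # run 4 tasks (processes) on each node
#SBATCH --cpus-per-task=1      # give each task 1 CPU core

# With the settings above, SLURM starts 2 x 4 = 8 tasks in total;
# you would normally set either --ntasks or --ntasks-per-node, not both.
srun ./my_program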
Upvotes: 1