SJuhel

Reputation: 182

Random "[Errno 13] Permission denied" during snakemake pipeline execution on cluster

I get an annoying write error during a pipeline execution which shuts everything down, and I cannot understand the reason. I don't know whether this is a bug or a usage problem, hence posting here before opening an issue on the Snakemake repo.

Snakemake version : 7.19.1

Describe the bug

I randomly get the following error during the execution of a pipeline on a cluster. At the moment, I can't find the reason/context that makes it happen.

It looks like, for some reason, Snakemake cannot write a temporary .sh script, although it has no problem writing the same script for other wildcard sets.

It might be related to the cluster I'm using and not Snakemake, but I would like to make sure of that.

Logs

    Traceback (most recent call last):
      File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/__init__.py", line 752, in snakemake
        success = workflow.execute(
      File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/workflow.py", line 1089, in execute
        raise e
      File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/workflow.py", line 1085, in execute
        success = self.scheduler.schedule()
      File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/scheduler.py", line 592, in schedule
        self.run(runjobs)
      File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/scheduler.py", line 641, in run
        executor.run_jobs(
      File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 155, in run_jobs
        self.run(
      File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 1156, in run
        self.write_jobscript(job, jobscript)
      File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 901, in write_jobscript
        with open(jobscript, "w") as f:
    PermissionError: [Errno 13] Permission denied: '/data/sjuhel/BoARIO-inputs/.snakemake/tmp.7lly73mm/snakejob.run_generic.11469.sh'


Minimal example

The error doesn't happen when running a smaller pipeline.

Additional context

I am running the following pipeline of simulations on a cluster using SLURM.

Profile:

cluster:
  mkdir -p /scratchu/sjuhel/logs/smk/{rule} &&
  sbatch
    --parsable
    --mem={resources.mem_mb}
    --job-name=smk-{rule}-{wildcards}
    --cpus-per-task={threads}
    --output=/scratchu/sjuhel/logs/smk/{rule}/{rule}-{wildcards}-%j.out
    --time={resources.time}
    --partition={resources.partition}
default-resources:
  - mem_mb=2000
  - partition=zen16
  - time=60
  - threads=4
restart-times: 0
max-jobs-per-second: 10
max-status-checks-per-second: 1
local-cores: 1
latency-wait: 60
jobs: 16
keep-going: False
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
use-conda: True
conda-frontend: mamba
cluster-status: status-sacct.sh
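
For reference, status-sacct.sh is a cluster-status script: Snakemake calls it with the SLURM job ID and expects it to print success, failed, or running. A minimal sketch of such a script (not necessarily the exact one used here; the sacct parsing is an assumption about the SLURM setup):

#!/usr/bin/env bash
# Minimal cluster-status sketch: Snakemake passes the SLURM job ID as $1
# and expects one of "success", "failed", or "running" on stdout.
jobid="$1"
state=$(sacct -j "$jobid" --format=State --noheader | head -n 1 | awk '{print $1}')

case "$state" in
  COMPLETED)
    echo success
    ;;
  FAILED|CANCELLED*|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED)
    echo failed
    ;;
  *)
    echo running
    ;;
esac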

The pipeline is run via:

nohup snakemake generate_csv_from_all_xp --profile simple > "../Runs/snakelog-$(date +"%FT%H%M%z").log" 2>&1 &

Rules to run (last execution with the error):

Using shell: /usr/bin/bash
Provided cluster nodes: 16
Job stats:
job                         count    min threads    max threads
------------------------  -------  -------------  -------------
generate_csv_from_all_xp        1              1              1
generate_csv_from_xp            3              1              1 -> indicator aggregation
indicators                   4916              1              1 -> symlink to exp folder (multiple exps can share similar results)
indicators_generic           4799              1              1 -> indicators from simulations
init_all_sim_parquet            1              1              1
run_generic                  4771              4              4 -> simulations to run
xp_parquet                      3              1              1 -> regroup simulations by different scenarios
total                       14494              1              4

What I don't understand is that Snakemake is able to write other temporary .sh files, and I can't tell at which point the error happens. I also have no idea how to debug this.
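
One sanity check (reusing the path from the traceback above; the tmp.* directory name changes between runs) would be to test whether the temporary directory is still writable when the error shows up:

# Hypothetical probe commands, using the directory from the traceback:
ls -ld /data/sjuhel/BoARIO-inputs/.snakemake/tmp.7lly73mm/
touch /data/sjuhel/BoARIO-inputs/.snakemake/tmp.7lly73mm/probe && echo "still writable"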

Edit [23/01/2023]:

The issue might be related to exiting the SSH session on the cluster. Could it be that the nohup'ed Snakemake process can no longer write files once I am disconnected from the server?

→ I will try with screen instead of nohup.
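
For reference, the screen approach would look something like this (the session name is arbitrary):

# Start a named session, run the pipeline inside it, then detach with Ctrl-A D
screen -S smk
snakemake generate_csv_from_all_xp --profile simple
# Reattach later with:
screen -r smk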

Upvotes: 1

Views: 1074

Answers (1)

SJuhel

Reputation: 182

There is a very high probability that Snakemake lost the rights to write the temporary script files when I disconnected from the server.

I solved the problem by invoking the following script with sbatch:

#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time=2-00:00:00
#SBATCH --job-name=Snakemake
#SBATCH --mem=2048

for i in $(seq 1 10); do
    snakemake --batch generate_csv_from_all_xp="$i/10" --profile simple > "../Runs/snakelog-$(date +"%FT%H%M%z").log" 2>&1
done
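
The script is submitted once, e.g. with sbatch run_snakemake.sh (hypothetical filename). The --batch generate_csv_from_all_xp=$i/10 flag tells Snakemake to only process the i-th of ten batches of that rule's input files, so each invocation builds a much smaller DAG.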

I haven't tried tmux or screen, but they would probably work as well. As the DAG is a bit heavy to compute, I thought it best not to run it on the login node of the cluster (plus, with sbatch I get an email notification when the job ends or fails).

Note that I have read that running Snakemake on login nodes is unlikely to pose problems.

Upvotes: 1
