Reputation: 182
I get an annoying write error during a pipeline execution which shuts down everything, and I cannot understand the reason. I don't know whether this is a bug or a usage problem, hence I am posting here before opening an issue on the Snakemake repo.
Snakemake version: 7.19.1
Describe the bug
I randomly get the following error during the execution of a pipeline on a cluster. At the moment, I can't find the reason/context that makes it happen.
It looks like, for some reason, Snakemake cannot write a temporary .sh script, although it has no problem writing the same script for other wildcard sets.
It might be related to the cluster I'm using and not to Snakemake, but I would like to make sure of that.
Logs
Traceback (most recent call last):
File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/__init__.py", line 752, in snakemake
success = workflow.execute(
File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/workflow.py", line 1089, in execute
raise e
File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/workflow.py", line 1085, in execute
success = self.scheduler.schedule()
File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/scheduler.py", line 592, in schedule
self.run(runjobs)
File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/scheduler.py", line 641, in run
executor.run_jobs(
File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 155, in run_jobs
self.run(
File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 1156, in run
self.write_jobscript(job, jobscript)
File "/home/sjuhel/mambaforge/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 901, in write_jobscript
with open(jobscript, "w") as f:
PermissionError: [Errno 13] Permission denied: '/data/sjuhel/BoARIO-inputs/.snakemake/tmp.7lly73mm/snakejob.run_generic.11469.sh'
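When this error appears, one quick sanity check is whether the process can still create files in the temporary directory at all. A minimal diagnostic sketch (the directory argument would be the path named in the traceback; /tmp is only a placeholder default):

```shell
#!/bin/bash
# Hypothetical diagnostic: can we create a file in the directory where
# Snakemake writes its job scripts? Pass that directory as the first argument.
dir="${1:-/tmp}"
probe="$dir/.smk_write_probe.$$"
if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
    echo "writable: $dir"
else
    echo "NOT writable: $dir"
fi
```

Running this from the same shell (or inside a cluster job) right after the failure would show whether the permission loss is reproducible outside Snakemake.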
Minimal example
The error doesn't happen when running a smaller pipeline.
Additional context
I am running the following pipeline of simulations on a cluster using SLURM.
Profile:
cluster:
mkdir -p /scratchu/sjuhel/logs/smk/{rule} &&
sbatch
--parsable
--mem={resources.mem_mb}
--job-name=smk-{rule}-{wildcards}
--cpus-per-task={threads}
--output=/scratchu/sjuhel/logs/smk/{rule}/{rule}-{wildcards}-%j.out
--time={resources.time}
--partition={resources.partition}
default-resources:
- mem_mb=2000
- partition=zen16
- time=60
- threads=4
restart-times: 0
max-jobs-per-second: 10
max-status-checks-per-second: 1
local-cores: 1
latency-wait: 60
jobs: 16
keep-going: False
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
use-conda: True
conda-frontend: mamba
cluster-status: status-sacct.sh
The pipeline is run via:
nohup snakemake generate_csv_from_all_xp --profile simple > "../Runs/snakelog-$(date +"%FT%H%M%z").log" 2>&1 &
Rules to run (last execution with the error):
Using shell: /usr/bin/bash
Provided cluster nodes: 16
Job stats:
job count min threads max threads
------------------------ ------- ------------- -------------
generate_csv_from_all_xp 1 1 1
generate_csv_from_xp 3 1 1 -> indicator aggregation
indicators 4916 1 1 -> symlink to exp folder (multiple exps can share similar results)
indicators_generic 4799 1 1 -> indicators from simulations
init_all_sim_parquet 1 1 1
run_generic 4771 4 4 -> simulations to run
xp_parquet 3 1 1 -> regroup simulations by different scenarios
total 14494 1 4
What I don't understand is that Snakemake is able to write other temporary .sh files, and I can't tell at which point the error happens. I have no idea how to debug this.
Edit [23/01/2023]:
The issue might be related to exiting the SSH session on the cluster. Could it be that the nohup'ed Snakemake process can no longer write files once I am disconnected from the server?
→ I will try with screen instead of nohup.
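For reference, a minimal sketch of the screen approach (the session name smk is arbitrary):

```shell
# Start a detached screen session running the pipeline; it survives SSH logout.
screen -dmS smk bash -c \
  'snakemake generate_csv_from_all_xp --profile simple > "../Runs/snakelog-$(date +"%FT%H%M%z").log" 2>&1'
# Reattach later with: screen -r smk
```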
Upvotes: 1
Views: 1074
Reputation: 182
There is a very high probability that Snakemake lost the rights to write the temporary script files when I disconnected from the server.
I solved the problem by invoking the following script with sbatch:
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --time=2-00:00:00
#SBATCH --job-name=Snakemake
#SBATCH --mem=2048
for i in $(seq 1 10)
do
snakemake --batch generate_csv_from_all_xp=$i/10 --profile simple > "../Runs/snakelog-$(date +"%FT%H%M%z").log" 2>&1
done
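Email notification of the job ending or failing can be requested through sbatch's mail options, added to the header of the script above (the address is a placeholder):

```shell
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@example.org
```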
I haven't tried tmux or screen, but they would probably work as well. As the DAG is a bit heavy to compute, I thought it best not to run it on the login node of the cluster (and with sbatch I get email notification of the job ending/failing).
Note that I read that running Snakemake on login nodes is unlikely to pose problems.
Upvotes: 1