Reputation: 43
I have a Snakemake (7.22.0) workflow whose jobs stall after they start. The rules run on a cluster (through PBS) and execute an external Python script. I've noticed that some jobs now stall for a very long time before executing the script: the job starts, and Snakemake reports that the rule is running, but the actual script only starts about two hours later. The output I get from the job looks like this:
[Tue Oct 15 23:13:13 2024]
rule ...:
input: ...
output: ...
jobid: 0
reason: Missing output files: ...
wildcards: ...
resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/var/tmp/pbs.<job id>.<cluster name>
2024-10-16 01:21:37.393620 log from first line of the script
...
2024-10-16 01:21:41.212192 log from last line of the script (after reading large files)
Not cleaning up <tmp script path>
[Wed Oct 16 01:21:41 2024]
Finished job 0.
1 of 1 steps (100%) done
Has anyone experienced something like this? What might Snakemake be doing that could cause this? I'm generating lots of files in the workflow (only one in this job), so that's a suspected cause, but I don't quite see how it would lead to this. Also, the top-level "all" rule triggers many other rules (thousands, though with a limit on the number of jobs submitted to PBS), and executing that takes ~20 minutes, but that is not the rule running here. Other instances of the same rule execute normally, too.
These are the statistics from PBS at a point during the job's execution, before the external script had started:
Job Id: ...
Job_Name = snakejob....
Job_Owner = ...
resources_used.cpupercent = 4
resources_used.cput = 00:00:44
resources_used.mem = 231660kb
resources_used.ncpus = 1
resources_used.vmem = 977976kb
resources_used.walltime = 00:54:14
The memory consumption seems excessive to me, but I'm not sure. Is there something Snakemake does on startup that could use this much memory (under extreme conditions, whatever they may be)?
Upvotes: 1
Views: 60
Reputation: 43
The problem turned out to be that the directory workdir/.snakemake/scripts had become clogged with a huge number of files (~600,000) left over from previous runs of the workflow. Deleting the old scripts there solved the problem.
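Snakemake stores a per-job copy of the script in that directory (the "Not cleaning up <tmp script path>" line in the log above refers to one of these copies), so long-running workflows can accumulate a very large number of files there. In case someone needs to do a similar cleanup, here is a minimal sketch (not the exact command I used); both the directory path and the one-week age cutoff are assumptions you should adjust to your own setup:
# Hypothetical cleanup of leftover Snakemake script copies from previous runs.
# The path and the one-week cutoff are assumptions; adjust them as needed.
import time
from pathlib import Path

scripts_dir = Path("workdir/.snakemake/scripts")  # directory mentioned above
cutoff = time.time() - 7 * 24 * 3600              # keep files modified within the last week

for f in scripts_dir.iterdir():
    if f.is_file() and f.stat().st_mtime < cutoff:
        f.unlink()  # remove the stale script copy
Run this only while no jobs from the workflow are active, since current jobs still need their own script copies.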
Upvotes: 0