Antje Janosch
Antje Janosch

Reputation: 1196

Snakemake - files as input/output with timestamp

I try to setup a snakefile (snakemake 7.19.1). The final output should contain a timestamp. This is a minimal example:

#!/bin/python
print('import packages')

from datetime import datetime
import time

now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
tarfile = now + '_stardist.tar'
print(tarfile)

rule all:
    input: {tarfile}

rule main_rule:
    input: 
    output: {tarfile}
    run: 
        shell('touch ' + output[0])

If I run that snakefile with snakemake -s snakefile --cores 1 I get the following output resulting in an error message:

import packages
2025-02-04_11-33-38_stardist.tar
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job          count    min threads    max threads
---------  -------  -------------  -------------
all              1              1              1
main_rule        1              1              1
total            2              1              1

Select jobs to execute...

[Tue Feb  4 11:33:38 2025]
rule main_rule:
    output: 2025-02-04_11-33-38_stardist.tar
    jobid: 1
    reason: Missing output files: 2025-02-04_11-33-38_stardist.tar
    resources: tmpdir=/tmp

import packages
2025-02-04_11-33-39_stardist.tar
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Select jobs to execute...
Waiting at most 5 seconds for missing files.
MissingOutputException in rule main_rule in file /lustre/projects/xxx/test_timestamp_minimal/snakefile, line 14:
Job 1  completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
2025-02-04_11-33-38_stardist.tar
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2025-02-04T113338.450326.snakemake.log

As it is now the name of tarfile is created twice and the file which is finally created does not have the same timestamp like the file which is expected as output. Can anybody please help?

Update

I created a fresh conda environment with python 3.12 and snakemake 8.27.1. I tested my code on a local Windows10 machine (miniforge) and on CentOS7 (the head node of an HPC environment). While there is no issue on my local machine, I get the same error on the HPC system.

This is the output on the local machine (I do not put again the output of the HPC-node as it's comparable to what I've already reported):

import packages
2025-02-05_13-13-27_stardist.tar
Assuming unrestricted shared filesystem usage.
host: screening-pc-4
Building DAG of jobs...
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job          count
---------  -------
all              1
main_rule        1
total            2

Select jobs to execute...
Execute 1 jobs...

[Wed Feb  5 13:13:27 2025]
localrule main_rule:
    output: 2025-02-05_13-13-27_stardist.tar
    jobid: 1
    reason: Missing output files: 2025-02-05_13-13-27_stardist.tar
    resources: tmpdir=C:\Users\xxx\AppData\Local\Temp

[Wed Feb  5 13:13:27 2025]
Finished job 1.
1 of 2 steps (50%) done
Select jobs to execute...
Execute 1 jobs...

[Wed Feb  5 13:13:27 2025]
localrule all:
    input: 2025-02-05_13-13-27_stardist.tar
    jobid: 0
    reason: Input files updated by another job: 2025-02-05_13-13-27_stardist.tar
    resources: tmpdir=C:\Users\xxx\AppData\Local\Temp

[Wed Feb  5 13:13:27 2025]
Finished job 0.
2 of 2 steps (100%) done
Complete log: .snakemake\log\2025-02-05T131327.214796.snakemake.log

What I do not understand: In the case of the Win10 environment (where it's working): Snakemake reports the following lines just once:

import packages
2025-02-04_14-14-23_stardist.tar

while when it goes wrong, the snakemake file seems to be run twice (once for each job)? So the first execution gets a different filename due to the execution time compared to the second job.

Upvotes: 0

Views: 68

Answers (3)

Wayne
Wayne

Reputation: 9954

This is a current, known gotcha.

"python twice" as search terms in Snakemake's GitHub Issues page came up with the following still open (at this time) issue posted mid-2023 & labeled as a 'bug', although maybe that label is erroneous according to Tim Booth because of parallel job handling:

ISSUE: 'Snakefile is run a second time if rule contains run directive'.
Quote from stefmldk in that thread (linked specifically above):

"I can confirm this observation (version 7.32.4). Frustrating if you use time stamps in file names. It seems that a list of output files to be expected is generated in the first run only, but if a time stamp and thus file names get changed on the second run, the files expected by Snakemake never get created and a MissingOutputException is thrown after 5 seconds. Temporary solution that works most of the time: Exclude seconds from time stamps."

This was the exact issue and work-around we worked out via the comments in looking into this.
To highlight the current work-around that agrees with our findings:

"Temporary solution that works most of the time: Exclude seconds from time stamps."

Upvotes: 0

Tim Booth
Tim Booth

Reputation: 713

Despite what commenters have said in issue #2350, this phenomenon of Snakemake running the python code twice is a feature, not a bug, and it's not going to be "fixed". Even if the behaviour was changed for the simple -j1 case it will still manifest as soon as you allow parallel jobs or submit your jobs to a HPC cluster (this includes your fixed code, @antje-janosch!)

The actual bug in this code is on line 16. If you replace {tarfile} with "{tarfile}" (only on line 16, not line 12) in the original example you will find that it works, just by adding the two quotes.

The difference is this: {tarfile} means "a python set containing the single fixed string from the variable tarfile", whereas "{tarfile}" is a template string that will act as a placeholder for any filename. These are not the same!

Upvotes: 1

Antje Janosch
Antje Janosch

Reputation: 1196

Thanks to @Wayne!

According to the issue Snakefile is run a second time if rule contains run directive #2350 I decided to use "shell" instead of "run" in my main_rule. This prevents snakemake from running the snakefile code twice. My working snakefile now looks like this:

#!/bin/python
print('import packages')

from datetime import datetime
import time

now = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
tarfile = now + '_stardist.tar'
print(tarfile)

rule all:
    input: {tarfile}

rule main_rule:
    input: 
    output: {tarfile}
    shell: 
        "touch {tarfile}"

Upvotes: 0

Related Questions