obk

Reputation: 740

Snakemake rule to delete a file

I have a large Snakefile that looks like this (heavily simplified):

rule a:
    input: '{path}.csv'
    output: '{path}.a.csv'
    shell: 'cp {input} {output}'
rule b:
    input: '{path}.csv'
    output: '{path}.b.csv'
    shell: 'cp {input} {output}'
rule c:
    input: '{path}.csv'
    output: '{path}.c.csv'
    shell: 'cp {input} {output}'
rule d:
    input: '{path}.csv'
    output: '{path}.d.csv'
    shell: 'cp {input} {output}'
rule all:
    input: 'raw1.a.b.c.a.d.csv',
           'raw2.a.b.c.d.a.csv'

(This setup lets me use rules like functions, by chaining their filename suffixes in the all rule.)
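
To make the chaining concrete, here is a minimal Python sketch (not part of the Snakefile) that lists the files Snakemake has to build for one target. The helper name `chain_of_files` is hypothetical, and the sketch assumes the raw file name itself contains no dots, so every dot-separated piece after it names a rule.

```python
def chain_of_files(target: str, ext: str = ".csv") -> list:
    """List the files needed to build `target`, in build order.

    Assumption: the raw file name has no dots, so every dot-separated
    piece after it is the suffix added by one rule (a, b, c or d).
    """
    if not target.endswith(ext):
        raise ValueError(f"{target!r} does not end with {ext!r}")
    stem = target[: -len(ext)]          # e.g. 'raw1.a.b.c.a.d'
    base, *suffixes = stem.split(".")   # 'raw1', ['a', 'b', 'c', 'a', 'd']
    files = [base + ext]                # the raw input comes first
    for suffix in suffixes:
        base = f"{base}.{suffix}"       # each rule appends its own suffix
        files.append(base + ext)
    return files


for name in chain_of_files("raw1.a.b.c.a.d.csv"):
    print(name)
```

Each printed name after the first is the output of one rule application, which is why a single target in the all rule pulls in the whole chain.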

Starting state:

$ ls -tr1
Snakefile
raw1.csv
raw2.csv

$ snakemake all
...

After:

$ ls -tr1
Snakefile
raw1.csv
raw2.csv
raw2.a.csv
raw2.a.b.csv
raw2.a.b.c.csv
raw2.a.b.c.d.csv
raw1.a.csv
raw1.a.b.csv
raw1.a.b.c.csv
raw1.a.b.c.a.csv
raw1.a.b.c.a.d.csv
raw2.a.b.c.d.a.csv

Now, I'd like to add a rule that deletes specific intermediate files (for example raw1.a.csv and raw2.a.b.csv) because I don't need them and they take up a lot of disk space. I can't mark the outputs with temp() because of the wildcard {path}.

Any tips? Thanks.

Upvotes: 0

Views: 1104

Answers (2)

obk

Reputation: 740

EDIT: Actually, this solution doesn't work. It results in a race condition: nothing forces the remove rule to run after the rules that still read the file as input.


Ok, I figured it out...

rule a:
    input: '{path}.csv'
    output: '{path}.a.csv'
    shell: 'cp {input} {output}'
rule b:
    input: '{path}.csv'
    output: '{path}.b.csv'
    shell: 'cp {input} {output}'
rule c:
    input: '{path}.csv'
    output: '{path}.c.csv'
    shell: 'cp {input} {output}'
rule d:
    input: '{path}.csv'
    output: '{path}.d.csv'
    shell: 'cp {input} {output}'
rule remove:                          # <-- rule to delete a file
    input: '{path}'
    output: touch('{path}.removed')
    shell: 'rm {input}'
rule all:
    input: 'raw1.a.b.c.a.d.csv',
           'raw2.a.b.c.d.a.csv',
           'raw1.a.csv.removed',      # <-- specify which files to rm
           'raw2.a.b.c.csv.removed',  # <-- specify which files to rm

And here's the DAG:

$ snakemake --dag all | dot -Tpng > dag.png


Upvotes: 0

Manavalan Gajapathy

Reputation: 4089

temp() does work in this scenario.

rule all:
    input: 'raw1.a.b.c.a.d.csv',
        'raw2.a.b.c.d.a.csv'

rule a:
    input: '{path}.csv'
    output: temp('{path}.a.csv')
    shell: 'cp {input} {output}'
rule b:
    input: '{path}.csv'
    output: '{path}.b.csv'
    shell: 'cp {input} {output}'
rule c:
    input: '{path}.csv'
    output: temp('{path}.c.csv')
    shell: 'cp {input} {output}'
rule d:
    input: '{path}.csv'
    output: '{path}.d.csv'
    shell: 'cp {input} {output}'

Running this keeps raw1.a.b.c.a.d.csv, raw1.a.b.csv, raw2.a.b.c.d.csv and raw2.a.b.csv, and automatically deletes raw1.a.csv, raw2.a.csv, raw1.a.b.c.csv, raw2.a.b.c.csv, raw1.a.b.c.a.csv and raw2.a.b.c.d.a.csv (every output of rules a and c).
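
As a rough cross-check of that list, the following Python sketch sorts every intermediate file by whether the rule that produced it is marked temp(). It only models the file naming and does not run Snakemake. The helper name `split_kept_and_temp` is hypothetical, and the sketch assumes the raw file names contain no dots and that every temp() output is eventually removed, including one that is itself a target of rule all.

```python
def split_kept_and_temp(targets, temp_rules, ext=".csv"):
    """Split the files built for `targets` into (kept, deleted) sets.

    Assumptions: raw file names have no dots, each later dot-separated
    piece names the rule that wrote the file, and every output of a rule
    in `temp_rules` ends up deleted.
    """
    kept, deleted = set(), set()
    for target in targets:
        stem = target[: -len(ext)]
        base, *suffixes = stem.split(".")
        for suffix in suffixes:
            base = f"{base}.{suffix}"
            # The last suffix identifies the rule that produced this file.
            bucket = deleted if suffix in temp_rules else kept
            bucket.add(base + ext)
    return kept, deleted


kept, deleted = split_kept_and_temp(
    ["raw1.a.b.c.a.d.csv", "raw2.a.b.c.d.a.csv"],
    temp_rules={"a", "c"},  # rules whose output is wrapped in temp() above
)
print("kept:   ", sorted(kept))
print("deleted:", sorted(deleted))
```

Changing `temp_rules` shows the trade-off: temp() applies to every output of a rule, so it cannot single out one file such as raw1.a.csv while keeping raw2.a.csv.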

Upvotes: 1
