Varying (known) number of outputs in Snakemake

Question

I have a Snakemake rule that works on a data archive and essentially unpacks the data in it. The archives contain a varying number of files that I know before my rule starts, so I would like to exploit this and do something like

rule unpack:
    input: '{id}.archive'
    output: 
        lambda wildcards: ARCHIVE_CONTENTS[wildcards.id]

but I can't use functions in output, and for good reason. However, I can't come up with a good replacement. The rule is very expensive to run, so I cannot do

rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'

and run the rule several times for each archive. Another alternative could be

rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
    run:
        if os.path.isfile(output[0]):
            return
        ...

but I am afraid that would introduce a race condition.

Is marking the rule output with dynamic really the only option? I would be fine with auto-generating a separate rule for every archive, but I haven't found a way to do so.

Johannes Köster · Accepted Answer

Here it comes in handy that Snakemake is an extension of plain Python. You can generate a separate rule for each archive:

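for id, contents in ARCHIVE_CONTENTS.items():
    rule:
        input: 
            '{id}.tar.gz'.format(id=id)
        output: 
            # id must be passed to expand() too, otherwise {id} stays unresolved
            expand('{id}/{outfile}', id=id, outfile=contents)
        params:
            # these rules have no wildcards, so hand the directory over as a param
            outdir=id
        shell:
            'tar -C {params.outdir} -xf {input}'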
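
The loop above assumes ARCHIVE_CONTENTS is a dictionary mapping each archive id to the list of files inside it. How that dictionary is obtained is not part of the question; as a minimal sketch, assuming the archives are tar files named '<id>.tar.gz' in the working directory, it could be built with Python's standard tarfile module. The helper names list_archive_members and build_archive_contents are hypothetical and only serve as illustration:

```python
import tarfile
from pathlib import Path


def list_archive_members(archive_path):
    """Return the names of the regular files stored in a tar archive."""
    with tarfile.open(archive_path) as tar:
        # Skip directories, links, etc.; only regular files become rule outputs.
        return [member.name for member in tar.getmembers() if member.isfile()]


def build_archive_contents(directory="."):
    """Map each archive id (file name without '.tar.gz') to its member files.

    Assumption: archives are named '<id>.tar.gz' and live in `directory`.
    """
    contents = {}
    for path in sorted(Path(directory).glob("*.tar.gz")):
        archive_id = path.name[: -len(".tar.gz")]
        contents[archive_id] = list_archive_members(path)
    return contents
```

At the top of the Snakefile one could then write ARCHIVE_CONTENTS = build_archive_contents(). Note that this reads every archive each time the workflow is parsed, which may be slow for large compressed archives; if the contents are already known from a manifest, loading that instead is probably the better choice.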

Depending on what kind of archive this is, you could also have a single rule that just extracts the desired file, e.g.:

rule unpack:
    input:
        '{id}.tar.gz'
    output:
        '{id}/{outfile}'
    shell:
        'tar -C {wildcards.id} -xf {input} {wildcards.outfile}'
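
With either variant, something still has to request the unpacked files as targets. As a rough sketch, assuming the same ARCHIVE_CONTENTS dictionary as in the question, the full list of expected output paths could be computed like this (all_unpacked_files is a hypothetical helper name):

```python
def all_unpacked_files(archive_contents):
    """Return every expected '<id>/<outfile>' path for the given archives."""
    return [
        f"{archive_id}/{outfile}"
        for archive_id, outfiles in archive_contents.items()
        for outfile in outfiles
    ]
```

The returned list could then be used as the input of a target rule, for example input: all_unpacked_files(ARCHIVE_CONTENTS) in a rule named all.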
