Nikko

Reputation: 1572

Snakemake: how to produce multiple outputs from one input

I am trying to build a Snakemake workflow for Earth Observation applications, and I have to download data from S3. First, I have a rule that queries for the data I need based on parameters in a file. The output of this rule is a list of the files I need to download.

localrules: all

rule all:
    input:
        'results/test.csv'

rule query:
    input: 'input/{file}.csv'
    output: 'results/{file}.csv'
    shell: 'python search_catalog.py {input} {output}'

Now I need to download those data. How can I make a rule that reads the list, downloads each item listed, and declares the downloaded files as its output? Where can I read the content of results/something.csv and declare its entries in DATASET?

rule download:
    input: 'results/{file}.csv'
    output: expand('data/{file}', file=DATASET)
    shell: 'aws s3 cp s3://eodata/Sentinel-2/MSI/L2A/2024/01/15/{output}'

Upvotes: 0

Views: 185

Answers (1)

Tim Booth

Reputation: 713

The short answer is that you can't. The way Snakemake works is that it builds a DAG (i.e. its work plan) by starting with a desired final output file (the target) and looking for a rule that could generate that file. If that rule needs inputs, it looks for rules to generate those files, and keeps working backwards until it runs out of rules. It resolves all of the inputs and outputs of all the jobs in the DAG before it runs any shell commands.

So you are thinking that Snakemake starts with an input, runs the shell command, and gets a bunch of outputs. But that's not the case: it starts with an output filename, works out what the input would be, and only then runs the shell command, so you need to resolve things in that order.
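To make that concrete with the query rule from your question, the resolution order looks like this (comments annotate what Snakemake infers at each step):

```python
# Given the target 'results/test.csv', Snakemake matches it against
# the output pattern 'results/{file}.csv', so the wildcard file='test'.
# Only then does it know the input must be 'input/test.csv', and only
# after the whole DAG is built does any shell command actually run.
rule query:
    input: 'input/{file}.csv'
    output: 'results/{file}.csv'
    shell: 'python search_catalog.py {input} {output}'
```

Nothing about the *contents* of results/test.csv is available at DAG-building time, which is why your download rule can't enumerate its outputs from it.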

In this case, I'd need to see more details to make a firm recommendation, but what you probably need to do is to split your workflow in two parts:

  1. First part will just download the files. You can have an output directory per .csv file if you are processing multiple .csv files at once. You can make the whole directory be the output of the rule. Or else this part may be easier to implement as a shell script, or in vanilla Python.
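A minimal sketch of part 1, using a directory() output so you don't have to name the individual files up front. The download script name download_from_s3.py is hypothetical; substitute whatever actually reads the CSV and fetches the files:

```python
rule download:
    input:
        'results/{file}.csv'
    output:
        # The whole directory is the output, so Snakemake doesn't need
        # to know the individual file names in advance.
        directory('data/{file}')
    shell:
        'mkdir -p {output} && '
        'python download_from_s3.py {input} {output}'
```

You would then request the directory explicitly, e.g. `snakemake data/test`, to run this first part on its own.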

  2. For the actual workflow: now that you have the files, you can use glob_wildcards() (or just regular glob()) to initiate jobs based on the files that were downloaded, or you can parse the CSV files to get the file names. This logic will probably need to live in an input function attached to the driver rule (i.e. rule all). There are examples in the Snakemake documentation and tutorial.
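A sketch of the CSV-parsing variant of part 2, assuming part 1 has already run, and that results/test.csv holds one file name per line with no header (adjust the parsing to your actual format):

```python
def downloaded_files(wildcards):
    # Read the query result to learn which downloaded files to expect.
    # This runs at DAG-building time, so the CSV must already exist.
    with open('results/test.csv') as fh:
        return ['data/' + line.strip() for line in fh if line.strip()]

rule all:
    input: downloaded_files
```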

It is possible to do everything in one shot by using checkpoint rules. These are nifty, but before trying them you should get the two-part solution working; then a checkpoint will allow you to run both parts with a single snakemake command, if that actually turns out to be useful.
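If you do go the checkpoint route, the shape is roughly this, again assuming the CSV holds one file name per line; the per-file fetch rule reuses the S3 path from your question:

```python
# Declaring query as a checkpoint tells Snakemake to re-evaluate the
# DAG once its output exists.
checkpoint query:
    input: 'input/{file}.csv'
    output: 'results/{file}.csv'
    shell: 'python search_catalog.py {input} {output}'

rule fetch_one:
    output: 'data/{name}'
    shell: 'aws s3 cp s3://eodata/Sentinel-2/MSI/L2A/2024/01/15/{wildcards.name} {output}'

def downloaded_files(wildcards):
    # Asking the checkpoint for its output forces it to run first;
    # only then is the CSV parsed and the rest of the DAG built.
    csv_path = checkpoints.query.get(file='test').output[0]
    with open(csv_path) as fh:
        return ['data/' + line.strip() for line in fh if line.strip()]

rule all:
    input: downloaded_files
```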

Upvotes: 1
