Embra_QN

Reputation: 1

Snakemake process multiple files in one rule

What is the best way to process a list of files in one rule?

Workflow

The main goal of the workflow is to select the raw data and output the selected data. The directory of the workflow is structured as below.

.
├── data
│   ├── 000_raw
│   │   ├── 15_a.csv
│   │   ├── 15_b.csv
│   │   ├── 15_c.csv
│   │   ├── 16_a.csv
│   │   ├── 16_b.csv
│   │   └── 16_c.csv
│   └── 010_sel
│       ├── 15_a.csv
│       ├── 15_b.csv
│       ├── 15_c.csv
│       ├── 16_a.csv
│       ├── 16_b.csv
│       └── 16_c.csv
├── scripts
│   └── 010_sel.py
└── Snakefile

The selection script 010_sel.py reads one file and produces one file at a time, i.e. the usual way to run it is

python scripts/010_sel.py data/000_raw/15_a.csv data/010_sel/15_a.csv

Snakefile

I use expand together with a run block in the Snakefile.

import os

ls_year_type = ["15_a", "15_b", "15_c", "16_a", "16_b", "16_c"]

rule sel_010:
    input:
        expand("data/000_raw/{year_type}.csv", year_type=ls_year_type)
    output:
        expand("data/010_sel/{year_type}.csv", year_type=ls_year_type)
    run:
        # Call the script once per input/output pair, one after another
        for ifile in range(len(output)):
            os.system("python scripts/010_sel.py {} {}".format(input[ifile], output[ifile]))

Problems

There are two problems with this method.

Optional method

One option is to rewrite 010_sel.py so that it uses the snakemake object rather than sys.argv:

for i in range(len(snakemake.input)):
    input_file = snakemake.input[i]
    output_file = snakemake.output[i]

In the Snakefile, change run to script:

script:
    "scripts/010_sel.py"

This will solve the second problem but the first one remains.

Thanks in advance for any help.

Upvotes: 0

Views: 609

Answers (1)

Eric C.

Reputation: 3368

Snakemake is a workflow manager with many features, including parallelization, error recovery and code change awareness. It lets you define the logic of a process once and apply it to many samples using wildcards, scaling as needed.

By writing a for loop inside a rule, you're basically losing all of the features above.
Here's how you (probably) should do it:

ls_year_type = ["15_a", "15_b", "15_c", "16_a", "16_b", "16_c"]

rule all:
    input:
        expand("data/010_sel/{year_type}.csv", year_type=ls_year_type)

rule sel_010:
    input:
        "data/000_raw/{year_type}.csv"
    output:
        "data/010_sel/{year_type}.csv"
    shell: 
        "python scripts/010_sel.py {input} {output}"

Then run snakemake with:

$ snakemake --jobs 5

Adjust the number of parallel jobs as needed, depending on your computer or HPC.
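
To see why one wildcard rule is enough: rule all requests six concrete paths, and each path is matched against the output pattern of sel_010 to work out year_type and therefore the input. The plain-Python sketch below only mimics that idea for this one wildcard. The helpers expand_targets and infer_input are hypothetical names for illustration, not part of Snakemake.

```python
import re

ls_year_type = ["15_a", "15_b", "15_c", "16_a", "16_b", "16_c"]

# Hypothetical pattern mirroring the output of rule sel_010
OUTPUT_RE = re.compile(r"data/010_sel/(?P<year_type>.+)\.csv")


def expand_targets(pattern, year_types):
    """Fill the {year_type} placeholder once per value, roughly like expand()."""
    return [pattern.format(year_type=yt) for yt in year_types]


def infer_input(target):
    """Derive the raw input path for one requested output path."""
    match = OUTPUT_RE.fullmatch(target)
    if match is None:
        raise ValueError("no rule produces " + target)
    return "data/000_raw/{year_type}.csv".format(**match.groupdict())


targets = expand_targets("data/010_sel/{year_type}.csv", ls_year_type)
for target in targets:
    # Each pair corresponds to one independent job that can run in parallel
    print(infer_input(target), "->", target)
```

Because every pair is its own job, --jobs can schedule several of them at once, which a for loop inside a single rule cannot do.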

You can also use the script directive, as you suggest. Snakemake then makes a snakemake object available inside the script, holding the rule's input, output, wildcards and so on, so the script no longer needs command line arguments:

rule sel_010:
    input:
        "data/000_raw/{year_type}.csv"
    output:
        "data/010_sel/{year_type}.csv"
    script: 
        "scripts/010_sel.py"
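
With the per-file rule above, the script sees exactly one input and one output, so the loop over snakemake.input from the question is no longer needed. Below is a minimal sketch of what scripts/010_sel.py might look like. The select function and its row filter are hypothetical placeholders, since the real selection logic is not shown in the question.

```python
import csv


def select(in_path, out_path):
    """Copy every non-empty row from in_path to out_path.

    Placeholder logic only: the real selection criteria are an assumption.
    """
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if any(cell.strip() for cell in row):  # hypothetical filter
                writer.writerow(row)


# The `snakemake` object only exists when the file is run through the
# script directive, so guard the call to keep the module importable.
if "snakemake" in globals():
    select(snakemake.input[0], snakemake.output[0])
```

Keeping the work in a function also makes the script easy to test outside of Snakemake.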

By doing it this way, you let snakemake take care of the scaling. If one of the input files changes, snakemake only re-runs the rule for that file. If the code of the script changes, snakemake re-runs all the files. You can also change this behavior with the snakemake option --cleanup-metadata.
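
The per-file re-run can be pictured as a timestamp comparison on each input/output pair. This is a deliberately simplified sketch: needs_rerun is a hypothetical helper, and Snakemake's real decision takes more into account than modification times (for example the code change mentioned above).

```python
import os


def needs_rerun(input_path, output_path):
    """Return True if the output is missing or older than its input."""
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(input_path) > os.path.getmtime(output_path)


if __name__ == "__main__":
    # Check each pair on its own, which is what the wildcard rule makes possible
    for year_type in ["15_a", "15_b", "15_c", "16_a", "16_b", "16_c"]:
        raw = "data/000_raw/{}.csv".format(year_type)
        sel = "data/010_sel/{}.csv".format(year_type)
        if os.path.exists(raw):
            print(year_type, "re-run" if needs_rerun(raw, sel) else "up to date")
```

With a single rule that lists all six files as its output, this check cannot be made per file: the whole rule is either up to date or not.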

Upvotes: 0
