Josh Loecker

Reputation: 386

Snakemake partial expand, using one output file per execution in the following rule

I am having trouble executing the final rule in my Snakefile once for each input I provide it. The rule currently uses a partial expand to fill one value, as seen in rule download.

However, when using the expand function, the rule sees its input as a single list of strings, so it is executed only once. I would like three executions of the rule, each with a single input string, so the downloads can happen in parallel.
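To show what I mean, here is a plain-Python illustration of what expand produces in my case (the tag values are examples standing in for get_tag_data()):

```python
# Mimic expand("init/{tissue_name}_{tag}.csv", tag=..., allow_missing=True):
# allow_missing=True keeps {tissue_name} as a literal wildcard, and every
# tag value produces one entry in a single flat list.
tags = ["S1R1", "S1R2", "S1R3"]  # example values from get_tag_data()
files = ["init/{{tissue_name}}_{tag}.csv".format(tag=t) for t in tags]
print(files)
# The whole list becomes the input of ONE download job, not three.
```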

Here is the Snakefile I am using:

Snakefile

import csv
import os

def get_tissue_name():
    tissue_data = []
    with open("master_init.csv", "r") as rfile:
        reader = csv.reader(rfile)
        for line in reader:
            id = line[1].split("_")[0]  # naiveB_S1R1 -> naiveB
            tissue_data.append(id)

    return tissue_data


def get_tag_data():
    tag_data = []
    with open("master_init.csv", "r") as rfile:
        reader = csv.reader(rfile)
        for line in reader:
            tag = line[1].split("_")[-1]
            tag_data.append(tag)  # example: S1R1

    return tag_data


rule all:
    input:
        # Execute distribute & download
        expand(os.path.join("output","{tissue_name}"),
               tissue_name=get_tissue_name())

rule distribute:
    input: "master_init.csv"
    output: "init/{tissue_name}_{tag}.csv"
    params:
        id = "{tissue_name}_{tag}"
    run:
        with open(str(input), "r") as rfile, open(str(output), "w") as wfile:
            for line in rfile:
                line = line.rstrip()  # remove trailing newline

                # Only write the line if it contains the current
                # tissue-name_tag (e.g. naiveB_S1R1)
                if params.id in line:
                    wfile.write(line)

rule download:
    input: expand("init/{tissue_name}_{tag}.csv", 
                  tag=get_tag_data(), allow_missing=True)
    output: directory(os.path.join("output", "{tissue_name}"))
    shell:
        """
        while read srr name endtype; do
            fastq-dump --split-files --gzip $srr --outdir {output}
        done < {input}
        """

When I execute this Snakefile, I get the following output:

> snakemake --cores 1 --dry-run

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job           count    min threads    max threads
----------  -------  -------------  -------------
all               1              1              1
distribute        3              1              1
download          1              1              1
total             5              1              1


[Wed Sep  1 15:28:27 2021]
rule distribute:
    input: master_init.csv
    output: init/naiveB_S1R1.csv
    jobid: 2
    wildcards: tissue_name=naiveB, tag=S1R1
    resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T

[Wed Sep  1 15:28:27 2021]
Finished job 2.
1 of 5 steps (20%) done

[Wed Sep  1 15:28:27 2021]
rule distribute:
    input: master_init.csv
    output: init/naiveB_S1R2.csv
    jobid: 3
    wildcards: tissue_name=naiveB, tag=S1R2
    resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T

[Wed Sep  1 15:28:27 2021]
Finished job 3.
2 of 5 steps (40%) done

[Wed Sep  1 15:28:27 2021]
rule distribute:
    input: master_init.csv
    output: init/naiveB_S1R3.csv
    jobid: 4
    wildcards: tissue_name=naiveB, tag=S1R3
    resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T

[Wed Sep  1 15:28:27 2021]
Finished job 4.
3 of 5 steps (60%) done

[Wed Sep  1 15:28:27 2021]
rule download:
    input: init/naiveB_S1R1.csv, init/naiveB_S1R2.csv, init/naiveB_S1R3.csv
    output: output/naiveB
    jobid: 1
    wildcards: tissue_name=naiveB
    resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T

/bin/bash: -c: line 2: syntax error near unexpected token `init/naiveB_S1R2.csv'
/bin/bash: -c: line 2: `        done < init/naiveB_S1R1.csv init/naiveB_S1R2.csv init/naiveB_S1R3.csv'
[Wed Sep  1 15:28:27 2021]
Error in rule download:
    jobid: 1
    output: output/naiveB
    shell:

        while read srr name endtype; do
            fastq-dump --split-files --gzip $srr --outdir output/naiveB
        done < init/naiveB_S1R1.csv init/naiveB_S1R2.csv init/naiveB_S1R3.csv

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /Users/joshl/PycharmProjects/FastqToGeneCounts/exampleSnakefile/.snakemake/log/2021-09-01T152827.234141.snakemake.log

I am getting an error in rule download because of the done < {input} portion: the entire input list is substituted after the redirect, rather than a single file. In an ideal execution, rule download would run three separate times, once for each input file.

A simple fix is to wrap the while . . . done section in a for loop, but then I lose the ability to download multiple SRR files at the same time.
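For clarity, the for-loop workaround I mean would look roughly like this (a sketch only; the files are processed serially inside one job):

```
rule download:
    input: expand("init/{tissue_name}_{tag}.csv",
                  tag=get_tag_data(), allow_missing=True)
    output: directory(os.path.join("output", "{tissue_name}"))
    shell:
        """
        for f in {input}; do
            while read srr name endtype; do
                fastq-dump --split-files --gzip $srr --outdir {output}
            done < "$f"
        done
        """
```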

Does anyone know if this is possible?

Upvotes: 1

Views: 363

Answers (1)

Dmitry Kuzminov

Reputation: 6584

You cannot execute a single rule multiple times for the same output. In your rule download, the output depends only on tissue_name and not on tag.

You have a choice: either make the output a filename that depends on tag (such as the file you are downloading), or loop over the inputs inside the rule itself.
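For the first option, an untested sketch (reusing the names from your Snakefile) could look like this; tag becomes part of the output path, so each (tissue_name, tag) pair is its own job and the downloads can run in parallel:

```
rule all:
    input:
        # zip pairs tissue_name and tag row-by-row instead of taking
        # the full cross product
        expand(os.path.join("output", "{tissue_name}", "{tag}"),
               zip,
               tissue_name=get_tissue_name(),
               tag=get_tag_data())

rule download:
    input: "init/{tissue_name}_{tag}.csv"
    output: directory(os.path.join("output", "{tissue_name}", "{tag}"))
    shell:
        """
        while read srr name endtype; do
            fastq-dump --split-files --gzip $srr --outdir {output}
        done < {input}
        """
```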

Upvotes: 2
