Reputation: 386
I am having trouble executing the final rule in my Snakefile once for each input I provide it. It currently uses a partial expand to fill one value, as seen in rule download. However, when using the expand function, the rule sees its input as a single list of strings and is executed only once. I would like three executions of the rule, each with a single string as its input, so the downloads can happen in parallel.
Here is the Snakefile I am using:
Snakefile
import csv
import os  # needed for os.path.join in the rules below

def get_tissue_name():
    tissue_data = []
    with open("master_init.csv", "r") as rfile:
        reader = csv.reader(rfile)
        for line in reader:
            id = line[1].split("_")[0]  # naiveB_S1R1 -> naiveB
            tissue_data.append(id)
    return tissue_data

def get_tag_data():
    tag_data = []
    with open("master_init.csv", "r") as rfile:
        reader = csv.reader(rfile)
        for line in reader:
            tag = line[1].split("_")[-1]
            tag_data.append(tag)  # example: S1R1
    return tag_data
rule all:
    input:
        # Execute distribute & download
        expand(os.path.join("output", "{tissue_name}"),
               tissue_name=get_tissue_name())

rule distribute:
    input: "master_init.csv"
    output: "init/{tissue_name}_{tag}.csv"
    params:
        id = "{tissue_name}_{tag}"
    run:
        lines = open(str(input), "r").readlines()
        wfile = open(str(output), "w")
        for line in lines:
            line = line.rstrip()  # remove trailing newline
            # Only write the line if the current tissue-name_tag
            # (e.g. naiveB_S1R1) appears in it
            if params.id in line:
                wfile.write(line)
        wfile.close()

rule download:
    input: expand("init/{tissue_name}_{tag}.csv",
                  tag=get_tag_data(), allow_missing=True)
    output: directory(os.path.join("output", "{tissue_name}"))
    shell:
        """
        while read srr name endtype; do
            fastq-dump --split-files --gzip $srr --outdir {output}
        done < {input}
        """
When I execute this Snakefile, I get the following output:
> snakemake --cores 1 --dry-run
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
---------- ------- ------------- -------------
all 1 1 1
distribute 3 1 1
download 1 1 1
total 5 1 1
[Wed Sep 1 15:28:27 2021]
rule distribute:
input: master_init.csv
output: init/naiveB_S1R1.csv
jobid: 2
wildcards: tissue_name=naiveB, tag=S1R1
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
[Wed Sep 1 15:28:27 2021]
Finished job 2.
1 of 5 steps (20%) done
[Wed Sep 1 15:28:27 2021]
rule distribute:
input: master_init.csv
output: init/naiveB_S1R2.csv
jobid: 3
wildcards: tissue_name=naiveB, tag=S1R2
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
[Wed Sep 1 15:28:27 2021]
Finished job 3.
2 of 5 steps (40%) done
[Wed Sep 1 15:28:27 2021]
rule distribute:
input: master_init.csv
output: init/naiveB_S1R3.csv
jobid: 4
wildcards: tissue_name=naiveB, tag=S1R3
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
[Wed Sep 1 15:28:27 2021]
Finished job 4.
3 of 5 steps (60%) done
[Wed Sep 1 15:28:27 2021]
rule download:
input: init/naiveB_S1R1.csv, init/naiveB_S1R2.csv, init/naiveB_S1R3.csv
output: output/naiveB
jobid: 1
wildcards: tissue_name=naiveB
resources: tmpdir=/var/folders/sr/gzlz2wcs5tz1jns1j13m57jr0000gn/T
/bin/bash: -c: line 2: syntax error near unexpected token `init/naiveB_S1R2.csv'
/bin/bash: -c: line 2: ` done < init/naiveB_S1R1.csv init/naiveB_S1R2.csv init/naiveB_S1R3.csv'
[Wed Sep 1 15:28:27 2021]
Error in rule download:
jobid: 1
output: output/naiveB
shell:
while read srr name endtype; do
fastq-dump --split-files --gzip $srr --outdir output/naiveB
done < init/naiveB_S1R1.csv init/naiveB_S1R2.csv init/naiveB_S1R3.csv
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /Users/joshl/PycharmProjects/FastqToGeneCounts/exampleSnakefile/.snakemake/log/2021-09-01T152827.234141.snakemake.log
I am getting an error in the execution of rule download because of the done < {input} portion. The entire input list is being used as the redirection source, as opposed to a single file. In an ideal execution, rule download would run three separate times, once for each input file.
A simple fix is to wrap the while ... done section in a for loop, but then I lose the ability to download multiple SRR files at the same time.
Does anyone know if this is possible?
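For reference, the for-loop workaround mentioned above does not have to give up parallelism: each file's while-loop can be backgrounded and collected with wait. A minimal sketch, where demo_init/demo_output and the echo stand-in for fastq-dump are assumptions for illustration:

```shell
# Sketch: iterate over the input files inside one shell block,
# backgrounding each while-loop so the downloads still overlap.
mkdir -p demo_init demo_output
printf 'SRR1 naiveB_S1R1 paired\n' > demo_init/naiveB_S1R1.csv
printf 'SRR2 naiveB_S1R2 paired\n' > demo_init/naiveB_S1R2.csv

for f in demo_init/*.csv; do    # in the rule this would be: for f in {input}; do
    (
        while read -r srr name endtype; do
            # the real rule would call: fastq-dump --split-files --gzip $srr --outdir {output}
            echo "download $srr" >> "demo_output/$(basename "$f").log"
        done < "$f"
    ) &                         # run each file's loop in the background
done
wait                            # block until every background loop finishes
```

Inside a Snakemake shell block this still counts as a single job, though, so the per-file processes are not managed by --cores.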
Upvotes: 1
Views: 363
Reputation: 6584
You cannot execute a single rule multiple times for the same output. In your rule download, the output depends only on tissue_name, not on tag, so Snakemake can only ever schedule one job for it.
You have a choice: either provide an output filename that depends on tag (like the filename you are downloading), or loop over the input files inside the rule itself.
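A minimal sketch of the first option, assuming a hypothetical per-tag directory layout output/{tissue_name}/{tag} (any output path that includes both wildcards would work):

```python
rule all:
    input:
        # request one directory per (tissue_name, tag) pair
        expand(os.path.join("output", "{tissue_name}", "{tag}"),
               zip, tissue_name=get_tissue_name(), tag=get_tag_data())

rule download:
    # one input file per job, so each download is scheduled separately
    input: "init/{tissue_name}_{tag}.csv"
    output: directory(os.path.join("output", "{tissue_name}", "{tag}"))
    shell:
        """
        while read srr name endtype; do
            fastq-dump --split-files --gzip $srr --outdir {output}
        done < {input}
        """
```

Run with --cores 3 (or more) and Snakemake can execute the three download jobs concurrently.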
Upvotes: 2