Reputation: 447
I am pretty new to using Snakemake and I have looked around on SO to see if there is a solution for the below - I am almost very close to a solution, but not there yet.
I have a single column file containing a list of SRA ids and I want to use snakemake to define my rules such that every SRA id from that file becomes a parameter on command line.
#FileName = Samples.txt
Samples
SRR5597645
SRR5597646
SRR5597647
Snakefile below:
from pathlib import Path
shell.executable("bash")
import pandas as pd
import os
import glob
import shutil
configfile: "config.json"
data_dir=os.getcwd()
units_table = pd.read_table("Samples.txt")
samples= list(units_table.Samples.unique())
#print(samples)
rule all:
input:
expand("out/{sample}.fastq.gz",sample=samples)
rule clean:
shell: "rm -rf .snakemake/"
include: 'rules/download_sample.smk'
download_sample.smk
rule download_sample:
"""
Download RNA-Seq data from SRA.
"""
input: "{sample}"
output: expand("out/{sample}.fastq.gz", sample=samples)
params:
outdir = "out",
threads = 16
priority:85
shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir} --gzip "
I have tried many different variants of the above code, but somewhere I am getting it wrong.
What I want: For every record in the file Samples.txt, I want the parallel-fastq-dump command to run. Since I have 3 records in Samples.txt, I would like these 3 commands to get executed
parallel-fastq-dump --sra-id SRR5597645 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597646 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597647 --threads 16 --outdir out --gzip
This is the error I get
snakemake -np
WildcardError in line 1 of rules/download_sample.smk:
Wildcards in input files cannot be determined from output files:
'sample'
Thanks in advance
Upvotes: 4
Views: 1605
Reputation: 9062
It seems to me that what you need is to access the sample
wildcard using the wildcards
object:
rule all:
input: expand("out/{sample}_fastq.gz", sample=samples)
rule download_sample:
output:
"out/{sample}_fastq.gz"
params:
outdir = "out",
threads = 16
priority:85
shell:"parallel-fastq-dump --sra-id {wildcards.sample} --threads {params.threads} --outdir {params.outdir} --gzip "
Upvotes: 4
Reputation: 6600
The first solution could be to use the run:
section of the rule instead of the shell:
. This allows you to employ python code:
rule download_sample:
# ...
run:
for input_file in input:
shell(f"parallel-fastq-dump --sra-id {input_file} --threads {params.threads} --outdir {params.outdir} --gzip")
This straightforward solution however is not idiomatic. From what I can see, you have a one-to-one relationship between input samples and output files. In other words to produce one out/{sample}_fastq.gz
file you need a single {sample}
. The best solution would be to reduce your rule to the one that makes a single file:
rule download_sample:
input: "{sample}"
output: "out/{sample}_fastq.gz"
params:
outdir = "out",
threads = 16
priority:85
shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir} --gzip "
The rule all:
now requires all targets; the rule download_sample
downloads a single sample, the Snakemake workflow does the rest: it constructs a graph of dependences and creates one instance of the rule download_sample
per sample. Moreover, if you wish it can run these rules in parallel.
Upvotes: 1