user10101904

Reputation: 447

Snakemake - How to use every line of input file as wildcard

I am pretty new to Snakemake and I have looked around on SO to see if there is a solution for the problem below. I am very close to a solution, but not quite there yet.

I have a single-column file containing a list of SRA ids, and I want to use Snakemake to define my rules such that every SRA id from that file becomes a parameter on the command line.

#FileName = Samples.txt
Samples
SRR5597645
SRR5597646
SRR5597647

Snakefile below:

from pathlib import Path
shell.executable("bash")
import pandas as pd
import os
import glob
import shutil

configfile: "config.json"

data_dir=os.getcwd()

units_table = pd.read_table("Samples.txt")
samples= list(units_table.Samples.unique())

#print(samples)

rule all:
    input:
           expand("out/{sample}.fastq.gz",sample=samples)

rule clean:
    shell: "rm -rf .snakemake/"

include: 'rules/download_sample.smk'

rules/download_sample.smk:

rule download_sample:
    """
    Download RNA-Seq data from SRA.
    """
    input: "{sample}"
    output: expand("out/{sample}.fastq.gz", sample=samples)
    params:
        outdir = "out",
        threads = 16
    priority:85
    shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir}  --gzip "

I have tried many different variants of the above code, but I am getting something wrong somewhere.

What I want: For every record in the file Samples.txt, I want the parallel-fastq-dump command to run. Since I have 3 records in Samples.txt, I would like these 3 commands to be executed:

parallel-fastq-dump --sra-id SRR5597645 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597646 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597647 --threads 16 --outdir out --gzip

This is the error I get:

snakemake -np
WildcardError in line 1 of rules/download_sample.smk:
Wildcards in input files cannot be determined from output files:
'sample'

Thanks in advance

Upvotes: 4

Views: 1605

Answers (2)

dariober

Reputation: 9062

It seems to me that what you need is to access the sample wildcard using the wildcards object:

rule all:
    input: expand("out/{sample}_fastq.gz", sample=samples)

rule download_sample:
    output: 
        "out/{sample}_fastq.gz"
    params:
        outdir = "out",
        threads = 16
    priority:85
    shell:"parallel-fastq-dump --sra-id {wildcards.sample} --threads {params.threads} --outdir {params.outdir}  --gzip "

Upvotes: 4

Dmitry Kuzminov

Reputation: 6600

The first solution could be to use the run: section of the rule instead of shell:. This allows you to employ Python code:

rule download_sample:
    # ...
    run:
        for input_file in input:
            shell(f"parallel-fastq-dump --sra-id {input_file} --threads {params.threads} --outdir {params.outdir} --gzip")

This straightforward solution, however, is not idiomatic. From what I can see, you have a one-to-one relationship between input samples and output files. In other words, to produce one out/{sample}_fastq.gz file you need a single {sample}. The best solution would be to reduce your rule to one that makes a single file:

rule download_sample:
    output: "out/{sample}_fastq.gz"
    params:
        outdir = "out",
        threads = 16
    priority: 85
    shell: "parallel-fastq-dump --sra-id {wildcards.sample} --threads {params.threads} --outdir {params.outdir} --gzip"

The rule all now requires all targets; the rule download_sample downloads a single sample, and Snakemake does the rest: it constructs a graph of dependencies and creates one instance of the rule download_sample per sample. Moreover, if you wish, it can run these instances in parallel.
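
For example, since the rule above declares no threads: directive, each job counts as one core towards --cores, so something like this would let all three downloads run concurrently (a sketch; adjust the numbers to your machine):

# allow up to three download_sample jobs at once, printing each command
snakemake --cores 3 -p

If you want Snakemake to account for the 16 threads each download actually uses, you could declare them with the rule-level threads: directive instead of params and reference {threads} in the shell command.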

Upvotes: 1
