ch_esr
ch_esr

Reputation: 5

snakemake derive multiple variables from input file names

I have a problem with deriving variables from input file names - especially if you want to split based on a delimiter. I have tried different approaches (which I can't get to work) and the only one that works so far, fails in the end, because its looking for all possible variations of variables (and hence input files that don't exist).

My problem - the input files are named in the following pattern: 18AR1376_S57_R2_001.fastq.gz

My initial definition of variables at the beginning:
SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz")

but that ends up with my files all being named 18AR1376_S57 subsequently and I'd like to remove _S57 (which refers to the sample sheet id).

An approach that I have found while searching and which works is this:
SAMPLES,SHEETID, = glob_wildcards("../run_links/{sample}_{SHEETID}_R1.001.fastq.gz"}
but it looks for all possible combinations of sample and sheetid, and hence looks for input files that don't exist.

I then tried a more basic python approach:

SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz")
ID,=expand([{sample}.split("_")[0]], sample=SAMPLES)``

but that doesn't work at all

and then I tried keeping my original wildcard glob SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz") but define new outputfile names (based on instructions I found in another forum) - but that gives me a syntax error that I can't figure out.

rule trim:
    input:
        R1 = "../run_links/{sample}_R1_001.fastq.gz", 
        R2 = "../run_links/{sample}_R2_001.fastq.gz"
    params:
        prefix=lambda wildcards, output: 
    output:
        R1_Pair = ["./output/trimmed/{}_R1_trim_PE.fastq".format({sample}.split("_")[0])],  #or the below version
        R1_Sing = ["./output/trimmed/{}_R1_trim_SE.fastq".format(a) for a in {sample}.split("_")],
        R2_Pair = ["./output/trimmed/{}_R2_trim_PE.fastq".format(a) for a in {sample}.split("_")],
        R2_Sing = ["./output/trimmed/{}_R2_trim_SE.fastq".format(a) for a in {sample}.split("_")]
    resources:
        cpus=8
    log:
        "../logs/trim_{sample}.log"
    conda:
        "envs/trim.yaml"
    shell:
        """
        trimmomatic PE -trimlog {log} -threads {resources.cpus} {input.R1} {input.R2} {output.R1_Pair} {output.R1_Sing} {output.R2_Pair} {output.R2_Sing} HEADCROP:10 ILLUMINACLIP:scripts/Truseq-Adapter.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50
        """

So, two options,

  1. beginning of workflow: split the sample into sample and sample sheet id
  2. define new output names and use split on delimiter _ of {sample}

Does anybody have tips on how to approach this? Thanks

Thanks!

Upvotes: 0

Views: 396

Answers (1)

Eric C.
Eric C.

Reputation: 3368

Instead of using glob_wildcards, I would use a simple python dictionary to define your sample names attached to the fastq files:

import os
import re

d = dict()

fastqPath = "."
for fastqF in [f for f in os.listdir(fastqPath) if(re.match(r"^[\w-]+_R1_001\.fastq\.gz", f))]:
        s = re.search(r"(^[\w-]+)_(S\d+)_R1_(001.fastq.gz)", fastqF)
        samplename = s.group(1)
        fastqFfile = os.path.join(fastqPath, fastqF)
        fastqRfile = os.path.join(fastqPath, s.group(1) + "_" + s.group(2) + "_R2_" + s.group(3))
        if(os.path.isfile(fastqRfile)):
                d[samplename] = {"read1":os.path.abspath(fastqFfile),"read2":os.path.abspath(fastqRfile)}

The fastq input files are then really easy to use:

rule all:
        input:
                expand("output/trimmed/{sample}_R1_trim_PE.fastq",sample=d)

rule trim:
        input:
                R1 = lambda wildcards: d[wildcards.sample]["read1"],
                R2 = lambda wildcards: d[wildcards.sample]["read2"]
        output:
                R1_Pair="output/trimmed/{sample}_R1_trim_PE.fastq",
                R1_Sing="output/trimmed/{sample}_R1_trim_SE.fastq",
                R2_Pair="output/trimmed/{sample}_R2_trim_PE.fastq",
                R2_Sing="output/trimmed/{sample}_R2_trim_SE.fastq"
        resources:
                cpus=8
        log:
                "../logs/trim_{sample}.log"
        conda:
                "envs/trim.yaml"
        shell:
                """
                trimmomatic PE -trimlog {log} -threads {resources.cpus} {input.R1} {input.R2} {output.R1_Pair} {output.R1_Sing} {output.R2_Pair} {output.R2_Sing} HEADCROP:10 I>
                """

N.B: I removed the unused params section in your rule trim.

Upvotes: 1

Related Questions