Reputation: 5
I have a problem with deriving variables from input file names - especially if you want to split based on a delimiter. I have tried different approaches (which I can't get to work) and the only one that works so far, fails in the end, because its looking for all possible variations of variables (and hence input files that don't exist).
My problem - the input files are named in the following pattern:
18AR1376_S57_R2_001.fastq.gz
My initial definition of variables at the beginning:
SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz")
but that ends up with my files all being named 18AR1376_S57
subsequently and I'd like to remove _S57 (which refers to the sample sheet id).
An approach that I have found while searching and which works is this:
SAMPLES,SHEETID, = glob_wildcards("../run_links/{sample}_{SHEETID}_R1.001.fastq.gz"}
but it looks for all possible combinations of sample and sheetid, and hence looks for input files that don't exist.
I then tried a more basic python approach:
SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz")
ID,=expand([{sample}.split("_")[0]], sample=SAMPLES)``
but that doesn't work at all
and then I tried keeping my original wildcard glob
SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz")
but define new outputfile names (based on instructions I found in another forum) - but that gives me a syntax error that I can't figure out.
rule trim:
input:
R1 = "../run_links/{sample}_R1_001.fastq.gz",
R2 = "../run_links/{sample}_R2_001.fastq.gz"
params:
prefix=lambda wildcards, output:
output:
R1_Pair = ["./output/trimmed/{}_R1_trim_PE.fastq".format({sample}.split("_")[0])], #or the below version
R1_Sing = ["./output/trimmed/{}_R1_trim_SE.fastq".format(a) for a in {sample}.split("_")],
R2_Pair = ["./output/trimmed/{}_R2_trim_PE.fastq".format(a) for a in {sample}.split("_")],
R2_Sing = ["./output/trimmed/{}_R2_trim_SE.fastq".format(a) for a in {sample}.split("_")]
resources:
cpus=8
log:
"../logs/trim_{sample}.log"
conda:
"envs/trim.yaml"
shell:
"""
trimmomatic PE -trimlog {log} -threads {resources.cpus} {input.R1} {input.R2} {output.R1_Pair} {output.R1_Sing} {output.R2_Pair} {output.R2_Sing} HEADCROP:10 ILLUMINACLIP:scripts/Truseq-Adapter.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50
"""
So, two options,
_
of {sample}Does anybody have tips on how to approach this? Thanks
Thanks!
Upvotes: 0
Views: 396
Reputation: 3368
Instead of using glob_wildcards
, I would use a simple python dictionary to define your sample names attached to the fastq files:
import os
import re
d = dict()
fastqPath = "."
for fastqF in [f for f in os.listdir(fastqPath) if(re.match(r"^[\w-]+_R1_001\.fastq\.gz", f))]:
s = re.search(r"(^[\w-]+)_(S\d+)_R1_(001.fastq.gz)", fastqF)
samplename = s.group(1)
fastqFfile = os.path.join(fastqPath, fastqF)
fastqRfile = os.path.join(fastqPath, s.group(1) + "_" + s.group(2) + "_R2_" + s.group(3))
if(os.path.isfile(fastqRfile)):
d[samplename] = {"read1":os.path.abspath(fastqFfile),"read2":os.path.abspath(fastqRfile)}
The fastq input files are then really easy to use:
rule all:
input:
expand("output/trimmed/{sample}_R1_trim_PE.fastq",sample=d)
rule trim:
input:
R1 = lambda wildcards: d[wildcards.sample]["read1"],
R2 = lambda wildcards: d[wildcards.sample]["read2"]
output:
R1_Pair="output/trimmed/{sample}_R1_trim_PE.fastq",
R1_Sing="output/trimmed/{sample}_R1_trim_SE.fastq",
R2_Pair="output/trimmed/{sample}_R2_trim_PE.fastq",
R2_Sing="output/trimmed/{sample}_R2_trim_SE.fastq"
resources:
cpus=8
log:
"../logs/trim_{sample}.log"
conda:
"envs/trim.yaml"
shell:
"""
trimmomatic PE -trimlog {log} -threads {resources.cpus} {input.R1} {input.R2} {output.R1_Pair} {output.R1_Sing} {output.R2_Pair} {output.R2_Sing} HEADCROP:10 I>
"""
N.B: I removed the unused params
section in your rule trim
.
Upvotes: 1