Reputation: 165
How can I specify inputs for technical replicates in the snakemake config file, where the goal is aligning paired end reads to a reference genome, and then merging alignment files of replicates into a single file for each individual? My files are named according to the pattern raw-fastq/sampleX_groupX_RX.fq.gz
where each sample has multiple groups, and each group has R1 and R2. It sounds like the solution might be to write an input function that creates the list of proper inputs based on the config file, but it doesn't seem like there is a standard way of doing this, and being quite new to python I am not sure how to approach this. In my case there are exactly two groups for each sample, so I tried to use this simple config.yaml
:
samples:
- sample1
- sample2
groups:
- group1
- group2
Here is the workflow I would like to use, or something similar:
configfile: "config.yaml"
rule all:
input:
expand("merged/{sample}.bam", sample=config["samples"])
rule cutadapt:
input:
expand(['raw-fastq/{sample}_{group}_R1.fq.gz', 'raw-fastq/{sample}_{group}_R2.fq.gz'], sample=config["samples"], group=config["groups"])
output:
read1 = temp("trimmed_reads/{sample}_{group}_R1.fq.gz"),
read2 = temp("trimmed_reads/{sample}_{group}_R2.fq.gz")
shell:
"cutadapt -e 0.2 -O 5 -a AAGTCGGX -A AAGTCGGX -o {output.read1} -p {output.read2} {input.read1} {input.read2}"
rule bwa_mem:
input:
expand(['trimmed_reads/{sample}_{group}_R1.fq.gz', 'trimmed_reads/{sample}_{group}_R2.fq.gz'], sample=config["samples"], group=config["groups"])
output:
temp("mapped/{sample}_{group}.unsorted.sam")
params:
genome = "/path-to-genome"
log:
"mapped/log/{sample}_{group}_bwa_mem.log"
benchmark:
"benchmarks/bwa_mem/{sample}_{group}.txt"
threads: 8
shell:
"bwa mem -R '@RG\tID:{wildcards.sample}_{wildcards.group}\tSM:{wildcards.sample}' -t {threads} {params.genome} {input} 2> {log} > {output}"
rule samtools_sort:
input:
expand("mapped/{sample}_{group}.unsorted.bam", sample=config["samples"], group=config["groups"]
output:
"mapped/{sample}_{group}.sorted.bam"
shell:
"samtools sort {input} > {output}"
rule samtools_merge:
input:
expand(['mapped/{sample}_group1.sorted.bam', 'mapped/{sample}_group2.sorted.bam'], sample=config["samples"])
output:
protected("merged/{sample}.merged.bam")
shell:
"samtools merge {output} {input}"
I would like these rules to generate, for example, the single file merged/sample1.bam
from the four input files
raw-fastq/sample1_group1_R1.fq
raw-fastq/sample1_group1_R2.fq
raw-fastq/sample1_group2_R1.fq
raw-fastq/sample1_group2_R2.fq
and likewise for each sample.
Running snakemake -np
doesn't throw any errors, but I can tell there is a problem with the way I'm trying to use the expand function: the input is taken to a list of all the possible combinations, rather than iterating through the list of samples and groups. For example, the DAG looks like:
so I can tell there is a problem with the way I'm trying to specify the inputs , but I don't know the proper way of doing this. I would like to know (1) what the proper syntax is for specifying the paired file input in which there are multiple wildcards per filename, in both the config.yaml
file and Snakefile
, and (b) how to deal with the situation of merging multiple inputs from replicates into a single output file during the samtools_merge
rule.
Upvotes: 2
Views: 531
Reputation: 4089
It appears that your yaml
config file is not in proper lists format. They need to be hyphenated. Otherwise, based on your example, it looks like it gets read as a string.
samples:
- sample1
- sample2
groups:
- group1
- group2
In your rule all
, wildcard {sample}
is not defined. What is it supposed to be? Also, to further clarify your question, can you provide an example - what are the expected output filenames for the samples and groups in your config file?
Updated answer:
Problem here is because each rule was using more files than were needed. For example, rule cutadapt
needs to act on only one sample and one group at a time; not all samples and groups. I would recommend walking through snakemake tutorial example for better understanding.
Here is modified code. I haven't tried executing it, but I think this would work.
configfile: "config.yaml"
rule all:
input:
expand("merged/{sample}.bam", sample=config["samples"])
rule cutadapt:
input:
expand('raw-fastq/{{sample}}_{{group}}_R{read}.fq.gz',
read=[1,2])
output:
read1 = temp("trimmed_reads/{sample}_{group}_R1.fq.gz"),
read2 = temp("trimmed_reads/{sample}_{group}_R2.fq.gz")
message:
"Trimming adapters - {wildcards.sample}, {wildcards.group}"
shell:
"cutadapt -e 0.2 -O 5 -a AAGTCGGX -A AAGTCGGX -o {output.read1} -p {output.read2} {input.read1} {input.read2}"
rule bwa_mem:
input:
expand(['trimmed_reads/{{sample}}_{{group}}_R{read}.fq.gz',
read=[1,2])
output:
temp("mapped/{sample}_{group}.unsorted.sam")
params:
genome = "/path-to-genome"
log:
"mapped/log/{sample}_{group}_bwa_mem.log"
benchmark:
"benchmarks/bwa_mem/{sample}_{group}.txt"
threads: 8
shell:
"bwa mem -R '@RG\tID:{wildcards.sample}_{wildcards.group}\tSM:{wildcards.sample}' -t {threads} {params.genome} {input} 2> {log} > {output}"
rule samtools_sort:
input:
"mapped/{sample}_{group}.unsorted.bam"
output:
"mapped/{sample}_{group}.sorted.bam"
shell:
"samtools sort {input} > {output}"
rule samtools_merge:
input:
expand('mapped/{{sample}}_{group}.sorted.bam', group=config['groups'])
output:
protected("merged/{sample}.merged.bam")
shell:
"samtools merge {output} {input}"
Upvotes: 1