Syntax for merging technical replicates in Snakefile and config file

Question

How can I specify inputs for technical replicates in the snakemake config file, where the goal is aligning paired end reads to a reference genome, and then merging alignment files of replicates into a single file for each individual? My files are named according to the pattern raw-fastq/sampleX_groupX_RX.fq.gz where each sample has multiple groups, and each group has R1 and R2. It sounds like the solution might be to write an input function that creates the list of proper inputs based on the config file, but it doesn't seem like there is a standard way of doing this, and being quite new to python I am not sure how to approach this. In my case there are exactly two groups for each sample, so I tried to use this simple config.yaml:

samples:
  - sample1
  - sample2

groups:
  - group1
  - group2

Here is the workflow I would like to use, or something similar:

configfile: "config.yaml"

rule all:
    input: 
        expand("merged/{sample}.bam", sample=config["samples"])

rule cutadapt:
    input:
        expand(['raw-fastq/{sample}_{group}_R1.fq.gz', 'raw-fastq/{sample}_{group}_R2.fq.gz'], sample=config["samples"], group=config["groups"])
    output:
        read1 = temp("trimmed_reads/{sample}_{group}_R1.fq.gz"),
        read2 = temp("trimmed_reads/{sample}_{group}_R2.fq.gz")
    shell:
        "cutadapt -e 0.2 -O 5 -a AAGTCGGX -A AAGTCGGX -o {output.read1} -p {output.read2} {input.read1} {input.read2}"

rule bwa_mem:
    input:
        expand(['trimmed_reads/{sample}_{group}_R1.fq.gz', 'trimmed_reads/{sample}_{group}_R2.fq.gz'], sample=config["samples"], group=config["groups"])
    output:
        temp("mapped/{sample}_{group}.unsorted.sam")
    params:
        genome = "/path-to-genome"
    log:
        "mapped/log/{sample}_{group}_bwa_mem.log"
    benchmark:
        "benchmarks/bwa_mem/{sample}_{group}.txt"
    threads: 8
    shell:
        "bwa mem -R '@RG	ID:{wildcards.sample}_{wildcards.group}	SM:{wildcards.sample}' -t {threads} {params.genome} {input} 2> {log} > {output}"

rule samtools_sort:
    input: 
        expand("mapped/{sample}_{group}.unsorted.bam", sample=config["samples"], group=config["groups"]
    output:
        "mapped/{sample}_{group}.sorted.bam"
    shell:
        "samtools sort {input} > {output}"

rule samtools_merge:
    input: 
        expand(['mapped/{sample}_group1.sorted.bam', 'mapped/{sample}_group2.sorted.bam'], sample=config["samples"])
    output:
        protected("merged/{sample}.merged.bam")
    shell:
        "samtools merge {output} {input}"

I would like these rules to generate, for example, the single file merged/sample1.bam from the four input files

raw-fastq/sample1_group1_R1.fq
raw-fastq/sample1_group1_R2.fq
raw-fastq/sample1_group2_R1.fq
raw-fastq/sample1_group2_R2.fq

and likewise for each sample.

Running snakemake -np doesn't throw any errors, but I can tell there is a problem with the way I'm trying to use the expand function: the input is taken to a list of all the possible combinations, rather than iterating through the list of samples and groups. For example, the DAG looks like:

so I can tell there is a problem with the way I'm trying to specify the inputs , but I don't know the proper way of doing this. I would like to know (1) what the proper syntax is for specifying the paired file input in which there are multiple wildcards per filename, in both the config.yaml file and Snakefile, and (b) how to deal with the situation of merging multiple inputs from replicates into a single output file during the samtools_merge rule.

Syntax for merging technical replicates in Snakefile and config file

Answers (1)

Related Questions