Franz
Franz

Reputation: 35

How to process spacific pairs of files with Snakemake?

I am a bit of a newbie with Snakemake and I am currently banging my head on how to process specific pairs of files with a Snakemake pipeline (apologies if the question is lame).
I have a set of fastq files from tumor and matched normal samples. Each tumor sample will have to be processed together with its matched normal one when performing variant calling etc. So, I have prepared a config file with samples listed as follows:

sample_list:
  - sample: 1
    tumor: AO1_04_RN_1_T_4_S4
    control: AO2_07_C007558T1Wa_S37
  - sample: 2
    tumor: AO2_01_C007589T1FTa_S2
    control: AO2_07_C007589T1Wa_S34
  - sample: 3
    tumor: AO9_09_FM_1_T_7_S13 
    control: AO2_07_C007558T1Wa_S37

The first issue I am encountering, however, is that these samples come with 4 fastq.gz files each, which I need to concatenate into 2 - for example, files for tumor sample AO1_04_RN_1_T_4_S4 are:

- AO1_04_RN_1_T_4_S4_L001_R1_001.fastq.gz
- AO1_04_RN_1_T_4_S4_L002_R1_001.fastq.gz
- AO1_04_RN_1_T_4_S4_L001_R2_001.fastq.gz
- AO1_04_RN_1_T_4_S4_L002_R2_001.fastq.gz

and I would need a first rule (say rule concat_fastq) to simply concatenate, for each sample, the R1 and the R2 files into two separate fastq AO1_04_RN_1_T_4_S4_R1.fastq.gz and AO1_04_RN_1_T_4_S4_R2.fastq.gz, as I will need to provide these as inputs for alignment.

At the moment, I am finding that specifying inputs like this:

r1 = expand("{path}/{sample}_L{num}_R1_001.fastq.gz",
        path = config["input_path"],
        num = ["001","002"],
    sample = [sample["tumor"] for sample in config["sample_list"]] + [sample["control"] for sample in config["sample_list"]]),
r2 = expand("{path}/{sample}_L{num}_R2_001.fastq.gz",
        path = config["input_path"],
        num = ["001","002"],
    sample = [sample["tumor"] for sample in config["sample_list"]] + [sample["control"] for sample in config["sample_list"]])

obviously concatenates together too many files each time, while something like this, without using expand:

r1 = ["{path}/{sample}_L001_R1_001.fastq.gz", "{path}/{sample}_L002_R1_001.fastq.gz"],
r2 = ["{path}/{sample}_L001_R2_001.fastq.gz", "{path}/{sample}_L002_R2_001.fastq.gz"]

results in the following error:

Wildcards in input files cannot be determined from output files:
'path'

as I am unable to specify what {path} is.

Can someone please provide some advice on how to deal with paired files using Snakemake? I also hope my config file is structured correctly to then process tumor-normal pairs.

Thanks very much for your help

Upvotes: 0

Views: 93

Answers (1)

Cade
Cade

Reputation: 256

Using the rule all to setup the sample wildcards simplifies this a bit. Then in concat_fastq we use expand again to get the two lanes for each read pair. I may also suggest using a CSV to specify samples and metadata, as a CSV is easier to parse (imo).

configfile: "config.yaml"


TUMORS = [d["tumor"] for d in config["sample_list"]]
CONTROLS = [d["control"] for d in config["sample_list"]]
SAMPLES = TUMORS + CONTROLS
READS_PATH = "data/reads"


rule all:
    input:
        reads=expand(
            "results/concat_fastq/{sample}_R{num}.fastq.gz", sample=SAMPLES, num=[1, 2]
        ),


rule concat_fastq:
    input:
        # Use {{ }} to escape sample wildcard in expand.
        r1=expand(READS_PATH + "{{sample}}_L00{lane}_R1.fastq.gz", lane=[1, 2]),
        r2=expand(READS_PATH + "{{sample}}_L00{lane}_R2.fastq.gz", lane=[1, 2]),
    output:
        r1="results/concat_fastq/{sample}_R1.fastq.gz",
        r2="results/concat_fastq/{sample}_R2.fastq.gz",
    shell:
        """
        cat {input.r1} > {output.r1}
        cat {input.r2} > {output.r2}
        """

Upvotes: 1

Related Questions