JunhuiLi
JunhuiLi

Reputation: 107

use directories or all files in directories as input in snakemake

I am new to snakemake. I want to use directories or all files in directories as input in snakemake. For example, two directories with different no. of bam files,

--M1
    M1-1.bam
    M1-2.bam
--M2
    M2-3.bam
    M2-5.bam

I just want to merge M1-1.bam, M1-2.bam to M1.bam; M2-3.bam, M2-5.bam to M2.bam; I tried to use wildcards and expand followd by this and this, and the code is as follows,

config.yaml

SAMPLES:
  M1:
    - 1
    - 2
  M2:
    - 3
    - 5
rawdata: path/to/rawdata
outpath: path/to/output
reference: path/to/reference

snakemake file

configfile:"config.yaml"
SAMPLES=config["SAMPLES"]
REFERENCE=config["reference"]
RAWDATA=config["rawdata"]
OUTPATH=config["outpath"]

ALL_INPUT = []
for key, values in SAMPLES.items():
    ALL_INPUT.append(f"Map/bwa/merge/{key}.bam")
    ALL_INPUT.append(f"Map/bwa/sort/{key}.sort.bam")
    ALL_INPUT.append(f"Map/bwa/dup/{key}.sort.rmdup.bam")
    ALL_INPUT.append(f"Map/bwa/dup/{key}.sort.rmdup.matrix")
    ALL_INPUT.append(f"SNV/Mutect2/result/{key}.vcf.gz")
    ALL_INPUT.append(f"Map/bwa/result/{key}")
    for value in values:
        ALL_INPUT.append(f"Map/bwa/result/{key}/{key}-{value}.bam")
        for num in {1,2}:
            ALL_INPUT.append(f"QC/fastp/{key}/{key}-{value}.R{num}.fastq.gz")

rule all:
    input:
        expand("{outpath}/{all_input}",all_input=ALL_INPUT,outpath=OUTPATH)
        
rule fastp:
    input:
        r1= RAWDATA + "/{key}-{value}.R1.fastq.gz",
        r2= RAWDATA + "/{key}-{value}.R2.fastq.gz"
    output:
        a1="{outpath}/QC/fastp/{key}/{key}-{value}.R1.fastq.gz",
        a2="{outpath}/QC/fastp/{key}/{key}-{value}.R2.fastq.gz"
    params:
        prefix="{outpath}/QC/fastp/{key}/{key}-{value}"
    shell:
        """
        fastp -i {input.r1} -I {input.r2} -o {output.a1} -O {output.a2} -j {params.prefix}.json -h {params.prefix}.html
        """

rule bwa:
    input:
        a1="{outpath}/QC/fastp/{key}/{key}-{value}.R1.fastq.gz",
        a2="{outpath}/QC/fastp/{key}/{key}-{value}.R2.fastq.gz"
    output:
        o1="{outpath}/Map/bwa/result/{key}/{key}-{value}.bam"
    params:
        mem="4000",
        rg="@RG\\tID:{key}\\tPL:ILLUMINA\\tSM:{key}"
    shell:
        """
        bwa mem -t {threads} -M -R '{params.rg}' {REFERENCE} {input.a1} {input.a2} | samtools view -b -o {output.o1}
        """

## get sample index from raw fastq
key_ids,value_ids = glob_wildcards(RAWDATA + "/{key}-{value}.R1.fastq.gz")
# remove duplicate sample name, and this is useful when there is only one sample input
key_ids = list(set(key_ids))

rule merge:
    input:
        expand("{outpath}/Map/bwa/result/{key}/{key}-{value}.bam",outpath=OUTPATH, key=key_ids, value=value_ids)
    output:
        "{outpath}/Map/bwa/merge/{key}.bam"
    shell:
        """
        samtools merge {output} {input}
        """

The {input} in merge command will be

M1-1.bam M1-2.bam M1-3.bam M1-5.bam M2-1.bam M2-2.bam M2-3.bam M2-5.bam

Actually, for M1 sample, the {input} should be M1-1.bam M1-2.bam; for M2, M2-3.bam M2-5.bam. I also read this, but I have no idea if there are lots of directories with different files each.

Then I tried to use directories as input, for merge rule

rule mergebam:
    input:
        "{outpath}/Map/bwa/result/{key}"
    output:
        "{outpath}/Map/bwa/merge/{key}.bam"
    log:
        "{outpath}/Map/bwa/log/{key}.merge.bam.log"
    shell:
        """
        samtools merge {output} `ls {input}/*bam` > {log} 2>&1
        """

But this give me MissingInputException error

Missing input files for rule merge:
/{outpath}/Map/bwa/result/M1

Any idea will be appreciated.

Upvotes: 1

Views: 1742

Answers (2)

Rui Li
Rui Li

Reputation: 41

if you change your config.yaml to the following, can you make the implementation easier by using expand?

SAMPLES:
  M1:
    - M1-1
    - M2-2
  M2:
    - M2-3
    - M2-5

Upvotes: 2

dariober
dariober

Reputation: 9062

I haven't fully parsed your question but I'll give it a shot anyway... In rule merge you have:

expand("{outpath}/Map/bwa/result/{key}/{key}-{value}.bam",outpath=OUTPATH, key=key_ids, value=value_ids)

This means that you collect all combinations of outpath, key and value.

Presumably you want all combinations of value within each outpath and key. So use:

expand("{{outpath}}/Map/bwa/result/{{key}}/{{key}}-{value}.bam",  value=value_ids)

Upvotes: 2

Related Questions