Anna
Anna

Reputation: 1

Snakemake SyntaxError when using a directory as output

I would like to run a command (chipseq-greylist) which outputs three files each time it is run with one input file. The names of the output files are automatically selected by the command. An example is:

chipseq-greylist --outdir out_dir A.bam

This line will produce three output files: A-greystats.csv, A-greydepth.tsv and A-grey.bed. I am interested in collecting all *-grey.bed files into one directory to use later on.

Since this is part of a pipeline that I use on many files, I am using Snakemake to process all these jobs. I understand that it is possible to specify a directory as output (https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#directories-as-outputs) and this will perfectly suit my requirements. However, when I made the rule with a directory as output I get a SyntaxError.

from os.path import join

# Globals ---------------------------------------------------------------------

DIR = 'bowtie2/'

# A Snakemake regular expression matching BAM files.
SAMPLES, = glob_wildcards(join(DIR, '{sample,SRR\d+}_sorted_filtered.bam'))

# Rules -----------------------------------------------------------------------

rule all:
    input:
        expand("test/{sample}_sorted_filtered.bam.bai", sample=SAMPLES)

rule samtools_index:
    input:
        "bowtie2/{sample}_sorted_filtered.bam"
    output:
        "test/{sample}_sorted_filtered.bam.bai"
    log:
        "log/test/{sample}.log"
    shell:
        "samtools index {input} &> {log} > {output}"

rule greylist_call:
    input:
        "bowtie2/{sample}_sorted_filtered.bam"
    output:
        directory("greylist")
    log:
        "log/greylist/{sample}.log"
    shell:
        "chipseq-greylist --outdir {output} {input} &> {log}"

Note that in my example I had to add another step to include a target file for the rule all. This is another problem I find with the use of directory() since I cannot find a way to specify a target for rule all.

The error I get is:

SyntaxError: Not all output, log and benchmark files of rule greylist_call contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file. File "Snakefile", line 34, in <module>

I can get rid of the error message if I add wildcards to the output directory name in rule greylist_call.

from os.path import join

# Globals ---------------------------------------------------------------------

DIR = 'bowtie2/'

# A Snakemake regular expression matching the forward mate FASTQ files.
SAMPLES, = glob_wildcards(join(DIR, '{sample,SRR\d+}_sorted_filtered.bam'))

# Rules -----------------------------------------------------------------------

rule all:
    input:
        expand("test/{sample}_sorted_filtered.bam.bai", sample=SAMPLES)

rule samtools_index:
    input:
        "bowtie2/{sample}_sorted_filtered.bam"
    output:
        "test/{sample}_sorted_filtered.bam.bai"
    log:
        "log/test/{sample}.log"
    shell:
        "samtools index {input} &> {log} > {output}"

rule greylist_call:
    input:
        "bowtie2/{sample}_sorted_filtered.bam"
    output:
        directory("greylist_{sample}")
    log:
        "log/greylist/{sample}.log"
    shell:
        "chipseq-greylist --outdir {output} {input} &> {log}"

However, I feel this defeats the purpose of having all the files in the same directly and although I could potentially work around that problem, I still cannot find a way to force snakemake to run this rule as I am unable to specify a target file in rule all.

I would appreciate any help regarding the use of directory in snakemake or any other suggestions of how to improve this set of rules.

Thanks Anna

Upvotes: 0

Views: 2209

Answers (1)

Eric C.
Eric C.

Reputation: 3368

I do not understand why you define a folder as an output. There is no way snakemake can determine that your shell command worked as expected if you do not give the real output files of the command. Thus, if you want to produce all the different sample files in the same directory, this can be done simply with the help of params :

rule greylist_call:
    input:
        "bowtie2/{sample}_sorted_filtered.bam"
    output:
        "output_dir/{sample}-greystats.csv",
        "output_dir/{sample}-greydepth.tsv",
        "output_dir/{sample}-grey.bed"
    params:
        outputDir = "output_dir"
    log:
        "log/greylist/{sample}.log"
    shell:
        "chipseq-greylist --outdir {params.outputDir} {input} &> {log}"

and in your rule all, you add:

expand("output_dir/{sample}-grey.bed", sample=SAMPLES)

This will produce all files in the same directory.

If you want to collect all bed files in this directory afterwards in another rule, you'll define an input like:

input:
    expand("output_dir/{sample}-grey.bed",sample=SAMPLES)

To explain the error you get: Wildcards are defined from the outputs in a rule. If you ask for file xxx.bed in rule all, snakemake will find the rule that has an output either like {wildcard}.bed or the file xxx.bed itself. Once the rule found, snakemake will then be able to apply the wildcards to the input, log, params etc... To summarize, you cannot have a wildcard in input, log or params (or even shell commands) if it is not defined in the output. This is why it worked with your rule that had directory("greylist_{sample}") as output eventhough the rule did not work as you wanted!

Upvotes: 1

Related Questions