LucasCortes
LucasCortes

Reputation: 29

Snakemake pipeline not attempting to produce output?

I have a relatively simple snakemake pipeline but when run I get all missing files for rule all:

refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']

def getsamples():
    import glob
    test = (glob.glob("*.fastq"))
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)

def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)

rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"


rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"

The output files that are listed by the terminal seem correct, so it seems it is seeing what I want to produce but the shell is not making anything at all.

I want to produce a list of files that have grepped the list of barcodes I have in a file. But I get "Missing input files for rule all:"

Upvotes: 0

Views: 143

Answers (2)

dariober
dariober

Reputation: 9062

In addition to @euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This means disabling the regex matching altogether. In your case, I would do something like:

wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),

now {barcodes} is allowed to match only the values in getbarcodes(), whatever they are, and the same for {sample}. In my opinion this is better than anticipating what combination of regex a wildcard can take.

Upvotes: 0

euronion
euronion

Reputation: 1277

There are two issues:

  1. You have an impossible wildcard_constraints defined for {barcode}
  2. Your two wildcards {barcode} and {sample} are competing with each other.

Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:

wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",

The constraint for {barcodes} now only matches capital letters. Before it also included end-of-line matching (trailing $) which was impossible to match for this wildcard as you had additional text in the filepath following.

The constraint for {sample} ensures that the path of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Else you'd get something unwanted like barcode=ACGGTW instead of barcode=ACGGT.

A note of advice: I usually find it easier to seperate wildcards into directory structures rather than having multiple wildcards in the same filename. In you case that would mean having a structure like grepped/{barcode}/{sample}_R1_001.plate.fastq.

Full suggested Snakefile (formatted using snakefmt)

wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",


refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]


def getsamples():
    import glob

    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples


def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines


rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),


rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"

Upvotes: 3

Related Questions