Mike D
Mike D

Reputation: 747

Input Wildcard Constraints Snakemake

I am new to snakemake, and I am using it to merge steps from two pipelines into a single larger pipeline. The issue that several steps create similarly named files, and I cannot find a way to limit the wildcards, so I am getting a Missing input files for rule error on one step that I just cannot resolve.

The full snakefile is long, and is available here: https://mdd.li/snakefile

The relevant sections are (sections of the file are missing below):

wildcard_constraints:
    sdir="[^/]+",
    name="[^/]+",
    prefix="[^/]+"

# Several mapping rules here

rule find_intersecting_snps:
    input:
        bam="hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.bam"
    params:
        hornet=os.path.expanduser(config['hornet']),
        snps=config['snps']
    output:
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.remap.fq1.gz",
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.remap.fq2.gz",
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.keep.bam",
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.to.remap.bam",
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.to.remap.num.gz"
    shell:
        dedent(
            """\
            python2 {params.hornet}/find_intersecting_snps.py \
            -p {input.bam} {params.snps}
            """
        )

# Several remapping steps, similar to the first mapping steps, but in a different directory

rule wasp_merge:
    input:
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.keep.bam",
        "hic_mapping/wasp_results/{sdir}_{prefix}_filt_hg19.remap.kept.bam"
    output:
        "hic_mapping/wasp_results/{sdir}_{prefix}_filt_hg19.bwt2pairs.filt.bam"
    params:
        threads=config['job_cores']
    shell:
        dedent(
            """\
            {module}
            module load samtools
            samtools merge --threads {params.threads} {output} {input}
            """
        )

# All future steps use the name style wildcard, like below

rule move_results:
    input:
        "hic_mapping/wasp_results/{name}_filt_hg19.bwt2pairs.filt.bam"
    output:
        "hic_mapping/wasp_results/{name}_filt_hg19.bwt2pairs.bam"
    shell:
        dedent(
            """\
            mv {input} {output}
            """
        )

This pipeline is essentially doing some mapping steps in one directory structure that looks like hic_mapping/bowtie_results/bwt2/<subdir>/<file>, (where subdir is three different directories) then filtering the results, and doing another almost identical mapping step in hic_remap/bowtie_results/bwt2/<subdir>/<file>, before merging the results into an entirely new directory and collapsing the subdirectories into the file name: hic_mapping/wasp_results/<subdir>_<file>.

The problem I have is that the wasp_merge step breaks the find_intersecting_snps step if I collapse the subdirectory name into the filename. If I just maintain the subdirectory structure, everything works fine. Doing this would break future steps of the pipeline though.

The error I get is:

MissingInputException in line 243 of /godot/quertermous/PooledHiChip/pipeline/Snakefile:
Missing input files for rule find_intersecting_snps:
hic_mapping/bowtie_results/bwt2/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006/001_hg19.bwt2pairs.bam

The correct file is: hic_mapping/bowtie_results/bwt2/HCASMC5-8/HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001_hg19.bwt2pairs.bam

But it is looking for: hic_mapping/bowtie_results/bwt2/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006/001_hg19.bwt2pairs.bam

Which is not created anywhere, nor defined by any rule. I think it is somehow getting confused by the existence of the file created by the wasp_merge step: hic_mapping/wasp_results/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001_filt_hg19.bwt2pairs.filt.bam

Or possibly a downstream file (after the target that creates this error): hic_mapping/wasp_results/HCASMC5-8_HCASMC-8-CAGAGAGG-TACTCCTT_S8_L006_001_filt_hg19.bwt2pairs.bam

However, I have no idea why either of those files would confuse the find_intersecting_snps rule, because the directory structures are totally different.

I feel like I must be missing something obvious, because this error is so absurd, but I cannot figure out what it is.

Upvotes: 2

Views: 2618

Answers (1)

Mike D
Mike D

Reputation: 747

The problem is that both the directory name and the file name contain underscores, and in the final file name I separate the two components by underscores.

By either changing that separation character, or replacing the rule with a python function that get the names from elsewhere, I can solve the issue.

This works:

rule wasp_merge:
    input:
        "hic_mapping/bowtie_results/bwt2/{sdir}/{prefix}_hg19.bwt2pairs.keep.bam",
        "hic_mapping/wasp_results/{sdir}{prefix}_filt_hg19.remap.kept.bam"
    output:
        "hic_mapping/wasp_results/{sdir}{prefix}_filt_hg19.bwt2pairs.filt.bam"
    params:
        threads=config['job_cores']
    shell:
        dedent(
            """\
            {module}
            module load samtools
            samtools merge --threads {params.threads} {output} {input}
            """
        )

Upvotes: 2

Related Questions