how to escape MissingOutputException while running a for loop in a rule in snakemake

Question

I have the following snakefile:

import os

SAMPLES = [i.replace('sample_', '') for i in os.listdir('final_raw')]

rule all:
    input:
        expand('analysis/results/unmapped/hybrid/read1_{sample}_hy.fastq', sample = SAMPLES)
        
        
rule trimming:
    input: 
        read1 = 'final_raw/sample_{sample}/read1_sample_{sample}.fastq.gz',
        read2 = 'final_raw/sample_{sample}/read2_sample_{sample}.fastq.gz', 
        barcodes = 'final_bc.fasta'
    output:
        'analysis/trimmed_data/{sample}_read_1.fastq.gz',
        'analysis/trimmed_data/{sample}_read_2.fastq.gz'
    threads: 6
    shell:
        'flexbar -threads {threads} '
        '-r {input.read1} '
        '-p {input.read2} '
        '-aa Nextera -br {input.barcodes} '
        '-t analysis/trimmed_data/{wildcards.sample}_read '
        '-z GZ'
        
        
rule create_hybrid_genome:
    input:
        human = 'genomes/human/GRCh38.fna',
        covid = 'genomes/cov/cov_genome.fna'
    output:
        'genomes/hybrid/hybrid.fna'
    shell:
        'cat {input.human} {input.covid} > {output}'
        
        
rule hybrid_indexing:
    input:
        rules.create_hybrid_genome.output
    output:
        'genomes/hybrid/hybrid.fna.amb'
    shell:
        'bwa index {input}'
        
        
rule map_hybrid:
    input:
        read1 = 'analysis/trimmed_data/{sample}_read_1.fastq.gz',
        read2 = 'analysis/trimmed_data/{sample}_read_2.fastq.gz',
        genome = 'genomes/hybrid/hybrid.fna'
    output:
        'analysis/results/mapped/hybrid/{sample}.sorted.bam'
    shell:
        'bwa mem -t 8 {input.genome} {input.read1} {input.read2} | '
        'samtools sort -@ 2 -o {wildcards.sample}.sorted.bam && '
        'samtools index -@ 8 {wildcards.sample}.sorted.bam && '
        'mv {wildcards.sample}.sorted.* analysis/results/mapped/hybrid'
        
        
rule unmapped_hybrid:
    input: 
        rules.map_hybrid.output
    output:
        'analysis/results/unmapped/hybrid/um_{sample}.bam'
    shell:
        'samtools view -b -f 4 {input} > um_{wildcards.sample}.bam && '
        'mv um_{wildcards.sample}.bam analysis/results/unmapped/hybrid'
        
        
rule fastq_hybrid:
    input:
        rules.unmapped_hybrid.output
    output:
        read1 = 'analysis/results/unmapped/hybrid/read1_{sample}_hy.fastq',
        read2 = 'analysis/results/unmapped/hybrid/read2_{sample}_hy.fastq'
    shell:
        'samtools fastq '
        '-1 {output.read1} '
        '-2 {output.read2} '
        '-0 /dev/null '
        '-s /dev/null '
        '-n {input}'

I am trying to add a rule 'kraken' that runs the following for loop:

for FILE in $(ls analysis/results/unmapped/hybrid/read1_*_hy.fastq | sed 's/read1_*_hy.fastq//'); do \
        kraken2 --db {input.database} \
        --memory-mapping \
        --threads 8 \
        --use-names \
        --report analysis/results/kraken2/hybrid/{{}}${{FILE}}.kraken \
        --paired read1_${{FILE}}_hy.fastq read2_${{FILE}}_hy.fastq \
        --unclassified-out /analysis/results/unclassified/hybrid/{{}}${{FILE}}_uc#.fastq; done

the for loop is responsible for running all .fastq files at once for the kraken database to be stored in memory and run the classification for all samples. how can this be achieved without referring to the regular snakemake shell that takes wildcards and runs the command for each sample?

the normal way is to have the following:

rule kraken2:
    input:
        database = 'kraken2_db'
        read1 = 'analysis/results/unmapped/hybrid/read1_{sample}_hy.fastq'
        read2 = 'analysis/results/unmapped/hybrid/read2_{sample}_hy.fastq'
    output:
        'analysis/results/kraken2/hybrid/{sample}.kraken'
    shell:
        'kraken2 '
            '--db {input.database} '
            '--threads 8 '
            '--paired '
            '--use-names '
            '--unclassified-out {wildcards.sample}_uc#.fq '
            '--report {output} '
            '{input.read1} {input.read2}'

and rule all updated to the following:

rule all:
    input:
        expand('analysis/results/kraken2/hybrid/{sample}.kraken', sample = SAMPLES)

this will run each sample at a time and I am struggeling with running the for loop in the shell, and most of the time getting the error: MissingInputException in rule all in line 5 of /home/stud9/NAS/snakefile: Missing input files for rule all: affected files:

any suggestions?

Giang Le · Accepted Answer

I am not sure your loop script is working correctly. However, I will try to answer your question to run all fastq samples together.

ls analysis/results/unmapped/hybrid/

Here are my example files

read1_sample1_hy.fastq  read2_sample1_hy.fastq
read1_sample2_hy.fastq  read2_sample2_hy.fastq

To run all the samples at once you can use expand() in the rule.

rule all:
    input:
        'analysis/results/kraken2/hybrid/merged_samples.kraken'

rule kraken2:
    input:
        read1 = expand("analysis/results/unmapped/hybrid/read1_{sample}_hy.fastq",sample = SAMPLES),
        read2 = expand("analysis/results/unmapped/hybrid/read2_{sample}_hy.fastq", sample = SAMPLES)
    output:
        "analysis/results/kraken2/hybrid/merged_samples.kraken"
    shell:
        """
        echo kraken2 {input.read1} {input.read2} > {output}
        """

Snakemake will run all the read1 files and all read2 files together.

how to escape MissingOutputException while running a for loop in a rule in snakemake

Answers (2)

Related Questions