Ralph matar
Ralph matar

Reputation: 35

how to escape MissingOutputException while running a for loop in a rule in snakemake

I have the following snakefile:

import os

SAMPLES = [i.replace('sample_', '') for i in os.listdir('final_raw')]

rule all:
    input:
        expand('analysis/results/unmapped/hybrid/read1_{sample}_hy.fastq', sample = SAMPLES)
        
        
rule trimming:
    input: 
        read1 = 'final_raw/sample_{sample}/read1_sample_{sample}.fastq.gz',
        read2 = 'final_raw/sample_{sample}/read2_sample_{sample}.fastq.gz', 
        barcodes = 'final_bc.fasta'
    output:
        'analysis/trimmed_data/{sample}_read_1.fastq.gz',
        'analysis/trimmed_data/{sample}_read_2.fastq.gz'
    threads: 6
    shell:
        'flexbar -threads {threads} '
        '-r {input.read1} '
        '-p {input.read2} '
        '-aa Nextera -br {input.barcodes} '
        '-t analysis/trimmed_data/{wildcards.sample}_read '
        '-z GZ'
        
        
rule create_hybrid_genome:
    input:
        human = 'genomes/human/GRCh38.fna',
        covid = 'genomes/cov/cov_genome.fna'
    output:
        'genomes/hybrid/hybrid.fna'
    shell:
        'cat {input.human} {input.covid} > {output}'
        
        
rule hybrid_indexing:
    input:
        rules.create_hybrid_genome.output
    output:
        'genomes/hybrid/hybrid.fna.amb'
    shell:
        'bwa index {input}'
        
        
rule map_hybrid:
    input:
        read1 = 'analysis/trimmed_data/{sample}_read_1.fastq.gz',
        read2 = 'analysis/trimmed_data/{sample}_read_2.fastq.gz',
        genome = 'genomes/hybrid/hybrid.fna'
    output:
        'analysis/results/mapped/hybrid/{sample}.sorted.bam'
    shell:
        'bwa mem -t 8 {input.genome} {input.read1} {input.read2} | '
        'samtools sort -@ 2 -o {wildcards.sample}.sorted.bam && '
        'samtools index -@ 8 {wildcards.sample}.sorted.bam && '
        'mv {wildcards.sample}.sorted.* analysis/results/mapped/hybrid'
        
        
rule unmapped_hybrid:
    input: 
        rules.map_hybrid.output
    output:
        'analysis/results/unmapped/hybrid/um_{sample}.bam'
    shell:
        'samtools view -b -f 4 {input} > um_{wildcards.sample}.bam && '
        'mv um_{wildcards.sample}.bam analysis/results/unmapped/hybrid'
        
        
rule fastq_hybrid:
    input:
        rules.unmapped_hybrid.output
    output:
        read1 = 'analysis/results/unmapped/hybrid/read1_{sample}_hy.fastq',
        read2 = 'analysis/results/unmapped/hybrid/read2_{sample}_hy.fastq'
    shell:
        'samtools fastq '
        '-1 {output.read1} '
        '-2 {output.read2} '
        '-0 /dev/null '
        '-s /dev/null '
        '-n {input}'

I am trying to add a rule 'kraken' that runs the following for loop:

for FILE in $(ls analysis/results/unmapped/hybrid/read1_*_hy.fastq | sed 's/read1_*_hy.fastq//'); do \
        kraken2 --db {input.database} \
        --memory-mapping \
        --threads 8 \
        --use-names \
        --report analysis/results/kraken2/hybrid/{{}}${{FILE}}.kraken \
        --paired read1_${{FILE}}_hy.fastq read2_${{FILE}}_hy.fastq \
        --unclassified-out /analysis/results/unclassified/hybrid/{{}}${{FILE}}_uc#.fastq; done

the for loop is responsible for running all .fastq files at once for the kraken database to be stored in memory and run the classification for all samples. how can this be achieved without referring to the regular snakemake shell that takes wildcards and runs the command for each sample?

the normal way is to have the following:

rule kraken2:
    input:
        database = 'kraken2_db'
        read1 = 'analysis/results/unmapped/hybrid/read1_{sample}_hy.fastq'
        read2 = 'analysis/results/unmapped/hybrid/read2_{sample}_hy.fastq'
    output:
        'analysis/results/kraken2/hybrid/{sample}.kraken'
    shell:
        'kraken2 '
            '--db {input.database} '
            '--threads 8 '
            '--paired '
            '--use-names '
            '--unclassified-out {wildcards.sample}_uc#.fq '
            '--report {output} '
            '{input.read1} {input.read2}'

and rule all updated to the following:

rule all:
    input:
        expand('analysis/results/kraken2/hybrid/{sample}.kraken', sample = SAMPLES)

this will run each sample at a time and I am struggeling with running the for loop in the shell, and most of the time getting the error: MissingInputException in rule all in line 5 of /home/stud9/NAS/snakefile: Missing input files for rule all: affected files:

any suggestions?

Upvotes: 0

Views: 113

Answers (2)

Troy Comi
Troy Comi

Reputation: 2079

Not sure this is the problem, nor why you need to run them at once, but the sed command in your for loop is matching read1_*_hy.fastq, e.g. read1 followed by 0 or more _ followed by _hy.fastq. You probably want it to be

 $(ls analysis/results/unmapped/hybrid/read1_*_hy.fastq | sed 's/read1_(.*)_hy.fastq/\1/')

To get the sample names from the filenames. Check by echoing the FILE and command you hope to run to see if your bash looks good.

Upvotes: 0

Giang Le
Giang Le

Reputation: 150

I am not sure your loop script is working correctly. However, I will try to answer your question to run all fastq samples together.

ls analysis/results/unmapped/hybrid/

Here are my example files

read1_sample1_hy.fastq  read2_sample1_hy.fastq
read1_sample2_hy.fastq  read2_sample2_hy.fastq

To run all the samples at once you can use expand() in the rule.

rule all:
    input:
        'analysis/results/kraken2/hybrid/merged_samples.kraken'

rule kraken2:
    input:
        read1 = expand("analysis/results/unmapped/hybrid/read1_{sample}_hy.fastq",sample = SAMPLES),
        read2 = expand("analysis/results/unmapped/hybrid/read2_{sample}_hy.fastq", sample = SAMPLES)
    output:
        "analysis/results/kraken2/hybrid/merged_samples.kraken"
    shell:
        """
        echo kraken2 {input.read1} {input.read2} > {output}
        """

Snakemake will run all the read1 files and all read2 files together.

Upvotes: 0

Related Questions