Dong Zhang

Reputation: 51

How to avoid "missing input files" error in Snakemake's "expand" function

I get a MissingInputException when I run the following Snakemake code:

import re
import os

glob_vars = glob_wildcards(os.path.join(os.getcwd(), "inputs","{fileName}.{ext}"))

rule end:
    input:
        expand(os.path.join(os.getcwd(), "inputs", "{fileName}_rename.fas"), fileName=glob_vars.fileName)

rule rename:
    '''
    rename fasta file to avoid problems
    '''
    input:
        expand("inputs/{{fileName}}.{ext}", ext=glob_vars.ext)
    output:
        os.path.join(os.getcwd(), "inputs", "{fileName}_rename.fas")
    run:
        list_ = []
        with open(str(input)) as f2:
            line = f2.readline()
            while line:
                while not line.startswith('>') and line:
                    line = f2.readline()
                fas_name = re.sub(r"\W", "_", line.strip())
                list_.append(fas_name)
                fas_seq = ""
                line = f2.readline()
                while not line.startswith('>') and line:
                    fas_seq += re.sub(r"\s","",line)
                    line = f2.readline()
                list_.append(fas_seq)
        with open(str(output), "w") as f:
            f.write("\n".join(list_))

My inputs folder contains these files:

G.bullatarudis.fasta
goldfish_protein.faa
guppy_protein.faa
gyrodactylus_salaris.fasta
protopolystoma_xenopodis.fa
salmon_protein.faa
schistosoma_mansoni.fa

The error message is:

Building DAG of jobs...
MissingInputException in line 10 of /home/zhangdong/works/NCBI/BLAST/RHB/test.rule:
Missing input files for rule rename:
inputs/guppy_protein.fasta
inputs/guppy_protein.fa

I assume the error is caused by the expand function: only guppy_protein.faa exists, but expand also generates guppy_protein.fasta and guppy_protein.fa. Is there a way to avoid this?

Upvotes: 1

Views: 637

Answers (2)

Dong Zhang

Reputation: 51

Thanks to Troy Comi, I modified my code and it worked:

import re
import os
import itertools

speciess,exts = glob_wildcards(os.path.join(os.getcwd(), "inputs_test","{species}.{ext}"))

rule end:
    input:
        expand("inputs_test/{species}_rename.fas", species=speciess)

def required_files(wildcards):
    list_combination = itertools.product([wildcards.species], list(set(exts)))
    exist_file = ""
    for file in list_combination:
        if os.path.exists(f"inputs_test/{'.'.join(file)}"):
            exist_file = f"inputs_test/{'.'.join(file)}"
    return exist_file

rule rename:
    '''
    rename fasta file to avoid problems
    '''
    input:
        required_files
    output:
        "inputs_test/{species}_rename.fas"
    run:
        list_ = []
        with open(str(input)) as f2:
            line = f2.readline()
            while line:
                while not line.startswith('>') and line:
                    line = f2.readline()
                fas_name = ">" + re.sub(r"\W", "_", line.replace(">", "").strip())
                list_.append(fas_name)
                fas_seq = ""
                line = f2.readline()
                while not line.startswith('>') and line:
                    fas_seq += re.sub(r"\s","",line)
                    line = f2.readline()
                list_.append(fas_seq)
        with open(str(output), "w") as f:
            f.write("\n".join(list_))
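A slightly tighter variant of required_files, sketched as a standalone function so it can be tried outside Snakemake. The name find_input and its directory parameter are hypothetical, not part of the code above. It raises an error instead of returning an empty string when no file matches, on the assumption that failing early gives a clearer message than an empty input would later on.

```python
import os


def find_input(species, exts, directory="inputs_test"):
    """Return the path of the first existing file named <species>.<ext>."""
    # Deduplicate and sort the extensions so the search order is deterministic.
    for ext in sorted(set(exts)):
        candidate = os.path.join(directory, f"{species}.{ext}")
        if os.path.exists(candidate):
            return candidate
    # Hypothetical choice: fail loudly rather than hand back an empty path.
    raise FileNotFoundError(f"no input file found for {species!r} in {directory}")
```

In the Snakefile this could be wired in as `input: lambda wildcards: find_input(wildcards.species, exts)`.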

Upvotes: 0

Troy Comi

Reputation: 2059

By default, expand produces all combinations of its input lists, so this is expected behavior. Your input needs to look up the proper extension for a given fileName. I haven't tested this:

glob_vars = glob_wildcards(os.path.join(os.getcwd(), "inputs","{fileName}.{ext}"))

# create a dict to lookup extensions given fileNames
glob_vars_dict = {fname: ex for fname, ex in zip(glob_vars.fileName, glob_vars.ext)}

def rename_input(wildcards):
    ext = glob_vars_dict[wildcards.fileName]
    return f"inputs/{wildcards.fileName}.{ext}"

rule rename:
    input: rename_input
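To illustrate the difference in plain Python (this is not Snakemake code, and all_combinations, ext_by_name, and paired are names made up for this sketch): the first list mimics what expand does by default, and the dict mirrors the lookup above. The file names are a subset of those listed in the question.

```python
from itertools import product

# File names and extensions as glob_wildcards would report them, paired by position.
file_names = ["guppy_protein", "schistosoma_mansoni", "gyrodactylus_salaris"]
exts = ["faa", "fa", "fasta"]

# Default expand behavior: every file name combined with every extension.
all_combinations = [f"inputs/{name}.{ext}" for name, ext in product(file_names, exts)]

# Pairwise lookup: each file name maps to the one extension it was found with.
ext_by_name = dict(zip(file_names, exts))
paired = [f"inputs/{name}.{ext}" for name, ext in ext_by_name.items()]

print(len(all_combinations))  # 3 names x 3 extensions
print(paired)
```

Only the paths in paired correspond to files that exist; the product also contains paths such as inputs/guppy_protein.fasta, which is what triggers the MissingInputException.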

A few unsolicited style comments:

  • You don't have to prepend os.getcwd() to your glob_wildcards pattern; glob_wildcards("inputs/{fileName}.{ext}") should work, as Snakemake resolves paths relative to the working directory by default.
  • Try to stick with snake_case instead of camelCase for your variable names in Python.
  • In this case, fileName isn't a great descriptor of what you are capturing. Maybe species_name or species would be clearer.

Upvotes: 1
