Guillaume Corre

Reputation: 11

snakemake : index config file with wildcard in output of rule

My objective is to write a rule that generates the genome index for read alignment, based on the organism indicated in a sample information CSV file.

Each library can be human or mouse (or other) and I would like the pipeline to be as universal as possible.

My pipeline starts with:

import pandas as pd

configfile: "config.yml"

samples = pd.read_table(config["sampleInfo_path"],sep=";").set_index("sampleName", drop=False)

to collect sample information and load the config file.

SAMPLE info file:

sampleName organism
sampleA mouse
sampleB human

The path to the reference genome is indicated in the config file:

yaml CONFIG file:

genome:
  human:
    fasta: "/media/References/Human/Genecode/GRch38/Sequences/GRCh38.primary_assembly.genome.fa"
    index: "/media/References/Human/Genecode/GRch38/Indexes/Bowtie2/GRCh38.primary_assembly.genome" # path to index created during the run if not existing yet
    annotation: "/media/References/Human/Genecode/GRch38/Annotations/gencode.v46.annotation.gtf.gz"

So for each sampleName, I want to look up its organism in the sample info file and then use that value to extract the path to the organism's fasta file from the config file. The nested YAML path would look like:

config['genome'][organism_value_extracted]['fasta']
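As a minimal sketch of that two-step lookup in plain Python (outside Snakemake): the inline sample sheet and the `config` dict below are stand-ins for the real files, and `genome_entry` is a hypothetical helper name, not part of Snakemake. Only the human entry is included because it is the only one shown in the config.

```python
import csv
import io

# Stand-in for the sample sheet; the ";" separator matches the read_table call above.
SAMPLE_SHEET = """sampleName;organism
sampleA;mouse
sampleB;human
"""

# Stand-in for the parsed config.yml; only the human entry appears in the question.
config = {
    "genome": {
        "human": {
            "fasta": "/media/References/Human/Genecode/GRch38/Sequences/GRCh38.primary_assembly.genome.fa",
            "index": "/media/References/Human/Genecode/GRch38/Indexes/Bowtie2/GRCh38.primary_assembly.genome",
        }
    }
}

# Map each sampleName to its organism.
organism_of = {
    row["sampleName"]: row["organism"]
    for row in csv.DictReader(io.StringIO(SAMPLE_SHEET), delimiter=";")
}


def genome_entry(sample, key):
    """Return config['genome'][organism][key] for the organism of `sample`."""
    organism = organism_of[sample]
    if organism not in config["genome"]:
        # Fail with a readable message when the config lacks this organism.
        raise KeyError(f"no genome entry for organism {organism!r} in config")
    return config["genome"][organism][key]


print(genome_entry("sampleB", "fasta"))
```

With this stand-in config, asking for `sampleA` raises a `KeyError`, because no mouse entry is defined.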

The snakemake rule looks like this:

rule index:
    input: lambda wildcards: config["genome"][samples["organism"][wildcards.sample]]["fasta"]
    output: config['genome'][samples["organism"]["{sample}"]]["index"]
    shell: """
        bowtie2 ... {input} {output}
        """

Unfortunately, I cannot make the output work.

Using config['genome']["human"]["index"] works like a charm, but I cannot find a way to substitute "human" with the value from samples["organism"][wildcards.sample].

I tried different syntaxes, lambdas, and functions, but none of them work in output.

My snakemake version is 8.20.5

Thanks for any help you could provide.

Upvotes: 1

Views: 32

Answers (1)

kEks

Reputation: 588

This is not exactly what you asked for, and maybe someone will step in with the "right" answer. But I think the canonical way to solve your problem is to not define the output path in your config. Snakemake does not accept functions in output, because wildcard values are inferred by matching the requested files against the output patterns. Without that restriction the Snakefile becomes very simple, and you get more reproducible results, since the file tree of your output is consistent.

import pandas as pd
configfile: "config.yml"
samples = pd.read_table(config["sampleInfo_path"],sep=";").set_index("sampleName", drop=False)


rule all:
    input:
        expand("index/{organism}/{organism}_index.genome", organism=["mouse", "human"])

rule index:
    input: 
        lambda wc: config["genome"][wc.organism]["fasta"]
    output: 
        "index/{organism}/{organism}_index.genome"
    shell: 
        """
        touch {output}
        """
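The organism list in rule all is hard-coded above; it could probably be derived from the sample sheet instead. Here is a rough sketch in plain Python, so it runs outside Snakemake: `organism_of` is a hypothetical stand-in for the `samples` table, and `index_targets` and `index_for_sample` are illustrative helper names, not Snakemake functions.

```python
# Hypothetical sample-to-organism mapping, mirroring the sample sheet in the question.
organism_of = {"sampleA": "mouse", "sampleB": "human"}

# Same pattern as the output of rule index above.
INDEX_PATTERN = "index/{organism}/{organism}_index.genome"


def index_targets(mapping):
    """One index path per distinct organism, in sorted order."""
    return [INDEX_PATTERN.format(organism=o) for o in sorted(set(mapping.values()))]


def index_for_sample(sample):
    """Index path for the organism of one sample."""
    return INDEX_PATTERN.format(organism=organism_of[sample])


print(index_targets(organism_of))
print(index_for_sample("sampleA"))
```

In a Snakefile, the list from `index_targets` could serve as the input of rule all, and something like `lambda wc: index_for_sample(wc.sample)` could be the input of a per-sample alignment rule, since input (unlike output) accepts functions of the wildcards.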

Upvotes: 1
