Reputation: 11
My objective is to write a rule that generates the genome index for read alignment based on the organism indicated in a sample information CSV file.
Each library can be human or mouse (or other) and I would like the pipeline to be as universal as possible.
My pipeline starts with:
import pandas as pd
configfile: "config.yml"
samples = pd.read_table(config["sampleInfo_path"], sep=";").set_index("sampleName", drop=False)
to collect sample information and load the config file.
SAMPLE info file:
sampleName | organism |
---|---|
sampleA | mouse |
sampleB | human |
The path to the reference genome is indicated in the config file:
yaml CONFIG file:
genome:
  human:
    fasta: "/media/References/Human/Genecode/GRch38/Sequences/GRCh38.primary_assembly.genome.fa"
    index: "/media/References/Human/Genecode/GRch38/Indexes/Bowtie2/GRCh38.primary_assembly.genome" # path to index created during the run if not existing yet
    annotation: "/media/References/Human/Genecode/GRch38/Annotations/gencode.v46.annotation.gtf.gz"
So for each sampleName, I want to pick its organism in the sampleInfo file and then use this value to extract the path to the fasta file corresponding to the organism in the config file. The yaml nested path would look like:
config['genome'][organism_value_extracted]['fasta']
The snakemake rule looks like this:
rule index:
    input: lambda wildcards: config["genome"][samples["organism"][wildcards.sample]]["fasta"]
    output: config['genome'][samples["organism"]["{sample}"]]["index"]
    shell: """
        bowtie2 ... {input} {output}
        """
Unfortunately, I cannot make the output work.
Using config['genome']["human"]["index"] works like a charm, but I cannot find a way to substitute "human" with the value from samples["organism"][wildcards.sample].
I tried different syntaxes, lambdas and functions, but none of them work in output.
My snakemake version is 8.20.5
Thanks for any help you could provide.
Upvotes: 1
Views: 32
Reputation: 588
This is not exactly what you asked for, and maybe someone will step in with the "right" answer. Snakemake only accepts functions for input, not for output: an output has to be a fixed path pattern built from wildcards. So I think the canonical way to solve your problem is to not define the output path in your config. Without that restriction the Snakefile becomes very simple and you get more reproducible results, since the file tree of your output is consistent.
import pandas as pd

configfile: "config.yml"

samples = pd.read_table(config["sampleInfo_path"], sep=";").set_index("sampleName", drop=False)

rule all:
    input:
        expand("index/{organism}/{organism}_index.genome", organism=["mouse", "human"])

rule index:
    input:
        lambda wc: config["genome"][wc.organism]["fasta"]
    output:
        "index/{organism}/{organism}_index.genome"
    shell:
        """
        touch {output}
        """
Upvotes: 1
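If you would rather not hard-code the organism list in rule all, it could probably be derived from the sample sheet, and a per-sample rule (alignment, for instance) could look up the index that belongs to its sample. Below is a minimal sketch of those two lookups in plain Python. The sample sheet content, the extra sampleC row, and the helper names load_samples, organisms_in_sheet and index_for_sample are my own assumptions for illustration; only the ";" separator, the column names and the index path pattern come from the snippets above.

```python
import csv
import io

# Hypothetical sample sheet, mirroring the ";"-separated file from the question.
SAMPLE_SHEET = """sampleName;organism
sampleA;mouse
sampleB;human
sampleC;mouse
"""

# Same output pattern as in rule index.
INDEX_PATTERN = "index/{organism}/{organism}_index.genome"


def load_samples(handle):
    """Map each sampleName to its organism."""
    reader = csv.DictReader(handle, delimiter=";")
    return {row["sampleName"]: row["organism"] for row in reader}


def organisms_in_sheet(samples):
    """Unique organisms in first-seen order, e.g. to pass to expand() in rule all."""
    return list(dict.fromkeys(samples.values()))


def index_for_sample(samples, sample):
    """Index path that a per-sample rule would request as its input."""
    return INDEX_PATTERN.format(organism=samples[sample])


samples = load_samples(io.StringIO(SAMPLE_SHEET))
print(organisms_in_sheet(samples))           # ['mouse', 'human']
print(index_for_sample(samples, "sampleB"))  # index/human/human_index.genome
```

In the Snakefile itself, where samples is the pandas table indexed by sampleName, the equivalent would presumably be organism=samples["organism"].unique() inside expand(), and an input function such as lambda wc: INDEX_PATTERN.format(organism=samples.loc[wc.sample, "organism"]) in the rule that consumes the index. The organism stays a wildcard of the index rule, so each genome is indexed once no matter how many samples share it.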