Reputation: 69
I've been trying to manually create snakemake wildcards by importing a tab-delimited file that looks as follows:
dataset sample species frr
PRJNA493818_GSE120639_SRP162872 SRR7942395_GSM3406786_sAML_Control_1 Homo_sapiens 1
PRJNA493818_GSE120639_SRP162872 SRR7942395_GSM3406786_sAML_Control_1 Homo_sapiens 2
PRJNA362883_GSE93946_SRP097621 SRR5195524_GSM2465521_KrasT_45649_NoDox Mus_musculus 1
PRJNA362883_GSE93946_SRP097621 SRR5195524_GSM2465521_KrasT_45649_NoDox Mus_musculus 2
This is what my Snakefile looks like (minimal example):
import pandas as pd
import os

# --- Importing Configuration Files --- #
configfile: "/DATA/config/config.yaml"

table_cols = ['dataset','sample','species','frr']
table_samples = pd.read_table('/DATA/config/samples.tsv', header=0, sep='\t', names=table_cols)

DATASET = table_samples.dataset.values.tolist()
SAMPLE = table_samples['sample'].values.tolist()
SPECIES = table_samples.species.values.tolist()
FRR = table_samples.frr.values.tolist()

print(DATASET,SAMPLE,SPECIES,FRR)

rule all:
    input:
        expand(config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.html", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, frr=FRR)

## fastq files quality control
rule rawFastqc:
    input:
        rawread=config["project_path"]+"resources/raw_datasets/{dataset}/{sample}_{species}_RNA-Seq_{frr}.fastq.gz"
    output:
        zip=config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.zip",
        html=config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.html"
    threads:
        12
    params:
        path=config["project_path"]+"results/{dataset}/rawQC/"
    conda:
        "envs/bulkRNAseq.yaml"
    shell:
        """
        fastqc {input.rawread} --threads {threads} -o {params.path}
        """
When I run:
snakemake -s test --use-conda -n -p
This is the output:
['PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872', 'PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621'] ['SRR7942395_GSM3406786_sAML_Control_1', 'SRR7942395_GSM3406786_sAML_Control_1', 'SRR5195524_GSM2465521_KrasT_45649_NoDox', 'SRR5195524_GSM2465521_KrasT_45649_NoDox'] ['Homo_sapiens', 'Homo_sapiens', 'Mus_musculus', 'Mus_musculus'] [1, 2, 1, 2]
Building DAG of jobs...
Job counts:
count jobs
1 all
4 rawFastqc
5
[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
input: /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz
output: /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1_fastqc.zip, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1_fastqc.html
jobid: 3
wildcards: dataset=PRJNA362883_GSE93946_SRP097621, sample=SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus, species=musculus, frr=1
threads: 12
fastqc /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1.fastq.gz --threads 12 -o /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/
[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
input: /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1.fastq.gz
output: /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1_fastqc.zip, /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1_fastqc.html
jobid: 1
wildcards: dataset=PRJNA493818_GSE120639_SRP162872, sample=SRR7942395_GSM3406786_sAML_Control_1_Homo, species=sapiens, frr=1
threads: 12
fastqc /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1.fastq.gz --threads 12 -o /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/
[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
input: /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz
output: /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2_fastqc.zip, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2_fastqc.html
jobid: 4
wildcards: dataset=PRJNA362883_GSE93946_SRP097621, sample=SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus, species=musculus, frr=2
threads: 12
fastqc /DATA/resources/raw_datasets/PRJNA362883_GSE93946_SRP097621/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2.fastq.gz --threads 12 -o /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/
[Thu Aug 11 00:57:30 2022]
rule rawFastqc:
input: /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2.fastq.gz
output: /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2_fastqc.zip, /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2_fastqc.html
jobid: 2
wildcards: dataset=PRJNA493818_GSE120639_SRP162872, sample=SRR7942395_GSM3406786_sAML_Control_1_Homo, species=sapiens, frr=2
threads: 12
fastqc /DATA/resources/raw_datasets/PRJNA493818_GSE120639_SRP162872/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2.fastq.gz --threads 12 -o /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/
[Thu Aug 11 00:57:30 2022]
localrule all:
input: /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_1_fastqc.html, /DATA/results/PRJNA493818_GSE120639_SRP162872/rawQC/SRR7942395_GSM3406786_sAML_Control_1_Homo_sapiens_RNA-Seq_2_fastqc.html, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_1_fastqc.html, /DATA/results/PRJNA362883_GSE93946_SRP097621/rawQC/SRR5195524_GSM2465521_KrasT_45649_NoDox_Mus_musculus_RNA-Seq_2_fastqc.html
jobid: 0
Job counts:
count jobs
1 all
4 rawFastqc
5
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
It's clear that print(DATASET,SAMPLE,SPECIES,FRR) produces my desired wildcard values:
['PRJNA493818_GSE120639_SRP162872', 'PRJNA493818_GSE120639_SRP162872', 'PRJNA362883_GSE93946_SRP097621', 'PRJNA362883_GSE93946_SRP097621'] ['SRR7942395_GSM3406786_sAML_Control_1', 'SRR7942395_GSM3406786_sAML_Control_1', 'SRR5195524_GSM2465521_KrasT_45649_NoDox', 'SRR5195524_GSM2465521_KrasT_45649_NoDox'] ['Homo_sapiens', 'Homo_sapiens', 'Mus_musculus', 'Mus_musculus'] [1, 2, 1, 2]
However, Snakemake subsequently does not take these into account and produces the wrong wildcard values, even though I'm not using glob_wildcards.
I'm clearly missing something, but I can't figure out what I'm doing wrong. I've also looked into the following post: Manually create snakemake wildcards.
Help would be much appreciated!
Upvotes: 2
Views: 470
Reputation: 9062
This is an attempt to explain why snakemake generates wildcard values that do not match the user's values. Consider this tiny example.
We want to create the file a_b.A_B.txt, and this Snakefile will do the job:
FOO = ['a_b']
BAR = ['A_B']

rule all:
    input:
        expand('{foo}.{bar}.txt', zip, foo=FOO, bar=BAR),

rule one:
    output:
        '{x}_{y}.txt',
    shell:
        r"""
        touch {output}
        """
Try it with snakemake -p -j 1 -n:
Building DAG of jobs...
...
[Thu Aug 11 09:43:30 2022]
rule one:
output: a_b.A_B.txt
jobid: 1
reason: Missing output files: a_b.A_B.txt
wildcards: x=a_b.A, y=B
resources: tmpdir=/tmp
touch a_b.A_B.txt
...
It works, but a couple of counterintuitive things happen:

- Wildcard names in rule all, {foo} and {bar}, do not match those in rule one, {x} and {y}. Snakemake doesn't care about that; it just sees wildcards that can take any value.
- Rule all asks for file {wildcard1}.{wildcard2} (note the dot .) but rule one has output {wildcard1}_{wildcard2} (note the underscore _). There seems to be a mismatch and the script should fail. Instead it works because...
- Snakemake has the freedom to find regex values that match inputs to outputs. In this case, assigning wildcards x=a_b.A and y=B will match the requested a_b.A_B.txt. (In my opinion this is counterintuitive.)
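To make the greedy matching concrete, here is a small, purely illustrative Python sketch (not Snakemake's actual implementation): each unconstrained wildcard defaults to the regex .+, and greedy matching then splits a_b.A_B.txt exactly the way Snakemake reports:

import re

# Illustration only: the output pattern '{x}_{y}.txt' behaves roughly like this
# regex, because each unconstrained wildcard defaults to the greedy pattern '.+'.
pattern = re.compile(r"(?P<x>.+)_(?P<y>.+)\.txt")

m = pattern.match("a_b.A_B.txt")
print(m.group("x"), m.group("y"))  # -> a_b.A B  (the same split shown in the dry-run above)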
Rewriting this snakefile using constraints leads to less surprising behavior:
import re

FOO = ['a_b']
BAR = ['A_B']

wildcard_constraints:
    x='|'.join([re.escape(x) for x in FOO]),
    y='|'.join([re.escape(x) for x in BAR]),

rule all:
    input:
        expand('{foo}.{bar}.txt', zip, foo=FOO, bar=BAR),

rule one:
    output:
        '{x}_{y}.txt',
    shell:
        r"""
        touch {output}
        """
Fails with:
snakemake -p -j 1 -n
Building DAG of jobs...
MissingInputException in line 8 of /home/dario/Downloads/Snakefile:
Missing input files for rule all:
affected files:
a_b.A_B.txt
This version works and uses the expected wildcard values:
import re

FOO = ['a_b']
BAR = ['A_B']

wildcard_constraints:
    x='|'.join([re.escape(x) for x in FOO]),
    y='|'.join([re.escape(x) for x in BAR]),

rule all:
    input:
        expand('{foo}.{bar}.txt', zip, foo=FOO, bar=BAR),

rule one:
    output:
        '{x}.{y}.txt', # Use dot, not underscore
    shell:
        r"""
        touch {output}
        """
snakemake -p -j 1 -n
...
rule one:
output: a_b.A_B.txt
jobid: 1
reason: Missing output files: a_b.A_B.txt
wildcards: x=a_b, y=A_B
resources: tmpdir=/tmp
touch a_b.A_B.txt
...
Note that we get the expected wildcard values x=a_b and y=A_B. I would also use consistent naming of wildcards (foo, bar OR x, y) unless there is a good reason not to.
The constraint x='|'.join([re.escape(x) for x in FOO]) tells Snakemake that wildcard x can only take values matching the regular expression produced by '|'.join([re.escape(x) for x in FOO]) (| is the pipe symbol, not an uppercase I). So the value a_b.A will not match and the script will fail. re.escape ensures that special characters like . and * are not interpreted as regex metacharacters.
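As a quick illustration of what that constraint string evaluates to (the values below are made up for the example, not from the question):

import re

FOO = ['a_b', 'c.d']  # made-up values; note the second one contains a dot
constraint = '|'.join([re.escape(x) for x in FOO])
print(constraint)  # a_b|c\.d  (exact escaping may vary slightly between Python versions)

# A wildcard constrained with this regex only accepts values from the list:
print(bool(re.fullmatch(constraint, 'a_b')))    # True
print(bool(re.fullmatch(constraint, 'a_b.A')))  # False -> the greedy split is ruled out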
This is my understanding - I think it would be good to have dedicated documentation about it...
Upvotes: 3
Reputation: 16581
This is an alternative approach that leverages the Python side of the Snakefile. The idea here is to construct a set of rules, each generating one specific output.
import pandas as pd

# --- Importing Configuration Files --- #
configfile: "/DATA/config/config.yaml"

table_cols = ['dataset','sample','species','frr']
table_samples = pd.read_table('/DATA/config/samples.tsv', header=0, sep='\t', names=table_cols)

DATASET = table_samples.dataset.values.tolist()
SAMPLE = table_samples['sample'].values.tolist()
SPECIES = table_samples.species.values.tolist()
FRR = table_samples.frr.values.tolist()

print(DATASET,SAMPLE,SPECIES,FRR)

rule all:
    input:
        expand(config["project_path"]+"results/{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.html", zip, dataset=DATASET, sample=SAMPLE, species=SPECIES, frr=FRR)

## fastq files quality control
for dataset, sample, species, frr in zip(DATASET, SAMPLE, SPECIES, FRR):
    rule:
        name: f'{dataset}+{sample}+{species}+{frr}' # look how easy it is to find another symbol to separate the wildcards
        output:
            # I am simplifying this as formatting nested code is tricky,
            # but the key idea is that you apply something like f-string formatting
            # to essentially hard-code specific outputs for specific rules
            abc=f"{dataset}/rawQC/{sample}_{species}_RNA-Seq_{frr}_fastqc.zip"
        shell: # make sure to hard-code the shell contents with f-string formatting as well
            f"echo 'dataset is {dataset}'"
This approach is less desirable, however, since the number of rules grows fast (one rule for each desired output) and in general debugging these rules might get trickier. The purpose of the code above is to show that in principle it's possible to avoid using wildcard_constraints even when working with filenames that are difficult/ambiguous for snakemake to parse.
Upvotes: 0
Reputation: 16581
As mentioned here, the naming should be improved. If there is no way around this, one option is to apply wildcard_constraints:
# regex-safe approach as suggested by @dariober
from re import escape

UNIQ_SPECIES = [escape(x) for x in table_samples.species.unique()]

wildcard_constraints:
    species="|".join(UNIQ_SPECIES)
Upvotes: 2