Reputation: 1148
I am just getting started with Snakemake and was wondering what the "correct" way is to run a set of parameters on the same file, and how this works when chaining rules.
For example, say I want to apply multiple normalization methods, followed by a clustering rule with a varying number of clusters k. What would be the best way to do this so that all combinations are run?
I started doing this:
INFILES = ["mytable"]

rule preprocess:
    input:
        bam=expand("data/{sample}.csv",
                   sample=INFILES,
                   param=config["normmethod"])
    output:
        bamo=expand("results/{sample}_pp_{param}.csv",
                    sample=INFILES,
                    param=config["normmethod"])
    script:
        "scripts/preprocess.py"
And then invoked the script via:
snakemake --config normmethod=Median
But that doesn't really scale for further options later in the workflow. For example, how would I automatically incorporate this set of options?
normmethods= ["Median", "Quantile"]
kclusters= [1,3,5,7,10]
Upvotes: 3
Views: 6147
Reputation: 472
This answer is similar to @Shiping's answer, which uses wildcards in the output of a rule to implement multiple parameters per input file. However, this answer provides a more detailed example and avoids complex list comprehensions, regular expressions, or the glob module.
@Pereira Hugo's approach uses one job to run all parameter combinations for one input file, whereas the approach in this answer uses one job per parameter combination per input file, which makes it easier to parallelize the execution of the parameter combinations for one input file.
Snakefile:

import os

data_dir = 'data'
sample_fns = os.listdir(data_dir)
sample_pfxes = list(map(lambda p: p[:p.rfind('.')], sample_fns))

res_dir = 'results'

params1 = [1, 2]
params2 = ['a', 'b', 'c']

rule all:
    input:
        expand(os.path.join(res_dir, '{sample}_p1_{param1}_p2_{param2}.csv'),
               sample=sample_pfxes, param1=params1, param2=params2)

rule preprocess:
    input:
        csv=os.path.join(data_dir, '{sample}.csv')
    output:
        csv=os.path.join(res_dir, '{sample}_p1_{param1}_p2_{param2}.csv')
    shell:
        "ls {input.csv} && \
        echo P1: {wildcards.param1}, P2: {wildcards.param2} > {output.csv}"
Directory structure before running snakemake:
$ tree .
.
├── Snakefile
├── data
│ ├── sample_1.csv
│ ├── sample_2.csv
│ └── sample_3.csv
└── results
Run snakemake:
$ snakemake -p
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
18 preprocess
19
rule preprocess:
input: data/sample_1.csv
output: results/sample_1_p1_2_p2_a.csv
jobid: 1
wildcards: param2=a, sample=sample_1, param1=2
ls data/sample_1.csv && echo P1: 2, P2: a > results/sample_1_p1_2_p2_a.csv
data/sample_1.csv
Finished job 1.
1 of 19 steps (5%) done
rule preprocess:
input: data/sample_2.csv
output: results/sample_2_p1_2_p2_a.csv
jobid: 2
wildcards: param2=a, sample=sample_2, param1=2
ls data/sample_2.csv && echo P1: 2, P2: a > results/sample_2_p1_2_p2_a.csv
data/sample_2.csv
Finished job 2.
2 of 19 steps (11%) done
...
localrule all:
input: results/sample_1_p1_1_p2_a.csv, results/sample_1_p1_2_p2_a.csv, results/sample_2_p1_1_p2_a.csv, results/sample_2_p1_2_p2_a.csv, results/sample_3_p1_1_p2_a.csv, results/sample_3_p1_2_p2_a.csv, results/sample_1_p1_1_p2_b.csv, results/sample_1_p1_2_p2_b.csv, results/sample_2_p1_1_p2_b.csv, results/sample_2_p1_2_p2_b.csv, results/sample_3_p1_1_p2_b.csv, results/sample_3_p1_2_p2_b.csv, results/sample_1_p1_1_p2_c.csv, results/sample_1_p1_2_p2_c.csv, results/sample_2_p1_1_p2_c.csv, results/sample_2_p1_2_p2_c.csv, results/sample_3_p1_1_p2_c.csv, results/sample_3_p1_2_p2_c.csv
jobid: 0
Finished job 0.
19 of 19 steps (100%) done
Directory structure after running snakemake:
$ tree . [18:51:12]
.
├── Snakefile
├── data
│ ├── sample_1.csv
│ ├── sample_2.csv
│ └── sample_3.csv
└── results
├── sample_1_p1_1_p2_a.csv
├── sample_1_p1_1_p2_b.csv
├── sample_1_p1_1_p2_c.csv
├── sample_1_p1_2_p2_a.csv
├── sample_1_p1_2_p2_b.csv
├── sample_1_p1_2_p2_c.csv
├── sample_2_p1_1_p2_a.csv
├── sample_2_p1_1_p2_b.csv
├── sample_2_p1_1_p2_c.csv
├── sample_2_p1_2_p2_a.csv
├── sample_2_p1_2_p2_b.csv
├── sample_2_p1_2_p2_c.csv
├── sample_3_p1_1_p2_a.csv
├── sample_3_p1_1_p2_b.csv
├── sample_3_p1_1_p2_c.csv
├── sample_3_p1_2_p2_a.csv
├── sample_3_p1_2_p2_b.csv
└── sample_3_p1_2_p2_c.csv
Sample result:
$ cat results/sample_2_p1_1_p2_a.csv [19:12:36]
P1: 1, P2: a
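The question also asks about chaining rules. With this layout a downstream step simply consumes the preprocessed file and adds its own wildcard, so each (sample, param1, param2, k) combination again becomes its own job. A rough sketch (kclusters, clust_dir and the cluster rule below are hypothetical additions, not part of the run shown above; rule all would then request the clustered targets instead):

kclusters = [1, 3, 5, 7, 10]
clust_dir = 'clustered'

# In rule all, request the final targets instead:
#     expand(os.path.join(clust_dir, '{sample}_p1_{param1}_p2_{param2}_k{k}.csv'),
#            sample=sample_pfxes, param1=params1, param2=params2, k=kclusters)

rule cluster:
    input:
        csv=os.path.join(res_dir, '{sample}_p1_{param1}_p2_{param2}.csv')
    output:
        csv=os.path.join(clust_dir, '{sample}_p1_{param1}_p2_{param2}_k{k}.csv')
    shell:
        "echo clustering {input.csv} with k={wildcards.k} > {output.csv}"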
Upvotes: 1
Reputation: 337
You did well using the function expand() in your rule.
For the parameters, I recommend using a configuration file containing all of them. Snakemake works with both YAML and JSON configuration files; the Snakemake documentation covers these two formats.
In your case, you just have to write this in a YAML file:
INFILES : "mytables"

normmethods : ["Median", "Quantile"]

or

normmethods :
    - "Median"
    - "Quantile"

kclusters : [1, 3, 5, 7, 10]

or

kclusters :
    - 1
    - 3
    - 5
    - 7
    - 10
Write your rule like this:

rule preprocess:
    input:
        bam = expand("data/{sample}.csv",
                     sample = config["INFILES"])
    params:
        kcluster = config["kclusters"]
    output:
        bamo = expand("results/{sample}_pp_{method}_{cluster}.csv",
                      sample = config["INFILES"],
                      method = config["normmethods"],
                      cluster = config["kclusters"])
    script:
        "scripts/preprocess.py"
Then you just have to launch it like this:

snakemake --configfile path/to/config.yml

To run with other parameters, you only have to modify your configuration file and not your Snakefile, which means fewer mistakes and better readability.
EDIT:

rule preprocess:
    input:
        bam = "data/{sample}.csv"

Just to correct my own mistake: you don't need expand() on the input here, since you want to run the rule on one .csv file at a time. So just put the wildcard there and Snakemake will do its part.
Upvotes: 9
Reputation: 1337
Seems you didn't pass the params to your script. How about something like the following?
import re
import os
import glob

normmethods = ["Median", "Quantile"]  # can be set from config['normmethods']
kclusters = [1, 3, 5, 7, 10]          # can be set from config['kclusters']

# One target file per (input file, normalization method, k) combination
INFILES = ['results/' + re.sub(r'\.csv$', '_pp_' + m + '-' + str(k) + '.csv',
                               re.sub('data/', '', fn))
           for fn in glob.glob("data/*.csv")
           for m in normmethods
           for k in kclusters]

rule cluster:
    input: INFILES

rule preprocess:
    input:
        bam="data/{sample}.csv"
    output:
        bamo="results/{sample}_pp_{m}-{k}.csv"
    run:
        os.system("scripts/preprocess.py %s %s %s %s" %
                  (input.bam, output.bamo, wildcards.m, wildcards.k))
Upvotes: 5