Reputation: 4929
I am very confused about how Snakemake parallelizes jobs within a rule. I'd like to use one core per input (treating each input separately, not joining them all with spaces into a single command), with each input producing multiple outputs.
Here's a simplified example of my code:
# Globals ---------------------------------------------------------------------

datasets = ["dataset_S1", "dataset_S2"]
methods = ["pbs", "pbs_windowed", "ihs", "xpehh"]

# Rules -----------------------------------------------------------------------

rule all:
    input:
        # Binary files
        expand("{dataset}/results_bin/persnp/{method}.feather", dataset=datasets, method=methods),
        expand("{dataset}/results_bin/pergene/{method}.feather", dataset=datasets, method=methods)

rule bin:
    input:
        "{dataset}/results_bin/convert2feather.R"
    output:
        "{dataset}/results_bin/persnp/{method}.feather",
        "{dataset}/results_bin/pergene/{method}.feather"
    threads:
        2
    shell:
        "Rscript {input}"
If I run the code above using snakemake -j2, I end up re-running each script once per output method, which is not what I want. If I use the expand() function for the input and output in the bin rule as well, I would end up with:
shell:
    """
    Rscript {input[0]}
    Rscript {input[1]}
    """
which I don't think can be parallelized.
What should I do to take each input separately so that I can use one core for each?
Any help would be greatly appreciated. Thx!
EDIT
Trying to better explain what my script does and what behavior I expect from Snakemake. See my example folder structure:
.
├── dataset_S1
│ ├── data
│ │ └── data.vcf
│ ├── results_bin
│ │ └── convert2feather.R
│ ├── task2
│ │ └── script.py
│ └── task3
│ └── script.sh
└── dataset_S2
├── data
│ └── data.vcf
├── results_bin
│ └── convert2feather.R
├── task2
│ └── script.py
└── task3
└── script.sh
As you can see, each dataset has folders with the same structure and identically named scripts (although the scripts' contents may differ a bit). In this example, the script reads the data.vcf file, manipulates it, then creates new folders and files in the respective dataset folder. The same overall task is repeated for both datasets. I'd like to set this up in a way that also works for the scripts in the folders task2, task3, and so on...
For instance, the output from my pipeline in this example will be:
.
├── dataset_S1
│ ├── data
│ │ └── data.vcf
│ └── results_bin
│ ├── convert2feather.R
│ ├── pergene
│ │ ├── ihs.feather
│ │ ├── pbs.feather
│ │ ├── pbs_windowed.feather
│ │ └── xpehh.feather
│ └── persnp
│ ├── ihs.feather
│ ├── pbs.feather
│ ├── pbs_windowed.feather
│ └── xpehh.feather
└── dataset_S2
├── data
│ └── data.vcf
└── results_bin
├── convert2feather.R
├── pergene
│ ├── ihs.feather
│ ├── pbs.feather
│ ├── pbs_windowed.feather
│ └── xpehh.feather
└── persnp
├── ihs.feather
├── pbs.feather
├── pbs_windowed.feather
└── xpehh.feather
EDIT2
File and commands used:
(snakemake) cmcouto-silva@datascience-IB:~/[email protected]/lab_files/phd_data$ snakemake -j2 -p
# Globals ---------------------------------------------------------------------

datasets = ["dataset_S1", "dataset_S2"]
methods = ["pbs", "pbs_windowed", "ihs", "xpehh"]

# Rules -----------------------------------------------------------------------

rule all:
    input:
        # Binary files
        expand("{dataset}/results_bin/persnp/{method}.feather", dataset=datasets, method=methods),
        expand("{dataset}/results_bin/pergene/{method}.feather", dataset=datasets, method=methods)

rule bin:
    input:
        "{dataset}/results_bin/convert2feather.R"
    output:
        expand("{{dataset}}/results_bin/persnp/{method}.feather", method=methods),
        expand("{{dataset}}/results_bin/pergene/{method}.feather", method=methods)
    threads:
        2
    shell:
        "Rscript {input}"
Output log:
(snakemake) cmcouto-silva@datascience-IB:~/[email protected]/lab_files/phd_data$ snakemake -j2 -p
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
2 bin
3
[Wed Sep 30 23:47:55 2020]
rule bin:
input: dataset_S1/results_bin/convert2feather.R
output: dataset_S1/results_bin/persnp/pbs.feather, dataset_S1/results_bin/persnp/pbs_windowed.feather, dataset_S1/results_bin/persnp/ihs.feather, dataset_S1/results_bin/persnp/xpehh.feather, dataset_S1/results_bin/pergene/pbs.feather, dataset_S1/results_bin/pergene/pbs_windowed.feather, dataset_S1/results_bin/pergene/ihs.feather, dataset_S1/results_bin/pergene/xpehh.feather
jobid: 1
wildcards: dataset=dataset_S1
threads: 2
Rscript dataset_S1/results_bin/convert2feather.R
Package "data.table" successfully loaded!
Package "magrittr" successfully loaded!
Package "snpsel" successfully loaded!
[Wed Sep 30 23:48:43 2020]
Finished job 1.
1 of 3 steps (33%) done
[Wed Sep 30 23:48:43 2020]
rule bin:
input: dataset_S2/results_bin/convert2feather.R
output: dataset_S2/results_bin/persnp/pbs.feather, dataset_S2/results_bin/persnp/pbs_windowed.feather, dataset_S2/results_bin/persnp/ihs.feather, dataset_S2/results_bin/persnp/xpehh.feather, dataset_S2/results_bin/pergene/pbs.feather, dataset_S2/results_bin/pergene/pbs_windowed.feather, dataset_S2/results_bin/pergene/ihs.feather, dataset_S2/results_bin/pergene/xpehh.feather
jobid: 2
wildcards: dataset=dataset_S2
threads: 2
Rscript dataset_S2/results_bin/convert2feather.R
Package "data.table" successfully loaded!
Package "magrittr" successfully loaded!
Package "snpsel" successfully loaded!
[Wed Sep 30 23:49:41 2020]
Finished job 2.
2 of 3 steps (67%) done
[Wed Sep 30 23:49:41 2020]
localrule all:
input: dataset_S1/results_bin/persnp/pbs.feather, dataset_S1/results_bin/persnp/pbs_windowed.feather, dataset_S1/results_bin/persnp/ihs.feather, dataset_S1/results_bin/persnp/xpehh.feather, dataset_S2/results_bin/persnp/pbs.feather, dataset_S2/results_bin/persnp/pbs_windowed.feather, dataset_S2/results_bin/persnp/ihs.feather, dataset_S2/results_bin/persnp/xpehh.feather, dataset_S1/results_bin/pergene/pbs.feather, dataset_S1/results_bin/pergene/pbs_windowed.feather, dataset_S1/results_bin/pergene/ihs.feather, dataset_S1/results_bin/pergene/xpehh.feather, dataset_S2/results_bin/pergene/pbs.feather, dataset_S2/results_bin/pergene/pbs_windowed.feather, dataset_S2/results_bin/pergene/ihs.feather, dataset_S2/results_bin/pergene/xpehh.feather
jobid: 0
[Wed Sep 30 23:49:41 2020]
Finished job 0.
3 of 3 steps (100%) done
Complete log: /home/cmcouto-silva/[email protected]/lab_files/phd_data/.snakemake/log/2020-09-30T234755.741940.snakemake.log
(snakemake) cmcouto-silva@datascience-IB:~/[email protected]/lab_files/phd_data$ cat /home/cmcouto-silva/[email protected]/lab_files/phd_data/.snakemake/log/2020-09-30T234755.741940.snakemake.log
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
2 bin
3
[Wed Sep 30 23:47:55 2020]
rule bin:
input: dataset_S1/results_bin/convert2feather.R
output: dataset_S1/results_bin/persnp/pbs.feather, dataset_S1/results_bin/persnp/pbs_windowed.feather, dataset_S1/results_bin/persnp/ihs.feather, dataset_S1/results_bin/persnp/xpehh.feather, dataset_S1/results_bin/pergene/pbs.feather, dataset_S1/results_bin/pergene/pbs_windowed.feather, dataset_S1/results_bin/pergene/ihs.feather, dataset_S1/results_bin/pergene/xpehh.feather
jobid: 1
wildcards: dataset=dataset_S1
threads: 2
Rscript dataset_S1/results_bin/convert2feather.R
[Wed Sep 30 23:48:43 2020]
Finished job 1.
1 of 3 steps (33%) done
[Wed Sep 30 23:48:43 2020]
rule bin:
input: dataset_S2/results_bin/convert2feather.R
output: dataset_S2/results_bin/persnp/pbs.feather, dataset_S2/results_bin/persnp/pbs_windowed.feather, dataset_S2/results_bin/persnp/ihs.feather, dataset_S2/results_bin/persnp/xpehh.feather, dataset_S2/results_bin/pergene/pbs.feather, dataset_S2/results_bin/pergene/pbs_windowed.feather, dataset_S2/results_bin/pergene/ihs.feather, dataset_S2/results_bin/pergene/xpehh.feather
jobid: 2
wildcards: dataset=dataset_S2
threads: 2
Rscript dataset_S2/results_bin/convert2feather.R
[Wed Sep 30 23:49:41 2020]
Finished job 2.
2 of 3 steps (67%) done
[Wed Sep 30 23:49:41 2020]
localrule all:
input: dataset_S1/results_bin/persnp/pbs.feather, dataset_S1/results_bin/persnp/pbs_windowed.feather, dataset_S1/results_bin/persnp/ihs.feather, dataset_S1/results_bin/persnp/xpehh.feather, dataset_S2/results_bin/persnp/pbs.feather, dataset_S2/results_bin/persnp/pbs_windowed.feather, dataset_S2/results_bin/persnp/ihs.feather, dataset_S2/results_bin/persnp/xpehh.feather, dataset_S1/results_bin/pergene/pbs.feather, dataset_S1/results_bin/pergene/pbs_windowed.feather, dataset_S1/results_bin/pergene/ihs.feather, dataset_S1/results_bin/pergene/xpehh.feather, dataset_S2/results_bin/pergene/pbs.feather, dataset_S2/results_bin/pergene/pbs_windowed.feather, dataset_S2/results_bin/pergene/ihs.feather, dataset_S2/results_bin/pergene/xpehh.feather
jobid: 0
[Wed Sep 30 23:49:41 2020]
Finished job 0.
3 of 3 steps (100%) done
Complete log: /home/cmcouto-silva/[email protected]/lab_files/phd_data/.snakemake/log/2020-09-30T234755.741940.snakemake.log
Upvotes: 0
Views: 400
Reputation: 326
Am I getting it right that you have two input files (your scripts, one per dataset) for this rule and you want them to run in parallel? If so, you need to give the snakemake call twice the number of cores that you defined in the rule.
The threads field in the rule gives the number of cores you want to use for this rule per input/iteration. So the first dataset will use 2 cores and the second dataset will also use 2 cores. To run them both in parallel, you'd need to call snakemake -j4.
I hope I understood your problem, if not, feel free to correct me.
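The other direction also works: keep snakemake -j2 but declare threads: 1 in the rule, so each dataset job claims only one core and the scheduler can start both jobs at once. A minimal sketch, assuming the outputs are grouped per dataset as in the other answer and that the R script itself is single-threaded:

rule bin:
    input:
        "{dataset}/results_bin/convert2feather.R"
    output:
        expand("{{dataset}}/results_bin/persnp/{method}.feather", method=methods),
        expand("{{dataset}}/results_bin/pergene/{method}.feather", method=methods)
    threads:
        1  # one core per dataset job, so -j2 can run dataset_S1 and dataset_S2 concurrently
    shell:
        "Rscript {input}"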
Upvotes: 1
Reputation: 4089
I'm not sure I understand correctly, but it appears to me that each "dataset" infile should produce one outfile per "method" (under both persnp and pergene). If so, this should work.
rule bin:
    input:
        "{dataset}/results_bin/convert2feather.R"
    output:
        expand("{{dataset}}/results_bin/persnp/{method}.feather", method=methods),
        expand("{{dataset}}/results_bin/pergene/{method}.feather", method=methods)
Upvotes: 2