Reputation: 76
On a machine with j cores, given a RuleB which depends on a RuleA, I expect to Snakemake to execute my workflow as follows:
RuleA Sample1 using j threads
RuleA Sample2 using j threads
...
RuleA SampleN using j threads
RuleB Sample1 using 1 thread
RuleB Sample2 using 1 thread
...
RuleB SampleN using 1 thread
With RuleB being executed on j
samples simultaneously.
Instead the workflow is executed as follows:
RuleA Sample1 using j threads
RuleB Sample1 using 1 thread
RuleA Sample2 using j threads
RuleB Sample2 using 1 thread
...
with ruleB being executed on 1 sample at a time.
Executed in that order, ruleB can't be parallelised, and the workflow runs much slower than it could.
More specifically, I want to align reads to a genome using STAR, and quantify them using RNASeQC. The tool RNASEQC is single threaded, while STAR can work with multiple threads on a single sample.
This results in Snakemake aligning reads in sample1, and then quantifying them using rnaseqc, after which it proceeds to do the same in in sample2. I'd like it to reads in all samples first, and proceed to quantify them (this way, it would be able to run several instances of the single-threaded rnaseqc tool).
The relevant excerpt from the Snakemake file:
sample_basename = ["RNA-seq_L{}_S{}".format(x, y) for x,y in zip(range(1,41), range(1,41))]
sample_lane = [seq + "_L00{}".format(x) for x in [1, 2] for seq in sample_basename]
rule all:
input:
expand("rnaseqc/{s_l}/{s_l}.gene_tpm.gct", s_l=sample_lane)
rule run_star:
input:
index_dir=rules.star_index.output.index_dir,
fq1 = "data/fastq/{sample}_R1_001.fastq.gz",
fq2 = "data/fastq/{sample}_R2_001.fastq.gz",
output:
"star/{sample}/{sample}Aligned.sortedByCoord.out.bam",
"star/{sample}/{sample}Aligned.toTranscriptome.out.bam",
"star/{sample}/{sample}ReadsPerGene.out.tab",
"star/{sample}/{sample}Log.final.out"
log:
"logs/star/{sample}.log"
params:
extra="--quantMode GeneCounts TranscriptomeSAM --chimSegmentMin 20 --outSAMtype BAM SortedByCoordinate",
sample_name = "{sample}"
threads: 18
script:
"scripts/star_align.py"
rule rnaseqc:
input:
bam="star/{sample}/{sample}Aligned.sortedByCoord.out.bam",
gtf="data/gencode.v19.annotation.patched.collapsed.gtf"
output:
"rnaseqc/{sample}/{sample}.exon_reads.gct",
"rnaseqc/{sample}/{sample}.gene_fragments.gct",
"rnaseqc/{sample}/{sample}.gene_reads.gct",
"rnaseqc/{sample}/{sample}.gene_tpm.gct",
"rnaseqc/{sample}/{sample}.metrics.tsv"
params:
extra="-s {sample} --legacy",
output_dir="rnaseqc/{sample}"
log:
"logs/rnaseqc/{sample}"
shell:
"rnaseqc.v2.3.4.linux {params.extra} {input.gtf} {input.bam} {params.output_dir} 2> {log}"
Weirdly enough, doing a dry run with snakemake -np -j
does the correct thing:
[Mon Oct 21 13:08:11 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L182_S16_L002_R1_001.fastq.gz, data/fastq/RNA-seq_L182_S16_L002_R2_001.fastq.gz
output: star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Aligned.sortedByCoord.out.bam, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Aligned.toTranscriptome.out.bam, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002ReadsPerGene.out.tab, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Log.final.out
log: logs/star/RNA-seq_L182_S16_L002.log
jobid: 1026
wildcards: sample=RNA-seq_L182_S16_L002
threads: 18
[Mon Oct 21 13:08:11 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L173_S7_L001_R1_001.fastq.gz, data/fastq/RNA-seq_L173_S7_L001_R2_001.fastq.gz
output: star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Aligned.sortedByCoord.out.bam, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Aligned.toTranscriptome.out.bam, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001ReadsPerGene.out.tab, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Log.final.out
log: logs/star/RNA-seq_L173_S7_L001.log
jobid: 737
wildcards: sample=RNA-seq_L173_S7_L001
threads: 18
...
[Mon Oct 21 13:10:50 2019]
rule rnaseqc:
input: star/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.exon_reads.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_fragments.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_reads.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_tpm.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L221_S15_L001
jobid: 215
wildcards: sample=RNA-seq_L221_S15_L001
rnaseqc.v2.3.4.linux -s RNA-seq_L221_S15_L001 --legacy data/gencode.v19.annotation.patched.collapsed.gtf star/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001Aligned.sortedByCoord.out.bam rnaseqc/RNA-seq_L221_S15_L001 2> logs/rnaseqc/RNA-seq_L221_S15_L001
[Mon Oct 21 13:10:50 2019]
rule rnaseqc:
input: star/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.exon_reads.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_fragments.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_reads.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_tpm.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L284_S38_L001
jobid: 278
wildcards: sample=RNA-seq_L284_S38_L001
but executing snakemake -j
without the -np
flag does not.
[Mon Oct 21 13:13:49 2019]
rule run_star:
input: data/STAR/, data/fastq/RNA-seq_L249_S3_L001_R1_001.fastq.gz, data/fastq/RNA-seq_L249_S3_L001_R2_001.fastq.gz
output: star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.sortedByCoord.out.bam, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.toTranscriptome.out.bam, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001ReadsPerGene.out.tab, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Log.final.out
log: logs/star/RNA-seq_L249_S3_L001.log
jobid: 813
wildcards: sample=RNA-seq_L249_S3_L001
threads: 18
Aligning RNA-seq_L249_S3_L001
[Mon Oct 21 13:21:33 2019]
Finished job 813.
2 of 478 steps (0.42%) done
[Mon Oct 21 13:21:33 2019]
rule rnaseqc:
input: star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
output: rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.exon_reads.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_fragments.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_reads.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_tpm.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.metrics.tsv
log: logs/rnaseqc/RNA-seq_L249_S3_L001
jobid: 243
wildcards: sample=RNA-seq_L249_S3_L001
I'm using the latest version of Snakemake available through Conda: 5.5.2
Upvotes: 1
Views: 377
Reputation: 9062
Maybe what you are looking for is to give higher priority to the rule running STAR compared to the rule running rnaseqc. If so, look at the priorities directive, like:
rule star:
priority: 50
...
rule rnaseqc:
priority: 0
...
(Not tested) this should run first all the star jobs, one at a time because they need 18 cores each, then all the rnaseqc jobs in parallel.
Upvotes: 4