How do I get Snakemake to apply all samples to a single rule, before proceeding to the next rule?

Question

On a machine with j cores, given a RuleB which depends on a RuleA, I expect to Snakemake to execute my workflow as follows:

RuleA Sample1 using j threads
RuleA Sample2 using j threads
...
RuleA SampleN using j threads
RuleB Sample1 using 1 thread
RuleB Sample2 using 1 thread
...
RuleB SampleN using 1 thread

With RuleB being executed on j samples simultaneously.

Instead the workflow is executed as follows:

RuleA Sample1 using j threads
RuleB Sample1 using 1 thread
RuleA Sample2 using j threads
RuleB Sample2 using 1 thread
...

with ruleB being executed on 1 sample at a time.

Executed in that order, ruleB can't be parallelised, and the workflow runs much slower than it could.

More specifically, I want to align reads to a genome using STAR, and quantify them using RNASeQC. The tool RNASEQC is single threaded, while STAR can work with multiple threads on a single sample.

This results in Snakemake aligning reads in sample1, and then quantifying them using rnaseqc, after which it proceeds to do the same in in sample2. I'd like it to reads in all samples first, and proceed to quantify them (this way, it would be able to run several instances of the single-threaded rnaseqc tool).

The relevant excerpt from the Snakemake file:

sample_basename = ["RNA-seq_L{}_S{}".format(x, y) for x,y in zip(range(1,41), range(1,41))]
sample_lane = [seq + "_L00{}".format(x) for x in [1, 2] for seq in sample_basename]

rule all:
    input:
        expand("rnaseqc/{s_l}/{s_l}.gene_tpm.gct", s_l=sample_lane)

rule run_star:
    input: 
        index_dir=rules.star_index.output.index_dir, 
        fq1 = "data/fastq/{sample}_R1_001.fastq.gz",
        fq2 = "data/fastq/{sample}_R2_001.fastq.gz",
    output:
        "star/{sample}/{sample}Aligned.sortedByCoord.out.bam",
        "star/{sample}/{sample}Aligned.toTranscriptome.out.bam",
        "star/{sample}/{sample}ReadsPerGene.out.tab",
        "star/{sample}/{sample}Log.final.out"
    log:
        "logs/star/{sample}.log"
    params:
        extra="--quantMode GeneCounts TranscriptomeSAM --chimSegmentMin 20 --outSAMtype BAM SortedByCoordinate",
        sample_name = "{sample}"
    threads: 18 
    script:
        "scripts/star_align.py"

rule rnaseqc:
    input: 
        bam="star/{sample}/{sample}Aligned.sortedByCoord.out.bam",
        gtf="data/gencode.v19.annotation.patched.collapsed.gtf"
    output:
        "rnaseqc/{sample}/{sample}.exon_reads.gct",
        "rnaseqc/{sample}/{sample}.gene_fragments.gct",
        "rnaseqc/{sample}/{sample}.gene_reads.gct",
        "rnaseqc/{sample}/{sample}.gene_tpm.gct",
        "rnaseqc/{sample}/{sample}.metrics.tsv"
    params:
        extra="-s {sample} --legacy",
        output_dir="rnaseqc/{sample}"
    log:
        "logs/rnaseqc/{sample}"
    shell:
        "rnaseqc.v2.3.4.linux {params.extra} {input.gtf} {input.bam} {params.output_dir} 2> {log}"

Weirdly enough, doing a dry run with snakemake -np -j does the correct thing:

[Mon Oct 21 13:08:11 2019]
rule run_star:
    input: data/STAR/, data/fastq/RNA-seq_L182_S16_L002_R1_001.fastq.gz, data/fastq/RNA-seq_L182_S16_L002_R2_001.fastq.gz
    output: star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Aligned.sortedByCoord.out.bam, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Aligned.toTranscriptome.out.bam, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002ReadsPerGene.out.tab, star/RNA-seq_L182_S16_L002/RNA-seq_L182_S16_L002Log.final.out
    log: logs/star/RNA-seq_L182_S16_L002.log
    jobid: 1026
    wildcards: sample=RNA-seq_L182_S16_L002
    threads: 18

[Mon Oct 21 13:08:11 2019]
rule run_star:
    input: data/STAR/, data/fastq/RNA-seq_L173_S7_L001_R1_001.fastq.gz, data/fastq/RNA-seq_L173_S7_L001_R2_001.fastq.gz
    output: star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Aligned.sortedByCoord.out.bam, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Aligned.toTranscriptome.out.bam, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001ReadsPerGene.out.tab, star/RNA-seq_L173_S7_L001/RNA-seq_L173_S7_L001Log.final.out
    log: logs/star/RNA-seq_L173_S7_L001.log
    jobid: 737
    wildcards: sample=RNA-seq_L173_S7_L001
    threads: 18
...
[Mon Oct 21 13:10:50 2019]
rule rnaseqc:
    input: star/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
    output: rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.exon_reads.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_fragments.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_reads.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.gene_tpm.gct, rnaseqc/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001.metrics.tsv
    log: logs/rnaseqc/RNA-seq_L221_S15_L001
    jobid: 215
    wildcards: sample=RNA-seq_L221_S15_L001

rnaseqc.v2.3.4.linux -s RNA-seq_L221_S15_L001 --legacy data/gencode.v19.annotation.patched.collapsed.gtf star/RNA-seq_L221_S15_L001/RNA-seq_L221_S15_L001Aligned.sortedByCoord.out.bam rnaseqc/RNA-seq_L221_S15_L001 2> logs/rnaseqc/RNA-seq_L221_S15_L001

[Mon Oct 21 13:10:50 2019]
rule rnaseqc:
    input: star/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
    output: rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.exon_reads.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_fragments.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_reads.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.gene_tpm.gct, rnaseqc/RNA-seq_L284_S38_L001/RNA-seq_L284_S38_L001.metrics.tsv
    log: logs/rnaseqc/RNA-seq_L284_S38_L001
    jobid: 278
    wildcards: sample=RNA-seq_L284_S38_L001

but executing snakemake -j without the -np flag does not.


[Mon Oct 21 13:13:49 2019]
rule run_star:
    input: data/STAR/, data/fastq/RNA-seq_L249_S3_L001_R1_001.fastq.gz, data/fastq/RNA-seq_L249_S3_L001_R2_001.fastq.gz
    output: star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.sortedByCoord.out.bam, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.toTranscriptome.out.bam, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001ReadsPerGene.out.tab, star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Log.final.out
    log: logs/star/RNA-seq_L249_S3_L001.log
    jobid: 813
    wildcards: sample=RNA-seq_L249_S3_L001
    threads: 18

Aligning RNA-seq_L249_S3_L001
[Mon Oct 21 13:21:33 2019]
Finished job 813.
2 of 478 steps (0.42%) done

[Mon Oct 21 13:21:33 2019]
rule rnaseqc:
    input: star/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001Aligned.sortedByCoord.out.bam, data/gencode.v19.annotation.patched.collapsed.gtf
    output: rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.exon_reads.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_fragments.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_reads.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.gene_tpm.gct, rnaseqc/RNA-seq_L249_S3_L001/RNA-seq_L249_S3_L001.metrics.tsv
    log: logs/rnaseqc/RNA-seq_L249_S3_L001
    jobid: 243
    wildcards: sample=RNA-seq_L249_S3_L001

I'm using the latest version of Snakemake available through Conda: 5.5.2

dariober · Accepted Answer

Maybe what you are looking for is to give higher priority to the rule running STAR compared to the rule running rnaseqc. If so, look at the priorities directive, like:

rule star:
    priority: 50
    ...

rule rnaseqc:
    priority: 0
    ...

(Not tested) this should run first all the star jobs, one at a time because they need 18 cores each, then all the rnaseqc jobs in parallel.

How do I get Snakemake to apply all samples to a single rule, before proceeding to the next rule?

Answers (1)

Related Questions