Reputation: 1023
I am currently writing a Snakefile, which does a lot of post-alignment quality control (CollectInsertSizeMetics, CollectAlignmentSummaryMetrics, CollectGcBiasMetrics
, ...).
At the very end of the Snakefile, I am running multiQC to combine all the metrics in one html report.
I know that if I use the output of rule A as input of rule B, rule B will only be executed after rule A is finished.
The problem in my case is that the input of multiQC is a directory, which exists right from the start. Inside of this directory, multiQC will search for certain files and then create the report.
If I am currently executing my Snakemake file, multiQC will be executed before all quality controls will be performed (e.g. fastqc
takes quite some time), thus these are missing in the final report.
So my question is, if there is an option, that specifies that a certain rule is executed last.
I know that I could use --wait-for-files
to wait for a certain fastqc
report, but that seems very inflexible.
The last rule currently looks like this:
rule multiQC:
input:
input_dir = "post-alignment-qc"
output:
output_html="post-alignment-qc/multiQC/mutliqc-report.html"
log:
err='post-alignment-qc/logs/fastQC/multiqc_stderr.err'
benchmark:
"post-alignment-qc/benchmark/multiQC/multiqc.tsv"
shell:
"multiqc -f -n {output.output_html} {input.input_dir} 2> {log.err}"
Any help is appreciated!
Upvotes: 3
Views: 997
Reputation: 9062
You could give to the input of multiqc
rule the files produced by the individual QC rules. In this way, multiqc will start once all those files are available:
samples = ['a', 'b', 'c']
rule collectInsertSizeMetrics:
input:
'{sample}.bam',
output:
'post-alignment-qc/{sample}.insertSizeMetrics.txt' # <- Some file produced by CollectInsertSizeMetrics
shell:
"CollectInsertSizeMetics {input} > {output}"
rule CollectAlignmentSummaryMetrics:
output:
'post-alignment-qc/{sample}.CollectAlignmentSummaryMetrics.txt'
rule multiqc:
input:
expand('post-alignment-qc/{sample}.insertSizeMetrics.txt', sample=samples),
expand('post-alignment-qc/{sample}.CollectAlignmentSummaryMetrics.txt', sample=samples),
shell:
"multiqc -f -n {output.output_html} post-alignment-qc 2> {log.err}"
Upvotes: 3
Reputation: 8108
This seems like a classic A-B-problem to me. Your hidden assumption is that because multiqc
asks for a directory in its command line arguments, the Snakemake rule's input needs to be a directory. This is wrong.
Inputs should be all the things that required for a rule to be able to run. Having a folder exist, does not satisfy this requirement, because the folder can be empty. This is exactly the problem you're running into.
What you really need to be there are the files that are produced by other commands to be present in the folder. So you should define the required input to be the files. That's the Snakemake way. Stop thinking in terms of: this rule needs to run last. Think in terms of: these files need to be there for this rule to run.
Upvotes: 1