tedtoal
tedtoal

Reputation: 1070

How can Snakemake be made to update files in a hierarchical rule-based manner when a new file appears at the bottom of the hierarchy?

I have a snakefile with dozens of rules, and it processes thousands of files. This is a bioinformatics pipeline for DNA sequencing analysis. Today I added two more samples to my set of samples, and I expected to be able to run snakemake and it would automatically determine which rules to run on which files to process the new sample files and all files that depend on them on up the hierarchy to the very top level. However, it does nothing. And the -R option doesn't do it either.

The problem is illustrated with this snakefile:

> cat tst
rule A:
    output: "test1.txt"
    input: "test2.txt"
    shell: "cp {input} {output}"

rule B:
    output: "test2.txt"
    input: "test3.txt"
    shell: "cp {input} {output}"

rule C:
    output: "test3.txt"
    input: "test4.txt"
    shell: "cp {input} {output}"

rule D:
    output: "test4.txt"
    input: "test5.txt"
    shell: "cp {input} {output}"

Execute it as follows:

> rm test*.txt
> touch test2.txt
> touch test1.txt
> snakemake -s tst -F

Output is:

Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   A
    1

rule A:
    input: test2.txt
    output: test1.txt
    jobid: 0

Finished job 0.
1 of 1 steps (100%) done

Since test5.txt does not exist, I expected an error message to that effect, but it did not happen. And of course, test3.txt and test4.txt do not exist.

Furthermore, using -R instead of -F results in "Nothing to be done." Using "-R A" runs rule A only.

This relates to my project in that it shows that Snakemake does not analyze the entire dependent tree if you tell it to build a rule at the top of the tree and that rule's output and input files already exist. And the -R option does not force it either. When I tried -F on my project, it started rebuilding the entire thing, including files that did not need to be rebuilt.

It seems to me that this is fundamental to what Snakemake should be doing, and I just don't understand it. The only way I can see to get my pipeline to analyze the new samples is to individually invoke each rule required for the new files, in order. And that is way too tedious and is one reason why I used Snakemake in the first place.

Help!

Upvotes: 0

Views: 925

Answers (2)

Johannes Köster
Johannes Köster

Reputation: 1927

Snakemake does not automatically trigger re-runs when adding new input files (e.g. samples) to the DAG. However, you can enforce this as outlined in the FAQ.

The reason for not doing this by default is mostly consistency: in order to do this, Snakemake needs to store meta information. Hence, if the meta information is lost, you would have a different behavior than if it was there. However, I might change this in the future. With such fundamental changes though, I am usually very careful in order to not forget a counter example where the current default behavior is of advantage.

Upvotes: 1

Barry Moore
Barry Moore

Reputation: 21

Remember that snakemake wants to satisfy the dependency of the first rule and builds the graph by pulling additional dependencies through the rest of the graph to satisfy that initial dependency. By touching test2.txt you've satisfied the dependency for the first rule, so nothing more needs to be done. Even with -R A nothing else needs to be run to satisfy the dependency of rule A - the files already exist.

Snakemake definitely does do what you want (add new samples and the entire rule graph runs on those samples) and you don't need to individually invoke each rule, but it seems to me that you might be thinking of the dependencies wrong. I'm not sure I fully understand where your new samples fit into the tst example you've given but I see at least two possibilites.

Your graph dependency runs D->C->B->A, so if you're thinking that you've added new input data at the top (i.e. a new sample as test5.txt in rule D), then you need to be sure that you have a dependency at your endpoint (test2.txt in rule A). By touching test2.txt you've just completed your pipeline, so no dependencies exist. If touch test5.txt (that's your new data) then your example works and the entire graph runs.

Since you touched test1.txt and test2.txt in your example execution maybe you intended those to represent the new samples. If so then you need to rethink your dependency graph, because adding them doesn't create a dependency on the rest of the graph. In your example, the test2.txt file is your terminal dependency (the final dependency of your workflow not the input to it). In your tst example new data needs come is as test5.txt as input to rule D (the top of your graph) and get pulled through the dependency graph to satisfy an input dependency of rule A which is test2.txt. If you're thinking of either test1.txt or test2.txt as your new input then you need to remember that snakemake pulls data through the graph to satisfy dependencies at the bottom of the graph, it doesn't push data from the top down. Run snakemake -F --rulegraph see that the graph runs D->C->B->A and so new data needs to come is as input to rule D and be pulled through the graph as a dependency to rule A.

Upvotes: 0

Related Questions