ricopan

Reputation: 682

Recommended python scientific workflow management tool that defines dependency completeness on parameter state rather than time?

It's past time for me to move from my custom scientific workflow management tool (python) to a community effort. In brief, my workflow involves long-running (days) processes with a large number of shared parameters. Viewed as a dependency graph, the nodes are tasks that produce output or do some other work. That much seems fairly universal across workflow tools.

However, key to my needs is that each task is defined by the parameters it requires. Tasks are instantiated with respect to the state of those parameters and all parameters of their dependencies. Thus if a task has completed its job under a given parameter state, it is complete and is not rerun. This parameter state is NOT the global parameter state but only what is relevant to that part of the DAG. This reliance on parameter state rather than completion time appears to be the essential difference between my needs and existing tools (at least from a quick look at Luigi and Airflow). Completion time might be one such parameter, but in general it is not time that determines a (re)run of the DAG; it is whether the parameter state is congruent with the parameter state of the calling task. There are issues that are non-trivial (to me) around 'parameter explosion' and its relationship to parameter state and the DAG, but those are not my question here.
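To make the idea concrete, here is a minimal sketch of what 'complete' means in my framework; the task attributes and storage interface are hypothetical stand-ins, not any existing library's API:

    import hashlib
    import json

    def relevant_state(task):
        """The subset of the global parameter state that actually determines
        this task's output: its own parameters plus those of all upstream
        tasks (parameter names assumed unique across tasks, for brevity)."""
        state = dict(task.params)
        for dep in task.dependencies:   # recurse over the sub-DAG feeding this task
            state.update(relevant_state(dep))
        return state

    def state_hash(task):
        """Stable digest of the relevant parameter state; identical hashes
        imply identical output (stochastic elements aside)."""
        payload = json.dumps(relevant_state(task), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def is_complete(task):
        """Complete iff output tagged with this exact parameter state already
        exists; no timestamps are consulted."""
        return task.storage.exists(state_hash(task))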

My question -- which existing python tool would most readily allow defining 'complete' with respect to this parameter state? It has been suggested that Luigi is compatible with my needs if I write a custom complete method that compares the metadata of built data ('targets') with the needed parameter state.
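For reference, here is a minimal sketch of that suggestion. It uses Luigi's actual Task/Target API, but the metadata sidecar file is my own invention -- Luigi does not itself store parameter state next to targets:

    import json

    import luigi

    class Analysis(luigi.Task):
        alpha = luigi.FloatParameter()
        n_iter = luigi.IntParameter()

        def output(self):
            return luigi.LocalTarget("analysis_result.dat")

        def complete(self):
            # Luigi's default complete() only asks whether the targets exist;
            # here we additionally require that the stored parameter state
            # matches the state this task instance was built with.
            out = self.output()
            meta = luigi.LocalTarget(out.path + ".meta.json")
            if not (out.exists() and meta.exists()):
                return False
            with meta.open("r") as f:
                return json.load(f) == self.to_str_params()

        def run(self):
            with self.output().open("w") as f:
                f.write("...stand-in for days of computation...\n")
            # record the parameter state alongside the target
            with luigi.LocalTarget(self.output().path + ".meta.json").open("w") as f:
                json.dump(self.to_str_params(), f)

With this override, changing alpha or n_iter makes the task incomplete again even though the target file still exists, which is the behavior I'm after.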

How about Airflow? I don't see any mention of this issue, but I have only briefly perused the docs. Since adding this functionality is a significant effort that takes away from my 'scientific' work, I would like to start out with the better-suited tool. Airflow definitely has momentum, but my needs may be too far from its purpose. Defining the complete parameter state is needed for two reasons: 1) with complex, long-running tasks, I can't just re-run the DAG every time I change some parameter in the very large global parameter state, and 2) I need to know how the intermediate and final results were produced, for scientific and data-integrity reasons.

Upvotes: 2

Views: 377

Answers (1)

ricopan

Reputation: 682

I looked further into Luigi and Airflow, and as far as I can discern neither is suitable for modification to my needs. The primary incompatibility is that both tools are fundamentally based on predetermined DAGs/workflows. My existing framework operates on instantiated, fully specified DAGs that are discovered at run time rather than described concisely up front -- necessary because knowing whether each task is complete, for a given request, depends on the many combinations of parameter values that define the output of that task and the utilized output of all upstream tasks. By instantiated, I mean that the 'intermediate' results of individual runs are each described by the full parameter state (variable values) necessary to reproduce (notwithstanding any stochastic element) identical output from that task.

So a 'Scheduler' that operates on a DAG ahead of time is not useful.

In general, most of the existing workflow frameworks I've glanced at, at least in python, appear to be designed to automate many relatively simple tasks in an easily scalable and robust manner with parallelization. Little emphasis is put on incrementally building up more complex analyses: linking complex, expensive computational tasks whose output must be reused when possible and may in turn serve as input for an additional, unforeseen analysis.

I just discovered the 'Prefect' workflow tool this morning, and am intrigued to learn more -- at least it is clearly well funded ;-). My initial sense is that it may be less reliant on pre-scheduling and thus more fluid and more readily adapted to my needs, but that's just a hunch. In many ways some of my more complex 'single' tasks might be well suited to wrapping an entire Prefect Flow, if they played nicely together. It seems my needs sit at the far end of the spectrum of deep, complicated DAGs (I will not try to write mine out!) with never-ending downstream additions.

I'm going to look into Prefect and Luigi more closely and see what I can borrow to make my framework more robust and less baroque. Or maybe I can add a layer of full data description to Prefect...

UPDATE -- after discussing with the Prefect folks, it's clear that I need to start with the underlying Dask and see whether it is flexible enough -- perhaps using Dask delayed or futures. Clearly Dask is extraordinary. Graphchain, built on top of Dask, is a move in the right direction: it facilitates permanent storage of 'intermediate' output computed over a dependency 'chain', identified by a hash of the code base and parameters. That's pretty close to what I need, though I want more explicit handling of the parameters that deterministically define the outputs.
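In case it helps anyone on the same path, this is the sort of Dask-level sketch I have in mind: dask.delayed builds the graph, and a hand-rolled on-disk cache keyed by a hash of each task's parameters decides completeness. The caching decorator is my own stand-in, roughly what Graphchain does more thoroughly by also hashing the code:

    import hashlib
    import json
    import pickle
    from pathlib import Path

    import dask

    CACHE = Path("results_cache")
    CACHE.mkdir(exist_ok=True)

    def cached(func):
        """Run func only if no output exists for this exact parameter state."""
        def wrapper(**params):
            # Upstream results arrive here already computed by dask, so hashing
            # the kwargs hashes the upstream output as a proxy for its parameters.
            key = hashlib.sha256(
                json.dumps({"task": func.__name__, **params}, sort_keys=True).encode()
            ).hexdigest()
            path = CACHE / f"{key}.pkl"
            if path.exists():                  # complete w.r.t. parameter state
                return pickle.loads(path.read_bytes())
            result = func(**params)
            path.write_bytes(pickle.dumps(result))
            return result
        return wrapper

    @dask.delayed
    @cached
    def simulate(alpha, seed):
        return [alpha * i for i in range(10)]  # stand-in for days of simulation

    @dask.delayed
    @cached
    def summarize(data, threshold):
        return sum(x for x in data if x > threshold)

    result = summarize(data=simulate(alpha=0.5, seed=1), threshold=2.0)
    print(result.compute())

Re-running with the same alpha/seed/threshold touches only the cache; changing any of them re-executes just the affected sub-graph, which is the 'complete with respect to parameter state' behavior I described in the question.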

Upvotes: 2
