ricopan

Reputation: 682

Recommended python scientific workflow management tool that defines dependency completeness on parameter state rather than time?

It's past time for me to move from my custom scientific workflow management tool (python) to a community effort. In brief, my workflow involves long-running (days) processes with a large number of shared parameters. Viewed as a dependency graph, the nodes are tasks that produce output or do some other work. That much seems fairly universal across workflow tools.

However, key to my needs is that each task is defined by the parameters it requires. Tasks are instantiated with respect to the state of those parameters and all parameters of their dependencies. Thus if a task has completed its job under a given parameter state, it is complete and is not rerun. This parameter state is NOT the global parameter state but only what is relevant to that part of the DAG. This reliance on parameter state rather than completion time appears to be the essential difference between my needs and existing tools (at least from a quick look at Luigi and Airflow). Completion time might be one such parameter, but in general it is not time that determines a (re)run of the DAG; it is whether the parameter state is congruent with the parameter state of the calling task. There are issues that are non-trivial (to me) around 'parameter explosion' and its relationship to parameter state and the DAG, but those are not my question here.
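To make the idea concrete, here is a minimal sketch of what 'complete' means in my framework; the task attributes and storage interface are hypothetical stand-ins, not any existing library's API:

    import hashlib
    import json

    def relevant_state(task):
        """The subset of the global parameter state that actually determines
        this task's output: its own parameters plus those of all upstream
        tasks (parameter names assumed unique across tasks, for brevity)."""
        state = dict(task.params)
        for dep in task.dependencies:   # recurse over the sub-DAG feeding this task
            state.update(relevant_state(dep))
        return state

    def state_hash(task):
        """Stable digest of the relevant parameter state; identical hashes
        imply identical output (stochastic elements aside)."""
        payload = json.dumps(relevant_state(task), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def is_complete(task):
        """Complete iff output tagged with this exact parameter state already
        exists; no timestamps are consulted."""
        return task.storage.exists(state_hash(task))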

My question -- which existing python tool would most readily allow defining 'complete' with respect to this parameter state? It has been suggested that Luigi is compatible with my needs if I write a custom complete method that compares the metadata of built data ('targets') with the needed parameter state.
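For reference, here is a minimal sketch of that suggestion. It uses Luigi's actual Task/Target API, but the metadata sidecar file is my own invention -- Luigi does not itself store parameter state next to targets:

    import json

    import luigi

    class Analysis(luigi.Task):
        alpha = luigi.FloatParameter()
        n_iter = luigi.IntParameter()

        def output(self):
            return luigi.LocalTarget("analysis_result.dat")

        def complete(self):
            # Luigi's default complete() only asks whether the targets exist;
            # here we additionally require that the stored parameter state
            # matches the state this task instance was built with.
            out = self.output()
            meta = luigi.LocalTarget(out.path + ".meta.json")
            if not (out.exists() and meta.exists()):
                return False
            with meta.open("r") as f:
                return json.load(f) == self.to_str_params()

        def run(self):
            with self.output().open("w") as f:
                f.write("...stand-in for days of computation...\n")
            # record the parameter state alongside the target
            with luigi.LocalTarget(self.output().path + ".meta.json").open("w") as f:
                json.dump(self.to_str_params(), f)

With this override, changing alpha or n_iter makes the task incomplete again even though the target file still exists, which is the behavior I'm after.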

How about Airflow? I don't see any mention of this issue, but I have only briefly perused the docs. Since adding this functionality is a significant effort that takes away from my 'scientific' work, I would like to start out with the better-suited tool. Airflow definitely has momentum, but my needs may be too far from its purpose. Defining the complete parameter state is needed for two reasons: 1) with complex, long-running tasks, I can't just re-run the DAG every time I change some parameter in the very large global parameter state, and 2) I need to know how the intermediate and final results were produced, for scientific and data-integrity reasons.

Upvotes: 2

Views: 377

Answers (1)

ricopan

Reputation: 682

I looked further into Luigi and Airflow, and as far as I can discern neither is suitable for modification to my needs. The primary incompatibility is that both tools are fundamentally based on predetermined DAGs/workflows. My existing framework operates on instantiated, fully specified DAGs that are discovered at run time rather than described concisely up front -- necessary because knowing whether each task is complete, for a given request, depends on the many combinations of parameter values that define the output of that task and the utilized output of all upstream tasks. By instantiated, I mean that the 'intermediate' results of individual runs are each described by the full parameter state (variable values) necessary to reproduce (notwithstanding any stochastic element) identical output from that task.

So a 'Scheduler' that operates on a DAG ahead of time is not useful.

In general, most of the existing workflow frameworks I've glanced at, at least in python, appear to be designed to automate many relatively simple tasks in an easily scalable and robust manner with parallelization. Little emphasis is put on incrementally building up more complex analyses: linking complex, expensive computational tasks whose output must be reused when possible and may in turn serve as input for an additional, unforeseen analysis.

I just discovered the 'Prefect' workflow tool this morning, and am intrigued to learn more -- at least it is clearly well funded ;-). My initial sense is that it may be less reliant on pre-scheduling and thus more fluid and more readily adapted to my needs, but that's just a hunch. In many ways some of my more complex 'single' tasks might be well suited to wrapping an entire Prefect Flow, if they played nicely together. It seems my needs sit at the far end of the spectrum of deep, complicated DAGs (I will not try to write mine out!) with never-ending downstream additions.

I'm going to look into Prefect and Luigi more closely and see what I can borrow to make my framework more robust and less baroque. Or maybe I can add a layer of full data description to Prefect...

UPDATE -- after discussing with the Prefect folks, it's clear that I need to start with the underlying Dask and see whether it is flexible enough -- perhaps using Dask delayed or futures. Clearly Dask is extraordinary. Graphchain, built on top of Dask, is a move in the right direction: it facilitates permanent storage of 'intermediate' output computed over a dependency 'chain', identified by a hash of the code base and parameters. That's pretty close to what I need, though I want more explicit handling of the parameters that deterministically define the outputs.
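In case it helps anyone on the same path, this is the sort of Dask-level sketch I have in mind: dask.delayed builds the graph, and a hand-rolled on-disk cache keyed by a hash of each task's parameters decides completeness. The caching decorator is my own stand-in, roughly what Graphchain does more thoroughly by also hashing the code:

    import hashlib
    import json
    import pickle
    from pathlib import Path

    import dask

    CACHE = Path("results_cache")
    CACHE.mkdir(exist_ok=True)

    def cached(func):
        """Run func only if no output exists for this exact parameter state."""
        def wrapper(**params):
            # Upstream results arrive here already computed by dask, so hashing
            # the kwargs hashes the upstream output as a proxy for its parameters.
            key = hashlib.sha256(
                json.dumps({"task": func.__name__, **params}, sort_keys=True).encode()
            ).hexdigest()
            path = CACHE / f"{key}.pkl"
            if path.exists():                  # complete w.r.t. parameter state
                return pickle.loads(path.read_bytes())
            result = func(**params)
            path.write_bytes(pickle.dumps(result))
            return result
        return wrapper

    @dask.delayed
    @cached
    def simulate(alpha, seed):
        return [alpha * i for i in range(10)]  # stand-in for days of simulation

    @dask.delayed
    @cached
    def summarize(data, threshold):
        return sum(x for x in data if x > threshold)

    result = summarize(data=simulate(alpha=0.5, seed=1), threshold=2.0)
    print(result.compute())

Re-running with the same alpha/seed/threshold touches only the cache; changing any of them re-executes just the affected sub-graph, which is the 'complete with respect to parameter state' behavior I described in the question.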

Upvotes: 2
