Reputation: 11633
I want to run a sequence of experiments, and each experiment will use certain input data files (dependencies) that I want to prepare when the experiment is run. (Some experiments will use the same input data sets, so those won't need to be regenerated during subsequent experiments.)
At first I thought I could do this with one 'master' pipeline that loops over each experiment:
dvc.yaml
stages:
  prepare_data:
    foreach: ${experiment_names}
    do:
      cmd: python stages/prepare_data.py "${item}"
      deps:
        - source_data
        - stages/prepare_data.py
      params:
        - prepare_data
      outs:
        - input_data
  run_simulation:
    foreach: ${experiment_names}
    do:
      cmd: python stages/run_simulation.py "${item}"
      deps:
        - input_data
        - stages/run_simulation.py
      params:
        - run_simulation
      outs:
        - results
The specific source data file used by each experiment may be different; it will be determined in prepare_data.py based on the experiment name I pass to it and some exp_spec.yaml file that it loads, or perhaps from the params.yaml file.
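For concreteness, exp_spec.yaml might map experiment names to source files along these lines (the file names here are just placeholders):
exp_spec.yaml
# each experiment names the source data file it needs;
# several experiments can share the same file
test_exp_1:
  source_data_file: source_data/dataset_a.csv
test_exp_2:
  source_data_file: source_data/dataset_a.csv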
What I'm struggling with is how to register the specific dependencies so that (i) when an experiment requires an input data file that has already been prepared, it isn't regenerated, and (ii) when one of the source data files is changed, only the simulations that use that file are re-run.
Obviously this can't be done in the above dvc.yaml file because it relates to all the experiments.
Do I need to build a separate pipeline for each experiment to register the specific dependencies? If so, can this be done programmatically, or do I need to build them all by hand?
UPDATE
I completed the pipeline by writing simple versions of prepare_data.py, run_simulation.py and params.yaml, and tried to run it with the above dvc.yaml file.
This is the output:
Reproducing experiment 'lobar-snob'
Building workspace index |1.46k [00:00, 3.76kentry/s]
Comparing indexes |1.42k [00:00, 5.59kentry/s]
WARNING: No file hash info found for '/dvc_pipelines/test_pipeline/results'. It won't be created.
WARNING: No file hash info found for '/dvc_pipelines/test_pipeline/input_data'. It won't be created.
Applying changes |0.00 [00:00, ?file/s]
ERROR: output 'input_data' is specified in:
- prepare_data@test_exp_2
- prepare_data@test_exp_1
Use `dvc remove` with any of the above targets to stop tracking the overlapping output.
As I expected, the problem seems to be related to the use of folder names, input_data and results, instead of specific files.
Upvotes: 1
Views: 114
Reputation: 11633
After more research I've concluded that I can't do what I wanted above in one DVC file. I need to separate the two stages into two files, because the prepare_data stage should loop over data files and only the second stage should loop over experiments.
I also discovered that you can pass more information into the foreach loop by making the iteration variable a dictionary rather than a string and using item.attribute to access its attributes. This is a nice way to set the specific input and output dependencies programmatically (a sketch of the corresponding params.yaml follows the two files below).
So I need two dvc files along the following lines:
input_data/dvc.yaml
stages:
  prepare_data:
    foreach: ${input_datasets}
    do:
      cmd: python stages/prepare_data.py "${key}"
      deps:
        - ${item.source_data_file}
        - stages/prepare_data.py
      params:
        - prepare_data
      outs:
        - ${item.input_data_file}
experiments/dvc.yaml
stages:
  run_simulation:
    foreach: ${experiments}
    do:
      cmd: python stages/run_simulation.py "${key}"
      deps:
        - ../input_data/${item.input_data_file}
        - stages/run_simulation.py
      params:
        - run_simulation
      outs:
        - results/${key}
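The dictionaries that drive the two foreach loops come from the variables each dvc.yaml can see (the params.yaml next to it, or a file listed under vars). A minimal sketch of the structure these templates assume, with placeholder dataset names and paths:
input_data/params.yaml
# keys become ${key}, nested fields are read as ${item.<field>}
input_datasets:
  dataset_a:
    source_data_file: ../source_data/dataset_a_raw.csv
    input_data_file: dataset_a.parquet
  dataset_b:
    source_data_file: ../source_data/dataset_b_raw.csv
    input_data_file: dataset_b.parquet
experiments/params.yaml
experiments:
  test_exp_1:
    input_data_file: dataset_a.parquet
  test_exp_2:
    input_data_file: dataset_b.parquet
With foreach over a dictionary, ${key} expands to the dictionary key (the dataset or experiment name passed to the script) and ${item.input_data_file} to the corresponding value, so each generated stage gets its own deps and outs. The prepare_data and run_simulation sections referenced under params: also live in these files.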
What I've learned is that DVC is not intended for lazy evaluation. Its main goal is to propagate changes to all downstream nodes of the graph and to avoid unnecessary re-evaluation of stages that have not changed.
The compromise with the above solution is that all the specified input data files will be generated, regardless of whether they are eventually used in the run_simulation stage or not.
But this way the dependency graph will be built, and any change to an input data file will trigger the re-evaluation of all simulations that depend on it.
Upvotes: 0
Reputation: 6349
From the Run Cache documentation:
Every time you run a pipeline with DVC, it logs the unique signature of each stage run (in .dvc/cache/runs). If it never happened before, its command(s) are executed normally. Every subsequent time a stage runs under the same conditions, the previous results can be restored instantly — without wasting time or computing resources.
Is it relevant to your case? Could you check if it is already working?
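A quick way to check, once the overlapping-output error above is resolved: run the pipeline twice under the same conditions and see whether the second run re-executes any stage.
dvc repro   # first run: stages execute and their signatures are logged under .dvc/cache/runs
dvc repro   # second run, nothing changed: results should be restored from the run cache instead of re-running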
Upvotes: 0