Bill

Reputation: 11633

How to register dependencies programmatically in a Python DVC pipeline

I want to run a sequence of experiments, and each experiment will use certain input data files (dependencies), each of which I want to prepare when the experiment is run. (Some experiments will use the same input data sets, so those won't need to be regenerated during subsequent experiments.)

At first I thought I could do this with one 'master' pipeline that loops over each experiment:

dvc.yaml

stages:
  prepare_data:
    foreach: ${experiment_names}
    do:
      cmd: python stages/prepare_data.py "${item}"
      deps:
        - source_data
        - stages/prepare_data.py
      params:
        - prepare_data
      outs:
        - input_data
  run_simulation:
    foreach: ${experiment_names}
    do:
      cmd: python stages/run_simulation.py "${item}"
      deps:
        - input_data
        - stages/run_simulation.py
      params:
        - run_simulation
      outs:
        - results

The specific source data file used by each experiment may be different. It is determined in prepare_data.py from the experiment name, which I pass to it, together with some exp_spec.yaml file that it will load, or perhaps from the params.yaml file.
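
For example, the params.yaml file (or an exp_spec.yaml that prepare_data.py loads) might look something like this, where the source_data_files key and the paths are just placeholders:

params.yaml

# hypothetical example: the source_data_files mapping and the paths are placeholders
experiment_names:
  - test_exp_1
  - test_exp_2

source_data_files:
  test_exp_1: source_data/measurements_a.csv
  test_exp_2: source_data/measurements_a.csv  # same source data, prepared once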

What I'm struggling with is how to register the specific dependencies so that (i) when an experiment requires an input data file that has already been prepared, it isn't regenerated, and (ii) when one of the source data files changes, only the simulations that use that file are re-run.

Obviously this can't be done in the dvc.yaml file above, because it applies to all the experiments.

Do I need to build a separate pipeline for each experiment to register its specific dependencies? If so, can this be done programmatically, or do I need to build them all by hand?

UPDATE

I completed the pipeline by writing simple versions of prepare_data.py, run_simulation.py and params.yaml, and tried to run it with the dvc.yaml file above.

This is the output:

Reproducing experiment 'lobar-snob'                                                                                                                                
Building workspace index                                                                                                               |1.46k [00:00, 3.76kentry/s]
Comparing indexes                                                                                                                      |1.42k [00:00, 5.59kentry/s]
WARNING: No file hash info found for '/dvc_pipelines/test_pipeline/results'. It won't be created.                                  
WARNING: No file hash info found for '/dvc_pipelines/test_pipeline/input_data'. It won't be created.                               
Applying changes                                                                                                                         |0.00 [00:00,     ?file/s]
ERROR: output 'input_data' is specified in:                           
        - prepare_data@test_exp_2
        - prepare_data@test_exp_1
Use `dvc remove` with any of the above targets to stop tracking the overlapping output.

As I expected, the problem seems to be related to using the folder names input_data and results as outputs, instead of experiment-specific files.

Upvotes: 1

Views: 114

Answers (2)

Bill

Reputation: 11633

After more research, I've concluded that I can't do what I wanted above in one DVC file. I need to separate the two stages into two files, because the prepare_data stage loops over data files while only the second stage loops over experiments.

I also discovered that you can pass more information into the foreach loop by making the iteration variable a dictionary rather than a string and using ${item.attribute} to access its attributes. This is a nice way to set the specific input and output dependencies programmatically.

So I need two dvc.yaml files along the following lines:

input_data/dvc.yaml

stages:
  prepare_data:
    foreach: ${input_datasets}
    do:
      cmd: python stages/prepare_data.py "${key}"
      deps:
        - ${item.source_data_file}
        - stages/prepare_data.py
      params:
        - prepare_data
      outs:
        - ${item.input_data_file}
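
The ${input_datasets} dictionary would live in the params.yaml (or a vars: section) next to this file. As a rough sketch, with placeholder keys and paths:

input_data/params.yaml

# hypothetical example values; adjust the names and paths to the project
input_datasets:
  dataset_a:
    source_data_file: ../source_data/raw_a.csv
    input_data_file: dataset_a.parquet
  dataset_b:
    source_data_file: ../source_data/raw_b.csv
    input_data_file: dataset_b.parquet

Each ${key} (dataset_a, dataset_b) is passed to prepare_data.py, and ${item.source_data_file} / ${item.input_data_file} become that stage's specific dependency and output.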

experiments/dvc.yaml

stages:
  run_simulation:
    foreach: ${experiments}
    do:
      cmd: python stages/run_simulation.py "${key}"
      deps:
        - ../input_data/${item.input_data_file}
        - stages/run_simulation.py
      params:
        - run_simulation
      outs:
        - results/${key}
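
Similarly, ${experiments} maps each experiment name to the prepared input data file it depends on. Again just a sketch with placeholder values:

experiments/params.yaml

# hypothetical example values
experiments:
  test_exp_1:
    input_data_file: dataset_a.parquet
  test_exp_2:
    input_data_file: dataset_b.parquet

Two experiments can point at the same input_data_file, so a shared dataset is prepared only once and re-used.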

What I've learned is that DVC is not intended for lazy evaluation. Its main goal is to propagate changes to all downstream nodes of the graph and to avoid unnecessary re-evaluation of stages that have not changed.

The compromise with the above solution is that all the specified input data files will be generated, regardless of whether they are eventually used in the run_simulation stage or not.

But this way the dependency graph will be built, and any change to an input data file will trigger the re-evaluation of all simulations that depend on it.

Upvotes: 0

Shcheklein

Reputation: 6349

From the Run Cache documentation:

Every time you run a pipeline with DVC, it logs the unique signature of each stage run (in .dvc/cache/runs). If it never happened before, its command(s) are executed normally. Every subsequent time a stage runs under the same conditions, the previous results can be restored instantly — without wasting time or computing resources.

Is this relevant to your case? Could you check whether it is already working?

Upvotes: 0
