wlad

Reputation: 2183

Make "make" rerun only the parts of a pipeline that come after what has changed

I'm using make (GNU Make on WSL, I believe) for data science, building on cookiecutter-datascience.

Before I started using make, my understanding was that its purpose is to track which parts of a pipeline have changed and to rerun only the stages that come after them.

Here's a snippet from the makefile:

## Install Python Dependencies
requirements: test_environment
    pip install -U pip setuptools wheel
    pip install -r requirements.txt

## Make Dataset
data: requirements
    $(PYTHON_INTERPRETER) src/data/make_dataset.py

When I run make data, it doesn't just rerun the data stage and the stages that follow it; it also reruns make requirements and make test_environment. This is the opposite of what I want: those stages come before, not after. If I have an expensive pipeline, I obviously don't want to rerun all of it over and over again.

What I want is this: if one of the raw (un-preprocessed) data files changes, make should rerun the data preprocessing. That should not involve checking whether the libraries have changed, because those steps logically precede the data preprocessing.

Upvotes: 0

Views: 179

Answers (1)

Renaud Pacalet

Reputation: 29240

You can try this:

## Install Python Dependencies
requirements.done: test_environment.done
    pip install -U pip setuptools wheel
    pip install -r requirements.txt
    touch requirements.done

## Make Dataset
data: requirements.done
    $(PYTHON_INTERPRETER) src/data/make_dataset.py

Make compares the last-modification times of files. Your requirements, test_environment... are not files; they are what are called "phony" targets. Since they never exist as files, make rebuilds them every time they are needed. If you want make to detect that something is up to date and does not need to be rebuilt, you must use files. The proposed solution uses empty dummy files (the *.done files) instead of your phony targets. These files serve only to record, in their last-modification times, when each action was last run.

Of course, you can use files named requirements, test_environment... if you prefer. The .done extension is just a way to identify these files as dummy markers.
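
Note that in the snippet above, data itself is still not a file, so its recipe would run on every make data. Here is a minimal sketch of how the same idea could be extended to the data stage so that it reruns only when a raw file changes. The paths data/raw/*.csv and data/processed/.done are assumptions for illustration, not something from your project, and the test_environment step is left out for brevity. The part after the | is a GNU Make order-only prerequisite, which seems to match your wish that library changes alone should not trigger preprocessing:

```makefile
# Sketch only: data/raw/*.csv and data/processed/.done are assumed paths.
# Recipe lines must start with a tab character, not spaces.
PYTHON_INTERPRETER ?= python3
RAW_FILES := $(wildcard data/raw/*.csv)

# Stamp file: the install reruns only when requirements.txt is newer
# than requirements.done (or the stamp does not exist yet).
requirements.done: requirements.txt
	$(PYTHON_INTERPRETER) -m pip install -r requirements.txt
	touch $@

# Preprocessing reruns only when a raw file is newer than the stamp.
# requirements.done is order-only (after "|"): it is brought up to date
# first, but its timestamp alone never forces this target to rebuild.
data/processed/.done: $(RAW_FILES) | requirements.done
	$(PYTHON_INTERPRETER) src/data/make_dataset.py
	mkdir -p $(@D)
	touch $@

# Convenience alias so that "make data" keeps working.
.PHONY: data
data: data/processed/.done
```

With this layout, running make data twice in a row should do nothing the second time, and touching one of the raw files should rerun only the preprocessing recipe. If make_dataset.py already writes a single well-defined output file, that file could be used as the target instead of the .done stamp.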

Upvotes: 1
