Creating a data analysis pipeline using (GNU) make

Question

I'm a scientist analyzing brain data collected from multiple subjects. During the analysis, the data is processed through multiple steps, a bit like a cooking recipe. At the end of the line, there is a step that collects the processed data for all the individual subjects and creates summary statistics and so forth.

As a single step can take up to an hour to complete, I would like to have an automated way to run all the steps for all subjects and compute the summary statistics, without repeating steps that have already completed in the past.

Make seems like a good utility to use, but I need some help with the structure of the Makefile. Here is a simplified example:

# Keep intermediate files!
.SECONDARY:

# In this simplified example there are 3 subjects; in reality there are more.
SUBJECTS = subject_a subject_b subject_c

# In this simplified example there are 3 data processing steps, each one taking
# one file as input and emitting one file as output. In reality, there are more
# steps and each step takes multiple input files and emits multiple output
# files.
step1_%.dat : step1.py input_%.dat
    touch step1_$*.dat

step2_%.dat : step2.py step1_%.dat
    touch step2_$*.dat

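# Let's say this step produces many output files. The % in $(STEP3_PROD) is
# not substituted inside a recipe, so the stem ($*) is filled in explicitly.
STEP3_PROD = step3_%_1.dat step3_%_2.dat step3_%_3.dat
$(STEP3_PROD) : step3.py step2_%.dat
    touch $(subst %,$*,$(STEP3_PROD))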

# Meta rule to perform the complete analysis for a single subject
.PHONY : $(SUBJECTS)
subject_% : step1_%.dat step2_%.dat $(STEP3_PROD)
    @echo 'Analysis complete for subject $*.'

# The summary depends on the analysis of all subjects being complete.
summary.dat : summary.py $(SUBJECTS)
    touch summary.dat
    @echo 'All analysis done!'

all : summary.dat

The problem with the above Makefile is that the summary step (python summary.py) is always performed, even when nothing has changed. This is probably because summary.dat depends on the phony subject_% targets, which make always treats as out of date.

Is there a way to structure this Makefile so that the summary step is not performed unnecessarily? Perhaps there is some way to expand $(STEP3_PROD) for all subjects?

keltar · Accepted Answer

Don't overcomplicate things or they will backfire. Try something like:

.SECONDARY:

all: summary.dat

SUBJECTS:=a b c
SUBJECT_RULES:=$(addprefix subject_, $(SUBJECTS))
.PHONY: $(SUBJECT_RULES)

subject_a: step3_a_1.dat
subject_b: step3_b_1.dat
subject_c: step3_c_1.dat

step1_%.dat: input_%.dat
    touch $@

step2_%.dat: step1_%.dat
    touch $@

step3_%_1.dat: step2_%.dat
    touch $@

STEP3_PRE:=$(addprefix step3_, $(SUBJECTS))
STEP3_1_OUT:=$(addsuffix _1.dat, $(STEP3_PRE))
STEP3_ALL_OUT:=$(STEP3_1_OUT) \
    $(addsuffix _2.dat, $(STEP3_PRE)) \
    $(addsuffix _3.dat, $(STEP3_PRE))

summary.dat: $(STEP3_1_OUT)
    @echo "summary: $(STEP3_ALL_OUT)"
    touch $@

I see no need to track step3_%_2.dat and step3_%_3.dat, since they're rebuilt together with step3_%_1.dat anyway.
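
If you would rather have make track all three step-3 outputs, and avoid writing the subject_* rules by hand, something along these lines might work. Treat it as a sketch under a few assumptions: the file names are the ones from the question, the nested $(foreach ...) over 1 2 3 is just my shorthand for the three outputs, and the real step-3 command is assumed to write all three files in a single run. Recipe lines must start with a tab character.

```makefile
# Sketch only: STEP3_ALL_OUT and the static pattern rule are illustrative.
.SECONDARY:

SUBJECTS := a b c

# Expand the three step-3 outputs for every subject:
# step3_a_1.dat step3_a_2.dat step3_a_3.dat step3_b_1.dat ...
STEP3_ALL_OUT := $(foreach s,$(SUBJECTS),$(foreach n,1 2 3,step3_$(s)_$(n).dat))

all: summary.dat

# Static pattern rule: this is an explicit rule, so it still applies to
# phony targets (make skips implicit pattern-rule search for .PHONY targets).
SUBJECT_RULES := $(addprefix subject_,$(SUBJECTS))
.PHONY: all $(SUBJECT_RULES)
$(SUBJECT_RULES): subject_%: step3_%_1.dat step3_%_2.dat step3_%_3.dat

step1_%.dat: input_%.dat
	touch $@

step2_%.dat: step1_%.dat
	touch $@

# A pattern rule with several targets is treated as one grouped rule:
# a single run of the recipe is expected to create all three files.
step3_%_1.dat step3_%_2.dat step3_%_3.dat: step2_%.dat
	touch step3_$*_1.dat step3_$*_2.dat step3_$*_3.dat

# Depend on real files only, never on the phony subject_* targets.
summary.dat: $(STEP3_ALL_OUT)
	touch $@
```

With input_a.dat, input_b.dat and input_c.dat present, the first make should run every step, and a second make should find nothing to do, because summary.dat now depends only on real files. make subject_b remains available as a shortcut for processing a single subject.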
