Chris Betti

GNU Make dependencies for data processing

I'm trying to set up an ETL system using GNU Make 3.81. The idea is to transform and load only what is necessary after a change to my source data.

My project's directory layout looks like this:

${SCRIPTS}/        <- transform & load scripts
${DATA}/incoming/  <- storage for extracted data
${DATA}/processed/ <- transformed, soon-to-be-loaded data

My ${TRANSFORM_SCRIPTS}/Makefile is filled with statements like this:

A_step_1: ${SCRIPTS}/A/do_step_1.sh ${DATA}/incoming/A_files/*
        ${SCRIPTS}/A/do_step_1.sh ${DATA}/incoming/A_files/* > ${DATA}/processed/A.step_1

A_step_2: ${SCRIPTS}/A/do_step_2.sh ${DATA}/processed/A.step_1
        ${SCRIPTS}/A/do_step_2.sh ${DATA}/processed/A.step_1 > ${DATA}/processed/A.step_2

B_step_1: ${SCRIPTS}/B/do_step_1.sh ${DATA}/incoming/B_files/*
        ${SCRIPTS}/B/do_step_1.sh ${DATA}/incoming/B_files/* > ${DATA}/processed/B.step_1

B_step_2: ${SCRIPTS}/B/do_step_2.sh ${DATA}/processed/B.step_1
        ${SCRIPTS}/B/do_step_2.sh ${DATA}/processed/B.step_1 > ${DATA}/processed/B.step_2

joined: A_step_2 B_step_2
        join ${DATA}/processed/A.step_2 ${DATA}/processed/B.step_2 > ${DATA}/processed/joined

Calling `make joined' successfully produces the "joined" file I need, but it rebuilds every file every time, despite there being no changes to the dependency files.

I tried using the output file names as targets, but GNU Make doesn't seem to know how to cope:

${DATA}/processed/B.step_2: ${SCRIPTS}/B/do_step_2.sh ${DATA}/processed/B.step_1
        ${SCRIPTS}/B/do_step_2.sh ${DATA}/processed/B.step_1 > ${DATA}/processed/B.step_2

Any suggestions other than dropping the output of each process in the current working directory? Make seems like a reasonable tool for this work because, in reality, there are tens of data sources and close to 100 steps altogether, and managing dependencies myself via script files is becoming too difficult.

Answers (1)

Ahmed Masud

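Targets such as A_step_2 and joined never exist as files, so Make always considers them out of date and rebuilds them. You can do one of two things:

Either make the target the actual output file, and list the real output files as its dependencies, with something like: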

  JOINED=${DATA}/processed/joined

  $(JOINED): ${DATA}/processed/A.step_2 ${DATA}/processed/B.step_2
	join $^ > $@
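
Carried through the whole chain, that first option might look like the sketch below. It is only an illustration under a few assumptions: the default values for SCRIPTS and DATA and the OUT variable are made up here (the question does not show how they are set), the processed/ directory is assumed to exist already, and every recipe line must start with a real tab.

```makefile
# Assumed defaults; the question does not show how SCRIPTS and DATA are set.
SCRIPTS ?= scripts
DATA    ?= data
# OUT is a hypothetical shorthand for the output directory (must already exist).
OUT     := $(DATA)/processed

# Delete a half-written output file if its recipe fails.
.DELETE_ON_ERROR:

# Short alias so that `make joined` keeps working.
.PHONY: joined
joined: $(OUT)/joined

# In each step the first prerequisite is the script and the rest are its
# inputs, so "$^ > $@" runs the script on the inputs and writes the target.
$(OUT)/A.step_1: $(SCRIPTS)/A/do_step_1.sh $(wildcard $(DATA)/incoming/A_files/*)
	$^ > $@

$(OUT)/A.step_2: $(SCRIPTS)/A/do_step_2.sh $(OUT)/A.step_1
	$^ > $@

$(OUT)/B.step_1: $(SCRIPTS)/B/do_step_1.sh $(wildcard $(DATA)/incoming/B_files/*)
	$^ > $@

$(OUT)/B.step_2: $(SCRIPTS)/B/do_step_2.sh $(OUT)/B.step_1
	$^ > $@

$(OUT)/joined: $(OUT)/A.step_2 $(OUT)/B.step_2
	join $^ > $@
```

Because every target is now the file its recipe writes, a second `make joined` with unchanged inputs should do nothing. One caveat with the wildcard: removing a file from incoming/ does not change any timestamp Make looks at, so it will not trigger a rebuild by itself.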

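Or keep your targets as they are and end each recipe with

  touch $@

which creates an empty marker file named after the target, giving Make a timestamp to compare. For example: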

A_step_2: ${SCRIPTS}/A/do_step_2.sh ${DATA}/processed/A.step_1
	${SCRIPTS}/A/do_step_2.sh ${DATA}/processed/A.step_1 > ${DATA}/processed/A.step_2 && touch $@ || { $(RM) $@; exit 1; }

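Do the same for every step, including the joined step.

This works, but it is ugly: it leaves one empty marker file per step in the working directory.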
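
If you do go the marker-file route, a possible variation is to collect the markers in one directory so they stay out of the way. The sketch below shows only the two A steps; the .stamps directory name and the default values for SCRIPTS and DATA are assumptions made for illustration, not part of the original layout.

```makefile
# Assumed defaults; the question does not show how SCRIPTS and DATA are set.
SCRIPTS ?= scripts
DATA    ?= data
# Hypothetical directory that holds the marker files.
STAMPS := .stamps

# "| $(STAMPS)" is an order-only prerequisite: the directory must exist,
# but its timestamp never forces a rebuild, and it is not part of $^.
$(STAMPS)/A_step_1: $(SCRIPTS)/A/do_step_1.sh $(wildcard $(DATA)/incoming/A_files/*) | $(STAMPS)
	$^ > $(DATA)/processed/A.step_1
	touch $@

# Depends on the previous step's marker rather than on its output file.
$(STAMPS)/A_step_2: $(SCRIPTS)/A/do_step_2.sh $(STAMPS)/A_step_1 | $(STAMPS)
	$< $(DATA)/processed/A.step_1 > $(DATA)/processed/A.step_2
	touch $@

$(STAMPS):
	mkdir -p $@
```

Each recipe line runs in its own shell and Make stops at the first failing line, so the marker is only touched when the script succeeded; no `&& ... ||` chain is needed.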
