ZachTurn

Reputation: 655

Running tasks in parallel with Makefile

I'm having issues structuring my Makefile to run my shell scripts in the desired order.

Here is my current Makefile:

## Create data splits
raw_data: src/data/get_data.sh
    src/data/get_data.sh
    hadoop fs -cat data/raw/target/* >> data/raw/target.csv
    hadoop fs -cat data/raw/control/* >> data/raw/control.csv
    hadoop fs -rm -r -f data/raw
    touch raw_data_loaded

split_data: raw_data_loaded
    rm -rf data/interim/splits
    mkdir data/interim/splits    
    $(PYTHON_INTERPRETER) src/data/split_data.py

## Run Models
random_forest: split_data
    nohup $(PYTHON_INTERPRETER) src/models/random_forest.py > random_forest & 

under_gbm: split_data
    nohup $(PYTHON_INTERPRETER) src/models/undersampled_gbm.py > under_gbm &

full_gbm: split_data
    nohup $(PYTHON_INTERPRETER) src/models/full_gbm.py > full_gbm &

# Create predictions from model files
predictions: random_forest under_gbm full_gbm
    nohup $(PYTHON_INTERPRETER) src/models/predictions.py > predictions &

The Problem

Everything works OK until I reach the ## Run Models section. These are all independent scripts, which can all run once split_data is finished. I want to run each of the 3 model scripts simultaneously, so I run each in the background with &.

The problem is that my last task, predictions, begins to run at the same time as the three preceding tasks. What I want is for the 3 simultaneous model scripts to finish first, and only then for predictions to run.

My Attempt

My proposed solution is to run my final model task, full_gbm, without the &, so that predictions doesn't run until it finishes. This should work, but I'm wondering if there is a less 'hacky' way to achieve this -- is there some way to structure the targets to get the same result?

Upvotes: 1

Views: 1385

Answers (1)

Toby Speight

Reputation: 30968

You don't say which implementation of Make you're using. If it's GNU Make, you can invoke it with the -j option to let it decide which jobs to run in parallel. Then you can remove the nohup and & from all the commands; predictions won't start until random_forest, under_gbm, and full_gbm have all completed, and the build itself won't end until predictions has completed.

Also, you won't lose the all-important exit status of the commands.
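For illustration, here is a sketch of the Run Models section with the nohup and & removed, assuming GNU Make (the target names and script paths are the ones from the question):

```make
## Run Models -- no nohup or &; make -j handles the parallelism
random_forest: split_data
	$(PYTHON_INTERPRETER) src/models/random_forest.py > random_forest

under_gbm: split_data
	$(PYTHON_INTERPRETER) src/models/undersampled_gbm.py > under_gbm

full_gbm: split_data
	$(PYTHON_INTERPRETER) src/models/full_gbm.py > full_gbm

# predictions still lists all three models as prerequisites, so Make
# won't start it until every one of them has exited successfully
predictions: random_forest under_gbm full_gbm
	$(PYTHON_INTERPRETER) src/models/predictions.py > predictions
```

Invoked as make -j3 predictions (or plain make -j to place no limit on the number of simultaneous jobs), the three model recipes run concurrently and predictions waits for them, because the ordering is expressed through the prerequisite list rather than through backgrounding.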

Upvotes: 2
