Xu Shan

Reputation: 315

numba or dask: calling a bash script in parallel from Python

I am a beginner in Python parallel computation, and I want to parallelize a part of my code that consists of for-loops. By default, my code runs a time loop over 3 years. On each day, it calls a bash script run_offline.sh, which runs an executable 9 times (CID 0 to 8), each time with different input data indexed by the loop id. Here is the main part of my Python code demo.py:

import os
import numpy as np
from dateutil.rrule import rrule, HOURLY, DAILY
import datetime
import subprocess
...

start_date = datetime.datetime(2017, 1, 1, 10, 0, 0)
end_date = datetime.datetime(2019, 12, 31, 10, 0, 0)

...
loop_id = 0
for date_assim in rrule(freq=HOURLY,
                        dtstart=start_date,
                        interval=time_delta,
                        until=end_date):
    RESDIR='./results/'
    TYP='experiment_1'
    END_ID = 8
    YYYYMMDDHH = date_assim.strftime('%Y%m%d%H')
    p1 = subprocess.Popen(['./run_offline.sh', str(END_ID), str(loop_id), str(YYYYMMDDHH), RESDIR, TYP])
    p1.wait()

    #%%
    # p1 creates 9 files, RESULTS_0.OUT.nc to RESULTS_8.OUT.nc
    # (for details, see the bash script attached below).
    # The code that follows computes on RESULTS_${CID}.OUT.nc, CID from 0 to 8;
    # all 9 files generated by p1 are used later.
    loop_id += 1

And ./run_offline.sh runs an atmospheric model, offline.exe, 9 times, as follows:

#!/bin/bash
# Usage: ./run_offline.sh END_ID loop_id YYYYMMDDHH RESDIR TYP


END_ID=${1:-1}
loop_id=${2:-1}
YYYYMMDDHH=${3:-1}
RESDIR=${4:-1}
TYP=${5:-1}
END_ID=$((END_ID))
loop_id=$((loop_id))

CID=0
ln -sf PREP_0.nc PREP.nc # one of the required input files; must be named PREP.nc

while [ $CID -le $END_ID ]; do
  cp -f ./OPTIONS.nam_${CID} ./OPTIONS.nam # one of the input files required by offline.exe
  # Each OPTIONS.nam_${CID} carries a different perturbation index:
  # e.g. OPTIONS.nam_1 tells offline.exe to perturb the first variable in the
  # atmospheric model, OPTIONS.nam_2 the second, and so on.
  
  ./offline.exe
  cp RESULTS1.nc RESULTS_${CID}.OUT.nc # for the next part of the Python code in demo.py
  mv RESULTS2.nc $RESDIR/$TYP/RESULTS2_${YYYYMMDDHH}.nc # store this file in my results dir
  CID=$((CID+1))
done

Now I find the loop over offline.exe is very time-consuming: each call to run_offline.sh takes around 10-20 s (i.e. running ./offline.exe 9 times costs 10-20 s). In total that is about 15 s * 365 * 3 = 4.5 hours on average if I run my script for 3 years... So can I parallelize the loop over offline.exe, e.g. assign the runs for different CIDs to different cores/subprocesses on the server? Note that the two input files are forced to be named exactly OPTIONS.nam and PREP.nc every time we run offline.exe... which means we cannot use OPTIONS.nam_x for run x. So can I use dask or numba to help with this parallelization? Thanks!

Upvotes: 1

Views: 173

Answers (2)

Ole Tange

Reputation: 33740

If you cannot make offline.exe use names other than OPTIONS.nam and RESULTS1.nc, you will need to make sure that the parallel instances do not overwrite each other.

One way to do this is to make a dir for each run:

#!/bin/bash
# Usage: ./run_offline.sh END_ID loop_id YYYYMMDDHH RESDIR TYP


END_ID=${1:-1}
loop_id=${2:-1}
YYYYMMDDHH=${3:-1}
RESDIR=${4:-1}
TYP=${5:-1}
END_ID=$((END_ID))
loop_id=$((loop_id))

doit() {
  mkdir "$1"
  cd "$1" || exit 1
  ln -sf ../PREP_$1.nc PREP.nc
  cp -f ../OPTIONS.nam_$1 ./OPTIONS.nam # input file required by offline.exe; now taken from the parent dir
  ../offline.exe
  cp RESULTS1.nc ../RESULTS_$1.OUT.nc # for the next part of the Python code in demo.py
  mv RESULTS2.nc $RESDIR/$TYP/RESULTS2_${YYYYMMDDHH}.nc # store this file in my results dir
}
export -f doit
export RESDIR YYYYMMDDHH TYP
seq 0 $END_ID | parallel doit

Upvotes: 1

Jérôme Richard

Reputation: 50826

If I understand your problem correctly, you run a bash script ~1000 times, the script runs a black-box executable 8-9 times, and this executable is the main bottleneck.

So can I parallize the loop of offline.exe?

This is hard to say because the executable is a black box. You need to check the input/output/temporary data required by the program. For example, if the program stores a temporary file somewhere on the storage device, then calling it in parallel will result in a race condition. Besides, you can only call it in parallel if the computational parts are fully independent. A dataflow analysis is very useful for knowing whether you can parallelize an application (especially when it is composed of multiple programs).

Additionally, you need to check whether the program is already parallel. Running multiple parallel programs concurrently generally results in much slower execution due to the large number of threads to schedule, poor cache usage, bad synchronization patterns, etc.

In your case, I think the best option would be to parallelize the program run in the loop (i.e. offline.exe) itself. Otherwise, if the program is sequential and can safely be run concurrently (see above), then you can launch multiple processes using & in bash and wait for them at the end of the loop. Alternatively, you can use GNU parallel.
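A minimal sketch of the & / wait pattern described above. Since offline.exe is a black box, a placeholder function stands in for one run of it here; each job writes its own output file so the jobs stay independent:

```shell
#!/bin/bash
# work() is a hypothetical stand-in for one run of offline.exe
# with perturbation index $1.
work() {
  echo "result for CID=$1" > "out_$1.txt"
}

for CID in 0 1 2 3; do
  work "$CID" &   # launch this run in the background
done
wait              # block until every background job has finished

cat out_0.txt out_1.txt out_2.txt out_3.txt
```

With the real executable this only works if each background run uses its own working directory, for the file-name reasons discussed below.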

But one should note that two input files OPTIONS.nam and PREP.nc are forced to name as same names when we run offline.exe each time

This can be solved by running the N programs in parallel from N distinct working directories. This is actually safer if the program creates temporary files in its working directory. You need to move/copy the input files into place before the parallel execution, and collect the outputs afterwards.

If the files OPTIONS.nam and/or PREP.nc are modified by the program, then the computation is completely sequential and cannot be parallelized (I assume the computation of each day depends on the previous one, as this is a very common pattern in scientific simulations).

So can I use dask or numba to help this parallelization?

No. Dask and Numba are not meant to be used in this context. They are designed to operate on NumPy arrays in Python code. The part you want to parallelize is in bash, and the parallelized program is apparently not even written in Python.
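If you do want to drive the parallelism from Python rather than bash, the standard library is enough: launch the subprocess calls from a thread pool, one private working directory per run. A minimal sketch, in which a simple `cp` command is a stand-in for ./offline.exe (the real program reads PREP.nc/OPTIONS.nam and writes RESULTS1.nc under exactly those names, in its working directory):

```python
import shutil
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_one(cid: int, workdir: Path) -> Path:
    # Give each run its own directory so the fixed names PREP.nc,
    # OPTIONS.nam and RESULTS1.nc cannot clash between parallel runs.
    d = workdir / f"run_{cid}"
    d.mkdir()
    # Stand-ins for the real inputs (in the real setup: symlink PREP_0.nc,
    # copy OPTIONS.nam_${cid}); the contents here are purely illustrative.
    (d / "PREP.nc").write_text("input data\n")
    (d / "OPTIONS.nam").write_text(f"perturb variable {cid}\n")
    # Placeholder for ./offline.exe: it "reads" its inputs and
    # "writes" RESULTS1.nc in the current working directory.
    subprocess.run(["cp", "OPTIONS.nam", "RESULTS1.nc"], cwd=d, check=True)
    # Collect the per-run result under a unique name, as demo.py expects.
    out = workdir / f"RESULTS_{cid}.OUT.nc"
    shutil.copy(d / "RESULTS1.nc", out)
    return out

workdir = Path(tempfile.mkdtemp())
# Threads (not processes) are fine here: the heavy work happens
# in the child processes started by subprocess.run.
with ThreadPoolExecutor(max_workers=4) as ex:
    results = list(ex.map(run_one, range(9), [workdir] * 9))
print(sorted(p.name for p in results))
```

This keeps demo.py's time loop sequential (each day may depend on the previous one) while running the 9 CID instances of a given day concurrently.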

Upvotes: 1
