Reputation: 315
I am a beginner in parallel computing with Python. I want to parallelize one part of my code, which consists of for-loops. By default, my code runs a time for-loop over 3 years. For each day, my code calls a bash script, run_offline.sh, which runs the model once for each CID from 0 to END_ID = 8. On each call, the bash script is given different input data, indexed by the loop id. Here is the main part of my Python code, demo.py:
import os
import numpy as np
from dateutil.rrule import rrule, HOURLY, DAILY
import datetime
import subprocess
...
start_date = datetime.datetime(2017, 1, 1, 10, 0, 0)
end_date = datetime.datetime(2019, 12, 31, 10, 0, 0)
...
loop_id = 0
for date_assim in rrule(freq=HOURLY,
                        dtstart=start_date,
                        interval=time_delta,
                        until=end_date):
    RESDIR='./results/'
    TYP='experiment_1'
    END_ID = 8
    YYYYMMDDHH = date_assim.strftime('%Y%m%d%H')
    p1 = subprocess.Popen(['./run_offline.sh', str(END_ID), str(loop_id), str(YYYYMMDDHH), RESDIR, TYP])
    p1.wait()
    #%%
    # p1 creates 9 files from RESULTS_0.nc to RESULTS_8.nc
    # details please see the bash script attached below
    # following are codes computing based on RESULTS_${CID}.nc, CID from 0 to 8.
    # In total 9 files are generated by p1 and used later.
    loop_id += 1
The script ./run_offline.sh runs an atmospheric model, offline.exe, 9 times, as follows:
#!/bin/bash
# Usage: ./run_offline.sh END_ID loop_id YYYYMMDDHH RESDIR TYP
END_ID=${1:-1}
loop_id=${2:-1}
YYYYMMDDHH=${3:-1}
RESDIR=${4:-1}
TYP=${5:-1}
END_ID=`echo $((END_ID))`
loop_id=`echo $((loop_id))`
CID=0
ln -sf PREP_0.nc PREP.nc # one of the input file required. Must named by PREP.nc
while [ $CID -le $END_ID ]; do
    cp -f ./OPTIONS.nam_${CID} ./OPTIONS.nam # one of the input file required by offline.exe
    # different ./OPTIONS.nam_${CID} has different index of a perturbation.
    # Say ./OPTIONS.nam_1 lets the offline.exe knows it should perturb the first variable in the atmospheric model,
    # ./OPTIONS.nam_2 perturbs the second variable...
    ./offline.exe
    cp RESULTS1.nc RESULTS_${CID}.OUT.nc # for next part of python codes in demo.py
    mv RESULTS2.nc $RESDIR/$TYP/RESULTS2_${YYYYMMDDHH}.nc # store this file in my results dir
    CID=$((CID+1))
done
I have found that the loop over offline.exe is very time-consuming: each call to run_offline.sh (that is, running ./offline.exe 9 times) takes around 10-20 s. In total, running my script over 3 years costs on average about 15 s * 365 * 3 ≈ 4.5 hours. So can I parallelize the loop over offline.exe, say by assigning the runs for different CID values to different cores/subprocesses on the server? Note, however, that the two input files OPTIONS.nam and PREP.nc must have exactly these names every time offline.exe runs, which means we cannot use OPTIONS.nam_x directly for run x. So can I use dask or numba to help with this parallelization? Thanks!
Upvotes: 1
Reputation: 33740
If you cannot make offline.exe use names other than OPTIONS.nam and RESULTS1.nc, you need to make sure that the parallel instances do not overwrite each other's files.
One way to do this is to make a directory for each run:
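#!/bin/bash
# Usage: ./run_offline.sh END_ID loop_id YYYYMMDDHH RESDIR TYP
END_ID=${1:-1}
loop_id=${2:-1}
YYYYMMDDHH=${3:-1}
RESDIR=${4:-1}
TYP=${5:-1}
END_ID=$((END_ID))
loop_id=$((loop_id))
RESDIR=$(realpath "$RESDIR") # absolute path, so it still resolves after the cd below
doit() {
    # Run one instance of offline.exe in its own directory, named after the CID ($1)
    mkdir -p "$1"
    cd "$1" || return 1
    ln -sf ../PREP_0.nc PREP.nc # the question's script links PREP_0.nc for every run
    cp -f ../OPTIONS.nam_$1 ./OPTIONS.nam # one of the input files required by offline.exe
    ../offline.exe
    cp RESULTS1.nc ../RESULTS_$1.OUT.nc # for next part of python codes in demo.py
    # As in the question's script, every run writes the same target name here
    mv RESULTS2.nc "$RESDIR/$TYP/RESULTS2_${YYYYMMDDHH}.nc" # store this file in my results dir
}
export -f doit
export RESDIR YYYYMMDDHH TYP
seq 0 $END_ID | parallel doit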
Upvotes: 1
Reputation: 50826
If I understand your problem correctly, you run a bash script ~1000 times, each run of the script calls a black-box executable 8 to 9 times, and this executable is the main bottleneck.
So can I parallelize the loop of offline.exe?
This is hard to say because the executable is a black box. You need to check the input, output, and temporary data used by the program. For example, if the program stores temporary files at a fixed location on the storage device, then calling it in parallel will result in a race condition. Besides, you can only call it in parallel if the computational parts are fully independent. A dataflow analysis is very useful for determining whether an application can be parallelized (especially when it is composed of multiple programs).
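One rough way to start such a check is to run the program once in a scratch directory and see which files it leaves behind. The sketch below is only an illustration: new_files_after is a hypothetical helper, and the command in the demo is a stand-in for offline.exe. It only sees the working directory and only files that still exist when the program exits, so temporary files written elsewhere or deleted on exit go unnoticed.

```python
import os
import subprocess
import sys
import tempfile


def new_files_after(command, workdir):
    """Run `command` in `workdir` and return the names of files it created there."""
    before = set(os.listdir(workdir))
    subprocess.run(command, cwd=workdir, check=True)
    return sorted(set(os.listdir(workdir)) - before)


if __name__ == "__main__":
    # Stand-in for offline.exe (assumption): a tiny command that writes two files.
    fake_model = [
        sys.executable,
        "-c",
        "open('RESULTS1.nc', 'w').write('x'); open('scratch.tmp', 'w').write('y')",
    ]
    with tempfile.TemporaryDirectory() as scratch:
        print(new_files_after(fake_model, scratch))
```

Any unexpected name in the result is a file that parallel instances sharing one directory would fight over.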
Additionally, you need to check whether the program is already parallel. Running multiple parallel programs at the same time generally results in much slower execution due to the large number of threads to schedule, poor cache usage, bad synchronization patterns, etc.
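A simple heuristic for this check, assuming a Unix-like system, is to compare the CPU time a single run consumes with its wall-clock time. The function name cpu_to_wall_ratio below is hypothetical, and the command in the demo is a stand-in for offline.exe. A ratio close to or below 1 suggests a sequential program; a ratio well above 1 suggests it already uses several cores.

```python
import os
import subprocess
import sys
import time


def cpu_to_wall_ratio(command):
    """Run `command` once and return (CPU time of the child) / (wall-clock time)."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    subprocess.run(command, check=True)
    cpu_end = os.times()
    wall = time.perf_counter() - wall_start
    # children_user/children_system accumulate CPU time of finished child processes.
    cpu = (cpu_end.children_user - cpu_start.children_user) + (
        cpu_end.children_system - cpu_start.children_system
    )
    return cpu / wall


if __name__ == "__main__":
    # Stand-in for ./offline.exe (assumption): a short single-threaded computation.
    print(cpu_to_wall_ratio([sys.executable, "-c", "sum(range(10**7))"]))
```

The measurement includes process start-up and I/O waits, so treat the number as a hint rather than proof.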
In your case, I think the best option would be to parallelize the program that runs in the loop (i.e. offline.exe). Otherwise, if the program is sequential and its runs can be executed in parallel (see above), you can start multiple processes using & in bash and then wait for them at the end of the loop. Alternatively, you can use GNU parallel.
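Since the script is driven from Python anyway, the same start-then-wait pattern can also be written with subprocess. This is a minimal sketch, not the asker's code: run_concurrently is a hypothetical name and the commands in the demo are stand-ins. On its own it does nothing about shared file names, so it is only safe once the runs no longer write to the same files.

```python
import subprocess
import sys


def run_concurrently(commands):
    """Start every command at once, then wait for all of them; return exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]  # like `cmd &` in bash
    return [proc.wait() for proc in procs]  # like `wait` in bash


if __name__ == "__main__":
    # Stand-ins for the 9 model runs (assumption): each just prints its CID.
    cmds = [[sys.executable, "-c", f"print('run {cid}')"] for cid in range(9)]
    print(run_concurrently(cmds))
```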
But one should note that two input files OPTIONS.nam and PREP.nc are forced to name as same names when we run offline.exe each time
This can be solved by running the N instances of the program in parallel from N distinct working directories. This is actually safer if the program creates temporary files in its working directory. You need to move or copy the files before the parallel execution, and certainly after it.
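A minimal Python sketch of this idea follows, under several assumptions: the inputs are named OPTIONS.nam_<cid> and PREP_0.nc as in the question, the program writes RESULTS1.nc into its working directory, and the directory names run_<cid> as well as the functions run_in_own_dir and run_all are made up for illustration. Threads are enough here because they only wait on the child processes.

```python
import shutil
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_in_own_dir(cid, base, command):
    """Stage the inputs for run `cid` in base/run_<cid>, run `command` there, collect output."""
    base = Path(base).resolve()
    workdir = base / f"run_{cid}"
    workdir.mkdir(exist_ok=True)
    # Private copies under the fixed names the program expects.
    shutil.copy(base / f"OPTIONS.nam_{cid}", workdir / "OPTIONS.nam")
    shutil.copy(base / "PREP_0.nc", workdir / "PREP.nc")
    subprocess.run(command, cwd=workdir, check=True)
    # Copy the result back under a per-run name, as the bash script does.
    shutil.copy(workdir / "RESULTS1.nc", base / f"RESULTS_{cid}.OUT.nc")
    return cid


def run_all(base, command, end_id, max_workers=4):
    """Run CIDs 0..end_id, at most `max_workers` at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        jobs = pool.map(lambda cid: run_in_own_dir(cid, base, command), range(end_id + 1))
        return list(jobs)


if __name__ == "__main__":
    # Stand-in for offline.exe (assumption): copies OPTIONS.nam into RESULTS1.nc.
    fake_model = [sys.executable, "-c",
                  "open('RESULTS1.nc', 'w').write(open('OPTIONS.nam').read())"]
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "PREP_0.nc").write_text("prep")
        for cid in range(3):
            (base / f"OPTIONS.nam_{cid}").write_text(f"perturb {cid}")
        print(run_all(base, fake_model, end_id=2))
```

For the real model, command would be the absolute path to offline.exe, since each run starts in a different directory.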
If the files OPTIONS.nam and/or PREP.nc are modified by the program, then it means the computation is completely sequential and cannot be parallelized (I assume the computation for each day depends on the previous one, as this is a very common pattern in scientific simulations).
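Whether the program rewrites its inputs can be tested directly by hashing them before and after one run. The sketch below assumes the inputs sit in the working directory under the names given; inputs_modified_by is a hypothetical helper and the command in the demo is a stand-in for offline.exe.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def inputs_modified_by(command, workdir, input_names):
    """Run `command` in `workdir`; return the input names whose content changed."""
    workdir = Path(workdir)
    before = {name: sha256_of(workdir / name) for name in input_names}
    subprocess.run(command, cwd=workdir, check=True)
    return [name for name in input_names if sha256_of(workdir / name) != before[name]]


if __name__ == "__main__":
    # Stand-in for offline.exe (assumption): appends to PREP.nc, leaves OPTIONS.nam alone.
    fake_model = [sys.executable, "-c", "open('PREP.nc', 'a').write('state')"]
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("OPTIONS.nam", "PREP.nc"):
            (Path(tmp) / name).write_text("input")
        print(inputs_modified_by(fake_model, tmp, ["OPTIONS.nam", "PREP.nc"]))
```

An empty list after a real run means the program left both inputs untouched.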
So can I use dask or numba to help this parallelization?
No. Dask and Numba are not meant to be used in this context. They are designed to operate on NumPy arrays in Python code. The part you want to parallelize is in bash, and the program being parallelized is apparently not even written in Python.
Upvotes: 1