Reputation: 7044
For every file in a certain directory, I need to read its contents and then do something to the file based on what it contains.
I thought I'd parallelize that, so that multiple files can be dealt with simultaneously. (I used Python's joblib.)
But it was slower than the sequential implementation.
Is that because each operation on a file involves IO, and IO cannot be parallelized, so there is no speed-up from parallelization and only a slowdown from switching between all of the forked processes?
More details:
227,732 files (all of them .dat and <100 kB).
1 quad-core CPU.
Ubuntu 13.04.
Time taken for the sequential version: 9 secs.
Time taken for the parallel version: 64 secs.
import os
import time
from joblib import Parallel, delayed

def parallel(data_dir, dirlist):
    Parallel(n_jobs=-1)(delayed(good_or_bad_file)(filename, data_dir)
                        for filename in dirlist if filename.endswith('.dat'))

def sequential(data_dir, dirlist):
    t = time.clock()
    [good_or_bad_file(filename, data_dir) for filename in
     dirlist if filename.endswith('.dat')]
    print(time.clock() - t)

def good_or_bad_file(filename, data_dir):
    # Symlink the file into good_data/ or bad_data/ depending on its contents.
    fullname = os.path.join(data_dir, filename)
    rootname = os.path.splitext(filename)[0]
    with open(fullname) as f:
        content = f.readlines()
    if 'NoPhotoOfJoint\r\n' in content or 'PoorPhoto\r\n' in content:
        os.symlink(fullname, data_dir + '/bad_data/' + rootname + '.jpg')
        os.symlink(fullname, data_dir + '/bad_data/' + rootname + '.dat')
    else:
        os.symlink(fullname, data_dir + '/good_data/' + rootname + '.jpg')
        os.symlink(fullname, data_dir + '/good_data/' + rootname + '.dat')
Note: I'm aware that there wasn't much point in parallelising such a light operation; this was just for practice.
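For reference, a minimal sketch of how both paths might be timed the same way (the data_dir path is a placeholder, and the functions above are assumed to be in scope); wall-clock time is the fair comparison here, since time.clock() on Unix counts only the parent process's CPU time and would miss work done in the forked workers:

import os
import time

data_dir = '/path/to/data'   # placeholder directory
dirlist = os.listdir(data_dir)

t0 = time.time()             # wall-clock time, not CPU time
sequential(data_dir, dirlist)
print('sequential: %.1f s' % (time.time() - t0))

t0 = time.time()
parallel(data_dir, dirlist)
print('parallel:   %.1f s' % (time.time() - t0))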
Upvotes: 2
Views: 522
Reputation: 363517
There are several things to be aware of here:
joblib has to spawn worker processes via the multiprocessing module and then communicate with those processes to exchange data. It's really meant for scientific computing workloads, where this overhead is offset by heavy computations in the child processes.

Upvotes: 3
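As a sketch of how that overhead can be amortised, one option is to hand each worker a large chunk of filenames rather than a single file, so each dispatched task does enough work to outweigh the process start-up and communication cost. The chunk size and the helper names (process_chunk, parallel_chunked) below are illustrative, not from the original post:

from joblib import Parallel, delayed

def process_chunk(filenames, data_dir):
    # Handle many files in one task so the per-task overhead is paid once.
    for filename in filenames:
        good_or_bad_file(filename, data_dir)

def parallel_chunked(data_dir, dirlist, chunk=5000):
    dats = [f for f in dirlist if f.endswith('.dat')]
    chunks = [dats[i:i + chunk] for i in range(0, len(dats), chunk)]
    Parallel(n_jobs=-1)(delayed(process_chunk)(c, data_dir) for c in chunks)

Even with chunking, work this light and IO-bound may still not beat the sequential loop; joblib also supports a threading backend (e.g. Parallel(n_jobs=-1, backend='threading')), which avoids spawning processes and pickling data at the cost of contending for the GIL.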