Alex Dalyac

Reputation: 7044

Parallelise an IO-heavy for loop: stupid idea?

For every file in a certain directory, I need to read its contents, and do something to that file based on its contents.

I thought I'd parallelize that, so that multiple files can be dealt with simultaneously. (I used python joblib.)

But it was slower than the sequential implementation.

Is that because each operation on a file involves IO, and IO cannot be parallelized? So there is no speed-up from parallelization, only a slowdown from switching between the forked processes?


More details:

227,732 files (all of them .dat and <100 kB).
1 quad-core CPU.
Ubuntu 13.04.

time taken for sequential: 9 secs.
time taken for parallel: 64 secs.

import os
import time

from joblib import Parallel, delayed

def parallel(data_dir, dirlist):
  # One worker per core (n_jobs=-1); classify every .dat file in the directory.
  Parallel(n_jobs=-1)(delayed(good_or_bad_file)(filename, data_dir)
                      for filename in dirlist if filename.endswith('.dat'))

def sequential(data_dir, dirlist):
  t = time.clock()
  [good_or_bad_file(filename, data_dir) for filename in
   dirlist if filename.endswith('.dat')]
  print('sequential: %.2f s' % (time.clock() - t))

def good_or_bad_file(filename, data_dir):
  fullname = os.path.join(data_dir, filename)
  rootname = os.path.splitext(filename)[0]
  with open(fullname) as f:
    content = f.readlines()
  # Files flagged as unusable are symlinked into bad_data/, the rest into good_data/.
  if 'NoPhotoOfJoint\r\n' in content or 'PoorPhoto\r\n' in content:
    os.symlink(fullname, data_dir + '/bad_data/' + rootname + '.jpg')
    os.symlink(fullname, data_dir + '/bad_data/' + rootname + '.dat')
  else:
    os.symlink(fullname, data_dir + '/good_data/' + rootname + '.jpg')
    os.symlink(fullname, data_dir + '/good_data/' + rootname + '.dat')

Note: I'm aware that there wasn't much point in parallelising such a light operation; this was practice.

Upvotes: 2

Views: 522

Answers (1)

Fred Foo

Reputation: 363517

There are several things to be aware of here:

  • joblib carries a lot of overhead because it has to spawn separate processes using Python's multiprocessing module, then communicate with those processes to exchange data. It's really meant for scientific computing workloads, where this overhead is offset by heavy computations in the child processes.
  • All the worker processes are accessing the same disk. Only one of them can pull data off the disk at any time, so if the rest don't have enough computation to do, they sit around waiting. (Disk bandwidths are puny compared to CPU bandwidths.)
  • Because disks are so slow, you typically want two workers in this situation, not "as many as there are CPUs/cores", since every worker beyond the second only adds overhead. Two workers can keep both the disk and one core busy at the same time (though they won't give you even a two-fold speedup).
  • joblib 0.8 has a threading backend that you can try; a minimal sketch follows below.
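
Putting the last two points together, here is a minimal sketch (not a drop-in fix) of how the call might look. It assumes the question's good_or_bad_file is in scope; the name parallel_two_workers is made up for illustration:

from joblib import Parallel, delayed

def parallel_two_workers(data_dir, dirlist):
    # Two workers are enough for a disk-bound job; anything beyond that only
    # adds overhead. backend="threading" (new in joblib 0.8) runs the work in
    # threads instead of forked processes, so there is no process start-up or
    # pickling cost, and the GIL is released while a thread waits on file IO.
    Parallel(n_jobs=2, backend="threading")(
        delayed(good_or_bad_file)(filename, data_dir)
        for filename in dirlist if filename.endswith('.dat'))

Even then, don't expect more than a modest speedup: the job is bound by a single disk, not by the CPU.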

Upvotes: 3
