Reputation: 1962
I have to apologise in advance 'cause this question is quite general and may not be clear enough. The question is: how would you run in parallel a Python function that itself uses a pool of processes for some subtasks and does lots of heavy I/O operations? Is it even a valid task?
I will try to provide some more information. I've got a procedure, say test_reduce(), that I need to run in parallel. I tried several ways to do that (see below), and I seem to lack some knowledge to understand why all of them fail.

This test_reduce() procedure does lots of things. Some of those are more relevant to the question than others, and I list them below:
- it uses the multiprocessing module (sic!), namely a pool.Pool instance,
- it uses the numpy and scikit-learn libs,
- it uses the dill lib to pickle some stuff.

First I tried to use a multiprocessing.dummy.Pool (which seems to be a thread pool). I don't know what is specific about this pool and why it is, eh, "dummy"; the whole thing worked, and I got my results. The problem is CPU load: for parallelized sections of test_reduce() it was 100% for all cores, while for synchronous sections it was around 40-50% most of the time. I can't say there was any increase in overall speed for this type of "parallel" execution.
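For reference, here is a rough sketch of the shape of my setup; subtask, combine and data_chunks are hypothetical stand-ins for the real code. (As I understand it, multiprocessing.dummy exposes the same API as multiprocessing but is backed by threads, which would explain the CPU-load pattern: the synchronous, pure-Python sections are serialised by the GIL, while only the inner process pools can use all cores.)

    import multiprocessing
    import multiprocessing.dummy

    def subtask(item):
        # hypothetical: some heavy CPU- and I/O-bound piece of work
        return item

    def combine(partials):
        # hypothetical: the reduce step
        return sum(partials)

    def test_reduce(chunk):
        # the procedure itself opens a process pool for its subtasks
        inner = multiprocessing.Pool(processes=4)
        try:
            partials = inner.map(subtask, chunk)
        finally:
            inner.close()
            inner.join()
        return combine(partials)

    if __name__ == '__main__':
        data_chunks = [[1, 2, 3], [4, 5, 6]]
        # the outer pool from this attempt: a *thread* pool, despite the name
        outer = multiprocessing.dummy.Pool(2)
        results = outer.map(test_reduce, data_chunks)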
Then I tried to use a multiprocessing.pool.Pool instance to map this procedure to my data. It failed with the following:
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
cPickle.PicklingError: Can't pickle <type 'thread.lock'>: attribute lookup thread.lock failed
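This failure is easy to reproduce in isolation: the standard pickler refuses to serialise lock objects (thread.lock is exactly the type that threading.Lock() returns in Python 2), and a pool.Pool instance, for example, holds such locks internally:

    import cPickle
    import threading

    lock = threading.Lock()   # an instance of thread.lock
    cPickle.dumps(lock)       # fails: lock objects cannot be pickled

So everything reachable from the function and arguments passed to map() must be picklable, and a live Pool instance is not.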
I made a guess that cPickle is to blame, and found the pathos lib, which uses the far more advanced pickler dill. However, it also fails:
File "/local/lib/python2.7/site-packages/dill/dill.py", line 199, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1083, in load_newobj
obj = cls.__new__(cls, *args)
TypeError: object.__new__(generator) is not safe, use generator.__new__()
Now, this error is something I don't understand at all. I've got no output to stdout from my procedure when it works in a pool, so it's hard to guess what's going on. The only thing I know is that test_reduce() runs successfully when no multiprocessing is used.
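One observation, for reference: the message mentions generator, and even a trivial generator cannot be pickled by the standard pickler, so anything that carries one around will hit this class of failure:

    import cPickle

    def gen():
        yield 1

    cPickle.dumps(gen())   # fails: generator objects cannot be pickled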
So, how would you run in parallel something that heavy and complicated?
Upvotes: 4
Views: 1261
Reputation: 1962
So, thanks to @MikeMcKerns' answer, I found out how to get the job done with the pathos lib. I needed to get rid of all pymongo cursors, which (being generators) could not be pickled by dill; doing that solved the problem, and I managed to run my code in parallel.
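A minimal sketch of the resulting pattern; the database and collection names and the per-document work are hypothetical, and it assumes a MongoDB instance running locally:

    from pathos.multiprocessing import ProcessingPool as Pool
    from pymongo import MongoClient

    def test_reduce(doc):
        # hypothetical stand-in for the real per-document work
        return len(doc)

    if __name__ == '__main__':
        collection = MongoClient().mydb.mycollection

        # materialise the cursor up front: a list of plain dicts pickles
        # fine, while the live cursor (being a generator) does not
        docs = list(collection.find({}))

        results = Pool(nodes=4).map(test_reduce, docs)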
Upvotes: 1