Reputation: 1962
I have to apologise in advance 'cause this question is quite general and may not be clear enough. The question is: how would you run in parallel a Python function that itself uses a pool of processes for some subtasks and does lots of heavy I/O operations? Is it even a valid task?
I will try to provide some more information. I've got a procedure, say test_reduce(), that I need to run in parallel. I tried several ways to do that (see below), and I seem to lack some knowledge to understand why all of them fail.

This test_reduce() procedure does lots of things. Some of those are more relevant to the question than others, and I list them below:
- it uses the multiprocessing module (sic!), namely a pool.Pool instance,
- it uses the numpy and scikit-learn libs,
- it uses the dill lib to pickle some stuff.

First I tried to use a multiprocessing.dummy.Pool (which seems to be a thread pool). I don't know what is specific about this pool and why it is, eh, "dummy"; the whole thing worked, and I got my results. The problem is CPU load: for parallelized sections of test_reduce() it was 100% for all cores, while for synchronous sections it was around 40-50% most of the time. I can't say there was any increase in overall speed for this type of "parallel" execution.
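For reference, here is a rough sketch of the shape of my setup; subtask, combine and data_chunks are hypothetical stand-ins for the real code. (As I understand it, multiprocessing.dummy exposes the same API as multiprocessing but is backed by threads, which would explain the CPU-load pattern: the synchronous, pure-Python sections are serialised by the GIL, while only the inner process pools can use all cores.)

    import multiprocessing
    import multiprocessing.dummy

    def subtask(item):
        # hypothetical: some heavy CPU- and I/O-bound piece of work
        return item

    def combine(partials):
        # hypothetical: the reduce step
        return sum(partials)

    def test_reduce(chunk):
        # the procedure itself opens a process pool for its subtasks
        inner = multiprocessing.Pool(processes=4)
        try:
            partials = inner.map(subtask, chunk)
        finally:
            inner.close()
            inner.join()
        return combine(partials)

    if __name__ == '__main__':
        data_chunks = [[1, 2, 3], [4, 5, 6]]
        # the outer pool from this attempt: a *thread* pool, despite the name
        outer = multiprocessing.dummy.Pool(2)
        results = outer.map(test_reduce, data_chunks)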
Then I tried to use a multiprocessing.pool.Pool instance to map this procedure to my data. It failed with the following:
File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
cPickle.PicklingError: Can't pickle <type 'thread.lock'>: attribute lookup thread.lock failed
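This failure is easy to reproduce in isolation: the standard pickler refuses to serialise lock objects (thread.lock is exactly the type that threading.Lock() returns in Python 2), and a pool.Pool instance, for example, holds such locks internally:

    import cPickle
    import threading

    lock = threading.Lock()   # an instance of thread.lock
    cPickle.dumps(lock)       # fails: lock objects cannot be pickled

So everything reachable from the function and arguments passed to map() must be picklable, and a live Pool instance is not.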
I made a guess that cPickle is to blame, and found the pathos lib, which uses the far more advanced pickler dill. However, it also fails:
File "/local/lib/python2.7/site-packages/dill/dill.py", line 199, in load
obj = pik.load()
File "/usr/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
File "/usr/lib/python2.7/pickle.py", line 1083, in load_newobj
obj = cls.__new__(cls, *args)
TypeError: object.__new__(generator) is not safe, use generator.__new__()
Now, this error is something I don't understand at all. I've got no output to stdout from my procedure when it works in a pool, so it's hard to guess what's going on. The only thing I know is that test_reduce() runs successfully when no multiprocessing is used.
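One observation, for reference: the message mentions generator, and even a trivial generator cannot be pickled by the standard pickler, so anything that carries one around will hit this class of failure:

    import cPickle

    def gen():
        yield 1

    cPickle.dumps(gen())   # fails: generator objects cannot be pickled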
So, how would you run in parallel something that heavy and complicated?
Upvotes: 4
Views: 1261
Reputation: 1962
So, thanks to @MikeMcKerns' answer, I found out how to get the job done with the pathos lib. I needed to get rid of all pymongo cursors, which (being generators) could not be pickled by dill; doing that solved the problem, and I managed to run my code in parallel.
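A minimal sketch of the resulting pattern; the database and collection names and the per-document work are hypothetical, and it assumes a MongoDB instance running locally:

    from pathos.multiprocessing import ProcessingPool as Pool
    from pymongo import MongoClient

    def test_reduce(doc):
        # hypothetical stand-in for the real per-document work
        return len(doc)

    if __name__ == '__main__':
        collection = MongoClient().mydb.mycollection

        # materialise the cursor up front: a list of plain dicts pickles
        # fine, while the live cursor (being a generator) does not
        docs = list(collection.find({}))

        results = Pool(nodes=4).map(test_reduce, docs)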
Upvotes: 1