rhombidodecahedron
rhombidodecahedron

Reputation: 7922

Creating objects in parallel with a multiprocessing pool?

Python 2.7.3

I have a folder containing thousands of data files. Each data file gets fed to a constructor and heavily processed. Right now I am iterating through the files and processing them sequentially:

class Foo:
    def __init__(self,file):
        self.bar = do_lots_of_stuff_with_numpy_and_scipy(file)

def do_lots_of_stuff_with_numpy_and_scipy(file):
    pass

def get_foos(dir):
    return [Foo(os.path.join(dir,file)) for file in os.listdir(dir)]

This works beautifully but is so slow. I would like to do this in parallel. I tried:

def parallel_get_foos(dir):
    p = Pool()
    foos = p.map(Foo, [os.path.join(dir,file) for file in os.listdir(dir)])
    p.close()
    p.join()
    return foos

if __name__ == "__main__":
    foos = parallel_get_foos(sys.argv[1])

But it just errors out with lots of these:

Process PoolWorker-7:
Traceback (most recent call last):
  File "/l/python2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/l/python2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/l/python2.7/lib/python2.7/multiprocessing/pool.py", line 99, in worker
    put((job, i, result))
  File "/l/python2.7/lib/python2.7/multiprocessing/queues.py", line 390, in put
    return send(obj)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

I have tried making a function to return the object, e.g.:

def get_foo(file):
    return Foo(file)

def parallel_get_foos(dir):
    ...
    foos = p.map(get_foo, [os.path.join(dir,file) for file in os.listdir(dir)])
    ...

but as expected I get the same error.

I have read through a great number of similar threads trying to address problems somewhat like this one but none of the solutions have helped me. So I appreciate any help!

EDIT:

Bakuriu correctly surmised that I am defining a non-top-level function inside of my do_lots_of_stuff method. In particular, I am doing as follows:

def fit_curve(data,degree):
    """Fits a least-square polynomial function to the given data."""
    sorted = data[data[:,0].argsort()].T
    coefficients = numpy.polyfit(sorted[0],sorted[1],degree)
    def eval(val,deg=degree):
        res = 0
        for coefficient in coefficients:
            res += coefficient*val**deg
            deg -= 1
        return res
    return eval

Is there anyway to make this function pickleable?

Upvotes: 1

Views: 4361

Answers (1)

Bakuriu
Bakuriu

Reputation: 102029

What you are doing(at least, what you show in the examples), actually works fine:

$mkdir TestPool
$cd TestPool/
$for i in {1..100}
> do
>     touch "test$i"
> done
$ls
test1    test18  test27  test36  test45  test54  test63  test72  test81  test90
test10   test19  test28  test37  test46  test55  test64  test73  test82  test91
test100  test2   test29  test38  test47  test56  test65  test74  test83  test92
test11   test20  test3   test39  test48  test57  test66  test75  test84  test93
test12   test21  test30  test4   test49  test58  test67  test76  test85  test94
test13   test22  test31  test40  test5   test59  test68  test77  test86  test95
test14   test23  test32  test41  test50  test6   test69  test78  test87  test96
test15   test24  test33  test42  test51  test60  test7   test79  test88  test97
test16   test25  test34  test43  test52  test61  test70  test8   test89  test98
test17   test26  test35  test44  test53  test62  test71  test80  test9   test99
$vi test_pool_dir.py
$cat test_pool_dir.py 
import os
import multiprocessing

class Foo(object):
    def __init__(self, fname):
        self.fname = fname   #or your calculations


def parallel_get_foos(directory):
    p = multiprocessing.Pool()
    foos = p.map(Foo, [os.path.join(directory, fname) for fname in os.listdir(directory)])
    p.close()
    p.join()
    return foos

if __name__ == '__main__':
    foos = parallel_get_foos('.')
    print len(foos)   #expected 101: 100 files plus this script

$python test_pool_dir.py 
101

Version information:

$python --version
Python 2.7.3
$uname -a
Linux giacomo-Acer 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

My guess is that you are not doing exactly what you show in the code samples you showed. For example I get an error similar to yours when doing this:

>>> import pickle
>>> def test():
...     def test2(): pass
...     return test2
... 
>>> import multiprocessing
>>> p = multiprocessing.Pool()
>>> p.map(test(), [1,2,3])
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

Which is obvious since:

>>> pickle.dumps(test())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/pickle.py", line 1374, in dumps
    Pickler(file, protocol).dump(obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 748, in save_global
    (obj, module, name))
pickle.PicklingError: Can't pickle <function test2 at 0x7fad15fc2938>: it's not found as __main__.test2

And pickle's documentation states that:

The following types can be pickled:

  • None, True, and False
  • integers, long integers, floating point numbers, complex numbers
  • normal and Unicode strings
  • tuples, lists, sets, and dictionaries containing only picklable objects
  • functions defined at the top level of a module
  • built-in functions defined at the top level of a module
  • classes that are defined at the top level of a module
  • instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section The pickle protocol for details).

And continues:

Note that functions (built-in and user-defined) are pickled by “fully qualified” name reference, not by value. This means that only the function name is pickled, along with the name of the module the function is defined in. Neither the function’s code, nor any of its function attributes are pickled. Thus the defining module must be importable in the unpickling environment, and the module must contain the named object, otherwise an exception will be raised.

Upvotes: 1

Related Questions