Reputation: 33
I have a generator for the powerset of a list in Python, and I want to do some calculations on the elements of this set using the multiprocessing module. My code looks like:
import multiprocessing

def powerset(seq):
    '''Returns all the subsets of the list. This is a generator.'''
    if len(seq) == 0:
        yield seq
    elif len(seq) == 1:
        yield seq
        yield []
    else:
        for item in powerset(seq[1:]):
            yield [seq[0]] + item
            yield item
def job(l):
    # do some calculation with the list l
    return do_some_hard_work(l)
def calculate():
    pool_size = multiprocessing.cpu_count() * 2
    pool = multiprocessing.Pool(processes=pool_size, maxtasksperchild=2)
    pool_outputs = pool.map(job, powerset(list(range(1, 10))))
    pool.close()
    pool.join()
    return sum(pool_outputs)
The problem is that powerset is a generator, and pool.map will materialize the whole iterable into a list anyway. I cannot replace the generator with a list myself, because generating the whole powerset up front takes too much time and memory. Does anyone have an idea how I can solve this problem?
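For illustration (a sketch, reusing the powerset generator above): the number of subsets grows as 2**n, which is why building the full list is off the table even for modestly sized inputs, while pulling a few items from the generator stays cheap:

from itertools import islice

n = 30
print(2 ** n)  # 1073741824 subsets for a 30-element list; too many to hold
# the generator is lazy: islice produces only the first 5 subsets
print(list(islice(powerset(list(range(n))), 5)))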
Upvotes: 2
Views: 180
Reputation: 94891
If the issue is that you want to avoid having to put the whole powerset in a list, you can use pool.imap, which will consume your iterator chunksize elements at a time and send those off to the worker processes, rather than converting the whole thing into a list and chunking that up.
pool_size = multiprocessing.cpu_count() * 2
pool = multiprocessing.Pool(processes=pool_size, maxtasksperchild=2)
pool_outputs = pool.imap(job, powerset(list(range(1,10))), chunksize=<some chunksize>)
pool.close()
pool.join()
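Note that imap returns a lazy iterator rather than a list, so in the calculate() function from the question the results would be consumed the same way. A sketch (chunksize=16 is just an arbitrary example value):

def calculate():
    pool_size = multiprocessing.cpu_count() * 2
    pool = multiprocessing.Pool(processes=pool_size, maxtasksperchild=2)
    # summing the imap iterator pulls results as the workers finish them
    total = sum(pool.imap(job, powerset(list(range(1, 10))), chunksize=16))
    pool.close()
    pool.join()
    return total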
If your powerset is very large, you'll want to specify a chunksize other than the default, which is 1:
The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
The map function uses the following algorithm internally, to give you an idea of a good size:
chunksize, extra = divmod(len(iterable), pool_size * 4)
if extra:
    chunksize += 1
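A generator has no len(), so you can't plug it into this formula directly; for a powerset, though, the size is known in advance: an n-element input yields 2**n subsets. A sketch of sizing the chunks that way (pick_chunksize is a hypothetical helper mirroring the heuristic above):

import multiprocessing

def pick_chunksize(n_items, pool_size):
    # same heuristic as Pool.map: roughly 4 chunks per worker process
    chunksize, extra = divmod(n_items, pool_size * 4)
    if extra:
        chunksize += 1
    return chunksize

pool_size = multiprocessing.cpu_count() * 2
n_subsets = 2 ** 9  # list(range(1, 10)) has 9 elements
print(pick_chunksize(n_subsets, pool_size))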
Upvotes: 1