Reputation: 33
I have a generator for the powerset of a list in Python, and I want to do some calculations on the elements of this set using the multiprocessing module. My code looks like:
import multiprocessing

def powerset(seq):
    '''Returns all the subsets of the list. This is a generator.'''
    if len(seq) == 0:
        yield seq
    elif len(seq) == 1:
        yield seq
        yield []
    else:
        for item in powerset(seq[1:]):
            yield [seq[0]] + item
            yield item
def job(l):
    # do some calculation with the list l
    return do_some_hard_work(l)
def calculate():
    pool_size = multiprocessing.cpu_count() * 2
    pool = multiprocessing.Pool(processes=pool_size, maxtasksperchild=2)
    pool_outputs = pool.map(job, powerset(list(range(1, 10))))
    pool.close()
    pool.join()
    return sum(pool_outputs)
The problem is that powerset is a generator, and pool.map will materialize the whole iterable into a list anyway. I cannot replace the generator with a list myself, because generating the whole powerset up front takes too much time and memory. Does anyone have an idea how I can solve this problem?
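For illustration (a sketch, reusing the powerset generator above): the number of subsets grows as 2**n, which is why building the full list is off the table even for modestly sized inputs, while pulling a few items from the generator stays cheap:

from itertools import islice

n = 30
print(2 ** n)  # 1073741824 subsets for a 30-element list; too many to hold
# the generator is lazy: islice produces only the first 5 subsets
print(list(islice(powerset(list(range(n))), 5)))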
Upvotes: 2
Views: 180
Reputation: 94891
If the issue is that you want to avoid having to put the whole powerset in a list, you can use pool.imap, which will consume your iterator chunksize elements at a time and send those off to the worker processes, rather than converting the whole thing into a list and chunking that up.
pool_size = multiprocessing.cpu_count() * 2
pool = multiprocessing.Pool(processes=pool_size, maxtasksperchild=2)
pool_outputs = pool.imap(job, powerset(list(range(1,10))), chunksize=<some chunksize>)
pool.close()
pool.join()
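Note that imap returns a lazy iterator rather than a list, so in the calculate() function from the question the results would be consumed the same way. A sketch (chunksize=16 is just an arbitrary example value):

def calculate():
    pool_size = multiprocessing.cpu_count() * 2
    pool = multiprocessing.Pool(processes=pool_size, maxtasksperchild=2)
    # summing the imap iterator pulls results as the workers finish them
    total = sum(pool.imap(job, powerset(list(range(1, 10))), chunksize=16))
    pool.close()
    pool.join()
    return total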
If your powerset is very large, you'll want to specify a chunksize other than the default, which is 1:
The chunksize argument is the same as the one used by the map() method. For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
The map function uses the following algorithm internally, to give you an idea of a good size:
chunksize, extra = divmod(len(iterable), pool_size * 4)
if extra:
    chunksize += 1
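A generator has no len(), so you can't plug it into this formula directly; for a powerset, though, the size is known in advance: an n-element input yields 2**n subsets. A sketch of sizing the chunks that way (pick_chunksize is a hypothetical helper mirroring the heuristic above):

import multiprocessing

def pick_chunksize(n_items, pool_size):
    # same heuristic as Pool.map: roughly 4 chunks per worker process
    chunksize, extra = divmod(n_items, pool_size * 4)
    if extra:
        chunksize += 1
    return chunksize

pool_size = multiprocessing.cpu_count() * 2
n_subsets = 2 ** 9  # list(range(1, 10)) has 9 elements
print(pick_chunksize(n_subsets, pool_size))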
Upvotes: 1