Hyperion

Reputation: 2625

multiprocessing pool.map not processing list in order

I have this script to process some URLs in parallel:

import multiprocessing
import time

list_of_urls = []

for i in range(1, 1000):
    list_of_urls.append('http://example.com/page=' + str(i))

def process_url(url):
    page_processed = url.split('=')[1]
    print('Processing page %s' % page_processed)
    time.sleep(5)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(process_url, list_of_urls)

The list is ordered, but when I run it, the script doesn't pick the URLs from the list in order:

Processing page 1
Processing page 64
Processing page 127
Processing page 190
Processing page 65
Processing page 2
Processing page 128
Processing page 191

Instead, I would like it to process pages 1, 2, 3, 4 first, and then continue following the order of the list. Is there an option to do this?

Upvotes: 8

Views: 7950

Answers (2)

grzgrzgrz3

Reputation: 350

If you do not pass a chunksize argument, map will calculate one with this algorithm:

chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
if extra:
    chunksize += 1
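
Plugging the question's numbers into that formula shows where the jumps of 63 in the output come from; this is just an illustrative check, and the variable names below are made up:

n_tasks = 999          # len(list_of_urls) from the question
n_workers = 4          # the pool size
chunksize, extra = divmod(n_tasks, n_workers * 4)  # divmod(999, 16) -> (62, 7)
if extra:
    chunksize += 1
print(chunksize)  # 63 -- so the worker batches start at pages 1, 64, 127, 190, ...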

map cuts your iterable into these chunks (task batches) and hands each chunk to a worker process as a single unit. That is why the pages are not started in list order. The solution is to set the chunk size to 1.

import multiprocessing
import time

def process(task):
    print('task:', task)
    time.sleep(1)

if __name__ == '__main__':
    list_test = list(range(10))
    pool = multiprocessing.Pool(processes=3)
    # chunksize=1 hands tasks to the workers one at a time, in list order
    pool.map(process, list_test, chunksize=1)

task: 0
task: 1
task: 2
task: 3
task: 4
task: 5
task: 6
task: 7
task: 8
task: 9

Upvotes: 12

cranky_monkey

Reputation: 21

Multiprocessing is an asynchronous operation, meaning it is by definition non-sequential. Workers (threads, or in Python's case processes) pull URLs from your list, and there is no guarantee which one will finish first. So URL 1 might begin processing before URL 64, but because of randomness in network I/O, for example, URL 64 might finish first.

First, ask yourself whether you truly need to perform these operations in order. If the answer is yes, your best bet is a blocking step: one that forces all parallel computations to complete first, and then sorts the completed results.
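
A minimal sketch of that blocking-then-sorting idea; the worker function and URL list here are stand-ins for the question's, and the names are illustrative. Results arrive in completion order, and the sort restores list order afterwards:

import multiprocessing

def process_url(url):
    # Stand-in worker: return (page number, result) so the output can be sorted.
    page = int(url.split('=')[1])
    return page, 'processed page %d' % page

if __name__ == '__main__':
    urls = ['http://example.com/page=' + str(i) for i in range(1, 10)]
    pool = multiprocessing.Pool(processes=4)
    # imap_unordered yields results as workers finish, in no fixed order;
    # exhausting it is the blocking step that waits for all of them.
    results = list(pool.imap_unordered(process_url, urls))
    pool.close()
    pool.join()
    # Sort the completed results back into page order afterwards.
    results.sort(key=lambda pair: pair[0])
    print([page for page, _ in results])  # [1, 2, 3, 4, 5, 6, 7, 8, 9]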

So if your list of URLs is very large and you want some element of order while still taking advantage of parallelization, you can chunk your list and then run each chunk sequentially through the parallel logic above.
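
A minimal sketch of that chunked approach, assuming a chunk size of 4 to match the pool size; the chunked helper is made up for illustration. Each chunk must finish completely before the next one starts, so pages 1-4 run first, then 5-8, and so on, at the cost of some idle workers at the end of each chunk:

import multiprocessing
import time

def process_url(url):
    print('Processing page %s' % url.split('=')[1])
    time.sleep(5)

def chunked(seq, size):
    # Yield successive slices of seq, each with at most `size` elements.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

if __name__ == '__main__':
    list_of_urls = ['http://example.com/page=' + str(i) for i in range(1, 1000)]
    pool = multiprocessing.Pool(processes=4)
    for chunk in chunked(list_of_urls, 4):
        # Blocks until all 4 pages in this chunk are done before moving on.
        pool.map(process_url, chunk)
    pool.close()
    pool.join()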

Upvotes: 0
