SilentSun

Reputation: 21

How to append result of multiprocessing Pool based on index from input list in Python?

Overall, my script is taking an input of:

  1. Address Search Query
  2. Lat/Lon Coordinates

What I need to do is call a geocoding API to get a response for each address in the query list, parse the XML response to get the information I need, and check whether the newly returned point matches the point on file.

I had this working fine until I tried to use Python's multiprocessing module to speed up the task.

With multiprocessing I still get a final result, but because the queries are processed in a random order, the results I receive are not matched up with the correct input queries.

e.g. the result for "123 Main Street" is appended to "431 Main Street", and the result for "431 Main Street" is appended to "123 Main Street".

My question is: How do I get the multiprocessing result to append to the correct query rather than appending based on the order of processing?

I am using a Pandas DataFrame to keep track of the data.

Relevant portion:

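    import requests
    import xml.etree.ElementTree as ET  # assumed; the original excerpt did not show its imports

    def apiRequest(query):
        url = 'URL goes here'
        parameters = {'q': query}  # other parameters are here
        request = requests.get(url, params=parameters)
        # Parse the XML response body into an Element tree
        result = ET.fromstring(request.text)
        return result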

    results = pool.map(apiRequest,queryList)

    #This is where I append the result where order is based on multiprocessing result list
    i=0
    for result in results:
        df.loc[result[i],'Result Text'] = result
        i=i+1  

Edit: The linked thread is very similar but not exactly what I needed. I found out from a comment below that the list returned by pool.map is in the order of the input list, not the order of processing. With this information I realized I just needed to reference the index of the response. I did this using the enumerate function from the linked thread, so it was helpful.

Another, unrelated issue now: it seems the multiprocessing just isn't working. It takes double the time it was taking before. Fix one issue and another arises!
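A minimal sketch of the enumerate idea described above, not the exact code from the script. The names `fake_lookup` and `pair_by_index` and the sample addresses are hypothetical stand-ins; `fake_lookup` replaces the real API request so that no network call is made.

```python
from multiprocessing import Pool


def fake_lookup(query):
    # Hypothetical stand-in for the real API request
    return "result for " + query


def pair_by_index(query_list, results):
    # pool.map keeps input order, so results[i] belongs to query_list[i]
    return [(query_list[i], result) for i, result in enumerate(results)]


if __name__ == "__main__":
    query_list = ["123 Main Street", "431 Main Street"]
    with Pool(2) as pool:
        results = pool.map(fake_lookup, query_list)
    for query, result in pair_by_index(query_list, results):
        print(query, "->", result)
```

The index from enumerate is only used to look up the matching input, which is safe because the result list has the same order and length as the input list.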

Thanks for the help!

Upvotes: 2

Views: 1480

Answers (1)

Peter Gibson

Reputation: 19564

The results from pool.map are returned in the order matching the input data. Consider the following example.

    from multiprocessing import Pool
    import time, random

    def f(x):
        t = random.random()  # sleep for a random time to mix up the results
        time.sleep(t)
        print(x)
        return (t, str(x))

    if __name__ == '__main__':
        p = Pool(3)  # 3 worker processes
        data = range(10)
        print(p.map(f, data))

Which results in:

    1
    2
    4
    5
    0
    3
    7
    6
    8
    9
    [(0.8381880180345248, '0'), (0.3361198414214449, '1'), (0.48073509426290906, '2'), (0.5767279178958461, '3'), (0.14369537417791844, '4'), (0.1914456539782432, '5'), (0.7090097213160568, '6'), (0.624456052752851, '7'), (0.79705548172654, '8'), (0.9956179715628799, '9')]

Note that even though the results are computed out of order due to the random delays, the result list is in the correct order.

I suspect the problem is the way you're handling the results.

    #This is where I append the result where order is based on multiprocessing result list
    i=0
    for result in results:
        df.loc[result[i],'Result Text'] = result
        i=i+1

You're already iterating through results, so why then do you index the result with an incrementing number?

Instead it sounds like you should reference the matching input data from queryList, for instance:

    for query, result in zip(queryList, results):
        # this is probably not quite right, but basically do something
        # with query and result
        df.loc[query, 'Result Text'] = result
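Here is a self-contained sketch of that pairing. It assumes the DataFrame is indexed by the query string with unique labels, which the question does not confirm. `fake_lookup` and `attach_results` are hypothetical names, and `fake_lookup` stands in for `apiRequest` so the example runs without a network call.

```python
from multiprocessing import Pool

import pandas as pd


def fake_lookup(query):
    # Hypothetical stand-in for apiRequest; no network call is made
    return query.upper()


def attach_results(df, query_list, results):
    # Write each result into the row whose index label is the matching query
    for query, result in zip(query_list, results):
        df.loc[query, "Result Text"] = result
    return df


if __name__ == "__main__":
    query_list = ["123 Main Street", "431 Main Street"]
    # Assumed layout: one row per query, indexed by the query string
    df = pd.DataFrame({"Result Text": [None, None]}, index=query_list)
    with Pool(2) as pool:
        results = pool.map(fake_lookup, query_list)
    print(attach_results(df, query_list, results))
```

If the DataFrame uses a different index (for example the default integer index), the lookup inside `attach_results` would need to change accordingly, for instance by zipping over `df.index` instead of the query strings.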

Upvotes: 4
