user2523485

Python multiprocessing - Return a dict

I'd like to parallelize a function that returns a flattened list of values (called "keys") in a dict, but I don't understand how to obtain the final result. I have tried:

import multiprocessing
import pandas as pd

def toParallel(ht, token):
    keys = []
    words = token[token['hashtag'] == ht]['word']
    for w in words:
        keys.append(checkString(w))  # checkString is a helper defined elsewhere
    y = {ht: keys}
    return y  # return the per-hashtag dict so the parent process can collect it

num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)

token = pd.read_csv('/path', sep=",", header = None, encoding='utf-8')
token.columns = ['word', 'hashtag', 'count']
hashtag = pd.DataFrame(token.groupby(by='hashtag', as_index=False).count()['hashtag'])

result = pd.DataFrame(index = hashtag['hashtag'], columns = range(0, 21))
result = result.fillna(0)

final_result = [pool.apply_async(toParallel, args=(ht, token)) for ht in hashtag['hashtag']]

Where the toParallel function should return a dict with the hashtag as key and a list of keys (where the keys are ints). But if I try to print final_result, I obtain only:

<bound method ApplyResult.get of <multiprocessing.pool.ApplyResult object at 0x10c4fa950>>

How can I do it?

Upvotes: 5

Views: 8294

Answers (1)

Ricardo Cruz

Reputation: 3593

final_result = [pool.apply_async(toParallel, args=(ht, token)) for ht in hashtag['hashtag']]

You can either use Pool.apply() and get the result right away (in which case you do not need multiprocessing hehe, the function is just there for completeness), or use Pool.apply_async() followed by ApplyResult.get(). Pool.apply_async() is asynchronous: it returns immediately with an ApplyResult handle rather than the value itself.

Something like this:

workers = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
final_result = [worker.get() for worker in workers]
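Since each call to toParallel returns a one-entry dict, you can then merge the per-worker results into the single {hashtag: keys} dict you were after. A minimal sketch (final_dict is an illustrative name, not from the question):

final_dict = {}
for d in final_result:
    final_dict.update(d)  # fold each {ht: keys} entry into one dict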

Alternatively, you can also use Pool.map(), which will submit the work and collect the results for you, as sketched below.
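A minimal sketch using functools.partial to bind the token argument (the with block and the dicts name are illustrative, not from the question):

from functools import partial

with multiprocessing.Pool() as pool:
    # map() blocks until every result is ready, so no .get() calls are needed
    dicts = pool.map(partial(toParallel, token=token), hashtag['hashtag'])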

Either way, I recommend you read the documentation carefully.


Addendum: When answering this question, I presumed the OP was using some Unix operating system like Linux or OS X. If you are using Windows, you must not forget to safeguard your parent/worker code with if __name__ == '__main__'. This is because Windows lacks fork(), so the child process starts by re-importing the file from the top rather than continuing from the point of forking as on Unix, and the if guard keeps the children from re-running the pool setup.
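Here is a minimal sketch of that layout, reusing the code from the question (checkString is still the OP's helper, defined elsewhere):

import multiprocessing
import pandas as pd

def toParallel(ht, token):
    keys = [checkString(w) for w in token[token['hashtag'] == ht]['word']]
    return {ht: keys}

if __name__ == '__main__':
    # Everything meant to run only in the parent process goes under this guard.
    token = pd.read_csv('/path', sep=",", header=None, encoding='utf-8')
    token.columns = ['word', 'hashtag', 'count']
    hashtag = pd.DataFrame(token.groupby(by='hashtag', as_index=False).count()['hashtag'])
    pool = multiprocessing.Pool()
    workers = [pool.apply_async(toParallel, args=(ht, token)) for ht in hashtag['hashtag']]
    final_result = [worker.get() for worker in workers]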


ps: this is unnecessary:

num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)

If you call multiprocessing.Pool() without arguments (or with None), it already creates a pool with as many workers as your CPU count.
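So those two lines can simply become:

pool = multiprocessing.Pool()  # defaults to os.cpu_count() workers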

Upvotes: 2
