Reputation: 615
I've seen bundles of questions on here about multiprocessing, none of which seem to answer my specific problem.
Obviously my problem/function is much more complex, but I've tried to simplify it as much as possible. My function currently takes around 5 minutes to run a single instance, so it would be great if I could get this working in parallel.
Below is basically how I am running things currently: looping through each 'name' in my list, passing it to the function and then adding the result to the dictionary.
import pandas as pd

df = pd.DataFrame({'Matthew': [4, 9, 6], 'Mark': [2, 3, 5], 'Luke': [10, 1, 8], 'John': [20, 22, 21]})

def sum_funct(name, df):
    # Sum the column for the given name
    return int(df[name].sum())

totals_dict = {}
names = ['Matthew', 'Mark', 'Luke', 'John']
for name in names:
    totals_dict[name] = sum_funct(name, df)

print(totals_dict)
{'Matthew': 19, 'Mark': 10, 'Luke': 19, 'John': 63}
What I'd love to be able to do is use multiprocessing on the for name in names: part, but so far I can't find anything about using the values a function returns while multiprocessing. I've come across some answers that touch on functions returning values, but none have been of any help.
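For reference, the pattern I keep running into looks something like this with multiprocessing.Pool (a rough sketch adapting my example above; I haven't managed to get it working for my real case):

from multiprocessing import Pool
from functools import partial

# Rough sketch: Pool.map collects each worker's return value into a
# list, in the same order as 'names', which can then be zipped back
# into a dictionary.
# (On Windows/macOS this needs to run under if __name__ == '__main__':)
with Pool() as pool:
    results = pool.map(partial(sum_funct, df=df), names)
totals_dict = dict(zip(names, results))
print(totals_dict)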
Upvotes: 0
Views: 86
Reputation: 26993
The ProcessPoolExecutor from concurrent.futures could be used for this. For example:
from concurrent.futures import ProcessPoolExecutor
from pandas import DataFrame
from functools import partial

def sum_funct(df, name):
    # Return the name along with the column sum so the results
    # can be collected directly into a dictionary
    return name, df[name].sum()

def main():
    dict_ = {'Matthew': [4, 9, 6], 'Mark': [2, 3, 5], 'Luke': [10, 1, 8], 'John': [20, 22, 21]}
    with ProcessPoolExecutor() as executor:
        # partial fixes the DataFrame argument; map then passes each
        # dictionary key to sum_funct in a separate worker process
        total_dict = dict(executor.map(partial(sum_funct, DataFrame(dict_)), dict_))
    print(total_dict)

if __name__ == '__main__':
    main()
Output:
{'Matthew': 19, 'Mark': 10, 'Luke': 19, 'John': 63}
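If you'd rather handle each result as soon as its worker finishes, instead of in the order map yields them, executor.submit with concurrent.futures.as_completed is another option. A minimal sketch of the same example:

from concurrent.futures import ProcessPoolExecutor, as_completed
from pandas import DataFrame

def sum_funct(df, name):
    return name, df[name].sum()

def main():
    dict_ = {'Matthew': [4, 9, 6], 'Mark': [2, 3, 5], 'Luke': [10, 1, 8], 'John': [20, 22, 21]}
    df = DataFrame(dict_)
    total_dict = {}
    with ProcessPoolExecutor() as executor:
        # submit returns a Future per task; as_completed yields each
        # Future as soon as its worker process finishes
        futures = [executor.submit(sum_funct, df, name) for name in dict_]
        for future in as_completed(futures):
            name, total = future.result()
            total_dict[name] = total
    print(total_dict)

if __name__ == '__main__':
    main()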
Note:
If the dictionary has more keys than you have CPUs, you should probably calculate a suitable value for max_workers and pass it to ProcessPoolExecutor.
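One simple heuristic (a sketch only; the right value depends on the workload) is to cap max_workers at the smaller of the task count and the CPU count:

import os

# Cap the pool at the number of tasks or CPUs, whichever is smaller;
# os.cpu_count() can return None, hence the fallback to 1
max_workers = min(len(dict_), os.cpu_count() or 1)
with ProcessPoolExecutor(max_workers=max_workers) as executor:
    total_dict = dict(executor.map(partial(sum_funct, DataFrame(dict_)), dict_))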
Upvotes: 2