Reputation: 327
I'm trying to speed up some multiprocessing code in Python 3. I have a big read-only DataFrame
and a function that makes some calculations based on the read values.
I tried to solve the issue by writing the function inside the same file and sharing the big DataFrame
as a global, as you can see here. However, this approach does not allow me to move the process function to another file/module, and it's a bit weird to access a variable outside the scope of the function.
import pandas as pd
import multiprocessing

def process(user):
    # Locate all the user sessions in the *global* sessions DataFrame
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series()
    # Make calculations and append to user_session_data
    return user_session_data

# The users DataFrame contains the ID and other info for each user
users = pd.read_csv('users.csv')
# Each row is the details of one user action.
# There are several rows with the same user ID
sessions = pd.read_csv('sessions.csv')

p = multiprocessing.Pool(4)
sessions_id = sessions['user_id'].unique()
# I'm passing an integer ID argument to the process() function so
# there is no copy of the big sessions DataFrame
result = p.map(process, sessions_id)
Things I've tried:
sessions.loc...
line of code. This approach slows down the script a lot. Also, I've looked at How to share pandas DataFrame object between processes? but didn't find a better way.
Upvotes: 2
Views: 1356
Reputation: 2670
You can try defining process as:
def process(sessions, user):
    ...
And put it wherever you prefer.
Then, when you call p.map, you can use the functools.partial
function, which allows you to incrementally specify arguments:
from functools import partial
...
p.map(partial(process, sessions), sessions_id)
This should not slow down the processing too much and should address your issues.
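For completeness, here is a minimal end-to-end sketch of that idea, with process moved to its own module; the file names process_module.py and main.py are assumptions for illustration, not part of the original code:

# process_module.py (hypothetical module name)
import pandas as pd

def process(sessions, user):
    # Work only on this user's rows of the read-only DataFrame
    user_session = sessions.loc[sessions['user_id'] == user]
    user_session_data = pd.Series(dtype=float)
    # Make calculations and append to user_session_data
    return user_session_data

# main.py
from functools import partial
import multiprocessing
import pandas as pd
from process_module import process

if __name__ == '__main__':
    sessions = pd.read_csv('sessions.csv')
    sessions_id = sessions['user_id'].unique()
    with multiprocessing.Pool(4) as p:
        # partial binds the sessions DataFrame as the first argument,
        # so the pool only maps over the user IDs
        result = p.map(partial(process, sessions), sessions_id)

Note that the bound DataFrame is still pickled and sent to the worker processes, so this mainly cleans up the code structure rather than sharing memory between processes.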
Note that you could express the same thing without partial, using:
p.map(lambda id: process(sessions, id), sessions_id)
Keep in mind, though, that multiprocessing.Pool pickles the callable it is given, and lambdas cannot be pickled, so with a process pool the partial version is the one that will actually work.
Upvotes: 2