
Reputation: 1

Share the same Pandas DataFrame between Pool processes without copying it again and again

I have a dataframe which holds a query result of about 1 million rows or more.

When I pass this to a map function that compares two dataframes, the dataframe above gets copied into every process and gives me a memory error.

Sample code

import pandas as pd
from functools import partial
from multiprocessing import Pool

df = pd.read_sql_query('Query returning 1 million or more rows')

def comparison(df, item):
    # comparison logic which uses the df object mentioned above
    ...

p = Pool(2)
fn = partial(comparison, df)
p.map(fn, 'some iterator')

What I want is that when the comparison function is mapped to different processes, it should not copy the df again and again.

I have tried moving the query-fetching part (building the df) inside the comparison function. That works, but it gets executed again for each iterator object, and since the query takes 40-50 seconds to run, this is a time overhead every time. I therefore only want to do it once and use it every time.

Upvotes: 0

Views: 179

Answers (1)

AKX

Reputation: 168966

You commented: "I am on Windows and this df object is in the main function."

Then you're out of luck.

Since there isn't copy-on-write memory on Windows, you can't share a Python variable transparently between multiple processes without copying occurring.
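For contrast, here is a minimal sketch of what copy-on-write sharing looks like on a POSIX system, where `multiprocessing` can use the `fork` start method. The DataFrame here is a small stand-in for the question's SQL result, and the names are illustrative:

```python
import pandas as pd
from multiprocessing import get_context

# Built once in the parent; forked children inherit the same memory
# pages via copy-on-write, so read-only access does not duplicate it.
df = pd.DataFrame({"id": range(10), "value": range(10)})  # stand-in for read_sql_query

def comparison(key):
    # Reads the inherited module-level df without pickling or copying it.
    return int(df.loc[df["id"] == key, "value"].iloc[0])

if __name__ == "__main__":
    with get_context("fork").Pool(2) as pool:
        print(pool.map(comparison, [1, 3, 5]))  # [1, 3, 5]
```

This only works where `fork` is available (Linux, and optionally macOS); Windows always spawns fresh processes, which is exactly why the approach fails there.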

(Copy-on-write mmaps do exist, but to the best of my knowledge they can't serve as the backing memory for DataFrames.)
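One common mitigation on Windows, sketched below, is a `Pool` initializer: each worker still gets its own copy of the data, but the expensive load runs once per worker instead of once per mapped item. The toy DataFrame and the names `load_df`/`comparison` are illustrative, not from the question:

```python
import pandas as pd
from multiprocessing import Pool

df = None  # filled in per worker by the initializer

def load_df():
    # Stand-in for the expensive pd.read_sql_query(...); runs once per
    # worker process, not once per mapped item.
    global df
    df = pd.DataFrame({"id": range(5), "value": range(5)})

def comparison(key):
    # Uses the per-worker df that the initializer loaded.
    return int(df.loc[df["id"] == key, "value"].iloc[0])

if __name__ == "__main__":
    with Pool(2, initializer=load_df) as pool:
        print(pool.map(comparison, [0, 2, 4]))  # [0, 2, 4]
```

With two workers and a 40-50 second query, that is roughly two query executions total rather than one per iterator item.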

Upvotes: 2
