Reputation: 1
I have a DataFrame that holds a query result of about 1 million rows or more.
When I pass it to a map function that compares two DataFrames, this DataFrame gets copied for every process and I get a memory error.
Sample code
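import pandas as pd
from functools import partial
from multiprocessing import Pool
df = pd.read_sql_query('Query returning 1 million or more rows')
def comparison(df, item):
    # Comparison logic that uses the df object mentioned above
    ...
p = Pool(2)
fn = partial(comparison, df)  # df is bound as the first argument
p.map(fn, 'some iterator')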
What I want is that, when the comparison function is mapped to different processes, df is not copied again and again.
I have tried moving the query inside the comparison function, so that df is fetched there. That works, but the query then runs again for every item of the iterator, and since it takes 40 to 50 seconds to execute, this adds a time overhead every time. Therefore I want to run it only once and reuse the result every time.
Upvotes: 0
Views: 179
Reputation: 168966
I am on Windows and this df object is in the main function
Then you're out of luck.
Windows has no fork(), so child processes do not inherit the parent's memory copy-on-write, and you can't share a Python variable transparently between multiple processes without copying occurring.
(Copy-on-write mmaps do exist, but to the best of my knowledge they can't be the backing memory for DataFrames.)
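If copying can't be avoided, one way to limit the cost is to load the data once per worker process instead of once per task, using the Pool initializer. The following is only a minimal sketch of that idea: `load_reference`, `init_worker`, and `_reference` are hypothetical names, the dictionary is a stand-in for the real query result, and the comparison logic is a placeholder.

```python
from multiprocessing import Pool

_reference = None  # per-process global, filled in by init_worker()


def load_reference():
    # Hypothetical stand-in for the slow query. In the question's case this
    # would run pd.read_sql_query(...) and return the DataFrame.
    return {1: "a", 2: "b", 3: "c"}


def init_worker():
    # Runs once in each worker process when the pool starts.
    global _reference
    _reference = load_reference()


def comparison(item):
    # Placeholder comparison: does the reference hold this value for the key?
    key, value = item
    return _reference.get(key) == value


if __name__ == "__main__":
    items = [(1, "a"), (2, "x"), (4, "d")]
    with Pool(2, initializer=init_worker) as pool:
        print(pool.map(comparison, items))
```

With Pool(2) the load runs twice in total, once in each worker, rather than once per item. Each worker still holds its own full copy, so memory use grows with the number of workers, but nothing large is pickled and sent along with the tasks.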
Upvotes: 2