Reputation: 235
I'm trying to use the multiprocessing package to compute a function on a very large Pandas dataframe. However, I ran into the following error:
OverflowError: cannot serialize a bytes object larger than 4GiB
After applying the solution from this question and using protocol 4 for pickling, I ran into the following error instead, which the solution itself also quotes:
error: 'i' format requires -2147483648 <= number <= 2147483647
The answer to this question suggests using the dataframe as a global variable (roughly the approach sketched after the example below). Ideally, though, I would like the dataframe to remain an input of the function, without the multiprocessing library copying and pickling it multiple times in the background.
Is there some other way I can design the code to not run into the issue?
I was able to replicate the problem with this example:
import multiprocessing as mp
import pandas as pd
import numpy as np
import time
import functools

# Build a dataframe large enough to trigger the error (the shape here is
# illustrative; anything whose pickle exceeds 4 GiB reproduces it)
df = pd.DataFrame(np.random.rand(10_000_000, 60))
print('Total memory usage for the dataframe: {} GB'.format(df.memory_usage().sum() / 1e9))

def slow_function(some_parameter, df):
    time.sleep(1)
    return some_parameter

parameters = list(range(100))
with mp.Pool(20) as pool:
    function = functools.partial(slow_function, df=df)
    results = pool.map(function, parameters)
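For reference, the global-variable approach from the linked answer would look roughly like this (my sketch; it assumes a Unix fork start method, so the workers inherit the module-level dataframe instead of receiving a pickled copy):

import multiprocessing as mp
import pandas as pd
import numpy as np
import time

# module-level dataframe, created before the pool is forked
df = pd.DataFrame(np.random.rand(10_000_000, 60))

def slow_function(some_parameter):
    # reads the inherited global df; nothing is pickled per task
    time.sleep(1)
    return some_parameter

if __name__ == '__main__':
    with mp.get_context('fork').Pool(20) as pool:
        results = pool.map(slow_function, list(range(100)))

This avoids the serialization entirely, but it is exactly the design I would like to avoid, since the dataframe is no longer an explicit argument of the function.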
Upvotes: 1
Views: 623
Try Dask. A Dask dataframe is split into partitions, so work can be parallelized across processes without pickling one huge object:
import dask.dataframe as dd
df = dd.read_csv('data.csv')
Docs: https://docs.dask.org/en/latest/dataframe-api.html
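Applied to the question's setup, a rough sketch could look like this (the partition count and the use of map_partitions are my assumptions, not something the answer or the docs prescribe for this exact case):

import dask.dataframe as dd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(1_000_000, 10))

# split the in-memory dataframe into partitions; each partition is a small
# pandas dataframe, so no single pickle ever exceeds the 4 GiB limit
ddf = dd.from_pandas(df, npartitions=20)

# run the work per partition, in parallel across processes
result = ddf.map_partitions(lambda part: part.sum()).compute(scheduler='processes')

Because each partition is sent to a worker separately, the oversized-pickle problem from the question does not come up.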
Upvotes: 1