NSZ

Reputation: 235

Pandas multiprocessing on very large dataframe

I'm trying to use the multiprocessing package to apply a function to a very large pandas dataframe. However, I ran into the following error:

OverflowError: cannot serialize a bytes objects larger than 4GiB

After applying the solution from this question and using protocol 4 for pickling, I instead ran into the following error, which that solution itself also mentions:

error: 'i' format requires -2147483648 <= number <= 2147483647

The answer to this question then suggests making the dataframe a global variable (sketched below). But ideally I would like the dataframe to remain an input to the function, without the multiprocessing library copying and pickling it multiple times in the background.
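For reference, a minimal sketch of that global-variable pattern, assuming the default "fork" start method on Linux so the workers inherit the dataframe instead of pickling it (the small placeholder dataframe stands in for the real one):

import multiprocessing as mp
import time
import numpy as np
import pandas as pd

# Created at module level before the Pool exists; with the "fork" start method
# the worker processes inherit df, so it is never pickled and sent per task.
# (A small placeholder here; in practice this would be the large dataframe.)
df = pd.DataFrame(np.random.rand(1000, 10))

def slow_function(some_parameter):
    # reads the module-level df instead of receiving it as an argument
    time.sleep(1)
    return some_parameter

if __name__ == '__main__':
    with mp.Pool(20) as pool:
        results = pool.map(slow_function, range(100))

With fork, the children share the parent's memory pages copy-on-write, so the 4 GiB pickle limit never comes into play.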

Is there some other way I can design the code to not run into the issue?

I was able to replicate the problem with this example:

import multiprocessing as mp
import pandas as pd
import numpy as np
import time
import functools

# Build a large random dataframe; the shape here is only illustrative and is
# chosen so that the pickled dataframe exceeds the 4 GiB limit mentioned above
# (6,000,000 rows x 100 float64 columns is roughly 4.8 GB).
df = pd.DataFrame(np.random.rand(6_000_000, 100))

print('Total memory usage for the dataframe: {} GB'.format(df.memory_usage().sum() / 1e9))

def slow_function(some_parameter, df):
    time.sleep(1)
    return some_parameter

parameters = list(range(100))

with mp.Pool(20) as pool:
    # functools.partial binds df, so every task pickles the entire dataframe
    # when it is sent to a worker, which is what triggers the errors above.
    function = functools.partial(slow_function, df=df)

    results = pool.map(function, parameters)

Upvotes: 1

Views: 623

Answers (1)

user8560167

Reputation:

Try Dask. Its dataframe mirrors the pandas API but splits the data into partitions and parallelises work on them:

import dask.dataframe as dd

df = dd.read_csv('data.csv')

Docs: https://docs.dask.org/en/latest/dataframe-api.html
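A hypothetical sketch of how the same workload might look on a dask.distributed local cluster instead of multiprocessing.Pool; client.scatter ships the dataframe to the workers once and the resulting future is passed to each task, rather than re-pickling the data per call (the placeholder dataframe and slow_function mirror the ones in the question):

import time
import numpy as np
import pandas as pd
from dask.distributed import Client

def slow_function(some_parameter, df):
    time.sleep(1)
    return some_parameter

if __name__ == '__main__':
    client = Client()  # local cluster of worker processes

    # placeholder standing in for the very large dataframe
    df = pd.DataFrame(np.random.rand(1000, 10))

    # send the dataframe to the workers once and reuse the future in every task
    df_future = client.scatter(df, broadcast=True)

    futures = client.map(slow_function, range(100), df=df_future)
    results = client.gather(futures)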

Upvotes: 1
