Reputation: 690
I have a pandas dataframe with 7 million records. I am trying to create a dask dataframe from it, but I keep running into memory errors.
Code used:
import dask.dataframe as dd

dd_test = dd.from_pandas(df_lookup_table, npartitions=3)
Error message:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Users\user\venv\lib\site-packages\dask\dataframe\io\io.py", line 181, in from_pandas
    name = name or ('from_pandas-' + tokenize(data, chunksize))
  File "C:\Users\user\venv\lib\site-packages\dask\base.py", line 600, in tokenize
    return md5(str(tuple(map(normalize_token, args))).encode()).hexdigest()
  File "C:\Users\user\venv\lib\site-packages\dask\utils.py", line 413, in __call__
    return meth(arg, *args, **kwargs)
  File "C:\Users\user\venv\lib\site-packages\dask\base.py", line 710, in normalize_dataframe
    return list(map(normalize_token, data))
  File "C:\Users\user\venv\lib\site-packages\dask\utils.py", line 413, in __call__
    return meth(arg, *args, **kwargs)
  File "C:\Users\user\venv\lib\site-packages\dask\base.py", line 734, in normalize_array
    x.flat]))
MemoryError
This works with a smaller pandas dataframe. How can I create a dask dataframe from this one?
Upvotes: 1
Views: 901
Reputation: 28684
The point of dask is to process data that doesn't fit into memory. In this case, you are loading the whole dataset into memory first and only then handing it to dask. Instead, you should load the data with dask directly. For example, if you used pandas.read_csv, switch to dask.dataframe.read_csv.
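As a minimal sketch, assuming the lookup table originally came from a CSV file (the filename lookup_table.csv is hypothetical):

import dask.dataframe as dd

# Read the CSV lazily in ~64 MB partitions instead of materialising
# a pandas dataframe first; the file never has to fit in memory at once.
ddf = dd.read_csv('lookup_table.csv', blocksize='64MB')

# head() pulls only the first partition, so it is a cheap sanity check.
print(ddf.head())

The blocksize argument controls how large each partition is; smaller blocks reduce peak memory per task at the cost of more tasks.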
Upvotes: 1