jb4earth

Reputation: 188

Pandas dataframes too large to append to dask dataframe?

I'm not sure what I'm missing here; I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all into the same dataframe, but I keep running into memory issues. I've already increased the memory buffer in jupyter. It seems I may be missing something in how I create the dask dataframe, as it crashes my notebook after completely filling my RAM. Any pointers?

Below is the basic process I used:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'),npartitions = 8)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))
ddf.to_parquet('alldata.parquet', engine='pyarrow')

Upvotes: 3

Views: 717

Answers (1)

rpanai

Reputation: 13447

Have you considered first converting the pickle files to parquet and then loading them with dask? I assume all your data sits in a folder called data and you want to move it to processed.

import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # optionally adjust dtypes here before writing
    
    df.to_parquet(fn_out, index=False)

fldr_in = 'data'
fldr_out = 'processed'
os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = [fn for fn in os.listdir(fldr_in) if fn.endswith(".pickle")]
fns = [os.path.join(fldr_in, fn) for fn in fns]
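
As a side note, the glob variant hinted at in the comment could look like this (a small sketch; the *.pickle pattern is an assumption about how the files are named):

from glob import glob

# collect the full paths of all pickle files in the input folder
fns = sorted(glob(os.path.join(fldr_in, "*.pickle")))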

If you know that only one file at a time fits in memory, you should use a plain loop

for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)

If several files fit in memory at once, you can parallelize the conversion with delayed

from dask import delayed, compute

# this builds the lazy task graph; nothing runs yet
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]
# now the conversions actually execute
out = compute(out)

Now you can use dask to do your analysis.
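
For example (a minimal sketch, assuming the converted files ended up in the processed folder created above), loading them lazily looks like this:

import dask.dataframe as dd

# read every converted parquet file as one lazy dask dataframe
ddf = dd.read_parquet('processed/*.parquet', engine='pyarrow')
print(ddf.npartitions)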

Upvotes: 1
