jb4earth

Reputation: 188

Pandas dataframes too large to append to dask dataframe?

I'm not sure what I'm missing here; I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all into the same dataframe, but I keep running into memory issues. I've already increased the memory buffer in jupyter. It seems I may be missing something in how I create the dask dataframe, as it crashes my notebook after completely filling my RAM. Any pointers?

Below is the basic process I used:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'),npartitions = 8)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))
ddf.to_parquet('alldata.parquet', engine='pyarrow')

Upvotes: 3

Views: 717

Answers (1)

rpanai

Reputation: 13447

Have you considered first converting the pickle files to parquet and then loading them with dask? I assume all your data sits in a folder called data and you want to move it to processed.

import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # optionally adjust dtypes here before writing
    
    df.to_parquet(fn_out, index=False)

fldr_in = 'data'
fldr_out = 'processed'
os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = [fn for fn in os.listdir(fldr_in) if fn.endswith(".pickle")]
fns = [os.path.join(fldr_in, fn) for fn in fns]
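
As a side note, the glob variant hinted at in the comment could look like this (a small sketch; the *.pickle pattern is an assumption about how the files are named):

from glob import glob

# collect the full paths of all pickle files in the input folder
fns = sorted(glob(os.path.join(fldr_in, "*.pickle")))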

If you know that only one file at a time fits in memory, you should use a plain loop

for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)

If several files fit in memory at once, you can parallelize the conversion with delayed

from dask import delayed, compute

# this builds the lazy task graph; nothing runs yet
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]
# now the conversions actually execute
out = compute(out)

Now you can use dask to do your analysis.
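
For example (a minimal sketch, assuming the converted files ended up in the processed folder created above), loading them lazily looks like this:

import dask.dataframe as dd

# read every converted parquet file as one lazy dask dataframe
ddf = dd.read_parquet('processed/*.parquet', engine='pyarrow')
print(ddf.npartitions)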

Upvotes: 1
