Reputation: 188
I'm not sure what I'm missing here; I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all into the same dataframe but keep running into memory issues. I've already increased the memory buffer in Jupyter. It seems I may be missing something in creating the dask dataframe, as it appears to crash my notebook after completely filling my RAM (maybe). Any pointers?
Below is the basic process I used:
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'), npartitions=8)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))
ddf.to_parquet('alldata.parquet', engine='pyarrow')
I've experimented with different values for npartitions, but no number has allowed the code to finish running.

Upvotes: 3
Views: 717
Reputation: 13447
Have you considered first converting the pickle files to parquet and then loading them with dask? I assume that all your data is in a folder called data and that you want to move the converted files to processed.
import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # optionally adjust dtypes here
    df.to_parquet(fn_out, index=False)

fldr_in = 'data'
fldr_out = 'processed'

os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = os.listdir(fldr_in)
fns = [os.path.join(fldr_in, fn) for fn in fns]
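As mentioned in the comment, glob is an alternative for building the path list; a minimal sketch, assuming all the files use a .pickle extension:

import glob

# grab the full path of every pickle file in fldr_in
fns = glob.glob(os.path.join(fldr_in, "*.pickle"))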
If you know that no more than one file fits in memory at a time, you should use a plain loop:
for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)
If you know that several files fit in memory at once, you can parallelize the conversion with delayed:
from dask import delayed, compute

# this is lazy: nothing is read or written yet
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]

# now you are actually converting
out = compute(out)
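If you want to watch the conversion while it runs, a small optional sketch using dask's diagnostics ProgressBar around the same compute call:

from dask.diagnostics import ProgressBar

with ProgressBar():
    out = compute(out)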
Now you can use dask to do your analysis.
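For example, a minimal sketch of loading the converted files back as a single lazy dask dataframe and writing one parquet dataset (the alldata.parquet name is taken from the question):

import dask.dataframe as dd

# read all converted files as one lazy dask dataframe
ddf = dd.read_parquet(os.path.join(fldr_out, "*.parquet"))

# e.g. write everything out as a single partitioned parquet dataset
ddf.to_parquet("alldata.parquet", engine="pyarrow")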
Upvotes: 1