Arjun
Arjun

Reputation: 907

Store a Dask DataFrame as a pickle

I have a Dask DataFrame constructed as follows:

import dask.dataframe as dd

df = dd.read_csv('matrix.txt', header=None)
type(df) //dask.dataframe.core.DataFrame

Is there way to save this DataFrame as a pickle?

For example,

df.to_pickle('matrix.pkl')

Upvotes: 3

Views: 5337

Answers (2)

Ignacio Vergara Kausel
Ignacio Vergara Kausel

Reputation: 5986

From a quick inspection of the methods available in dask that is not directly possible. It's still possible to do as the other answer, but I fear that due to the eventual distributed nature of a dask dataframe it might be not straightforward.

Anyhow, if I were you, I'd go through another solution and use parquet as a storage. It offers you basically the same advantages of pickle, and more.

df.to_parquet('my_file.parquet')

Although, if your plan is to use pickle as a 'suspend' method to later on resume computation, saving to parquet would not really help.

My advise would be by far to use parquet. Look at this post where different technologies to store a general pandas dataframe are compared. You'll see that they don't even discuss pickle (which has some issues like it might be incompatible between two python versions). The article is slightly old, and now pandas/dask can directly work with parquet without the need of explicitly using pyarrow.

I guess that you're interested in reading time. There is always a tradeoff between file size and read time. Although in the article is shown that when you factor in multiple core operation you can get similar read performance with compressed parquet file (Parquet-snappy column)

enter image description here

Thus, I'll repeat myself. Go for parquet file and you'll future-proof yourself. Unless that your use case is very different than columnar/dataframe oriented one.

Upvotes: 8

Vivek Kalyanarangan
Vivek Kalyanarangan

Reputation: 9081

You can try pickling it like you would do with any other object - import pickle

with open('filename.pickle', 'wb') as handle:
    pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('filename.pickle', 'rb') as handle:
    b = pickle.load(handle)
print(a == b)

Further, please check this on safety of pickling dask dataframes and in what situations in might break

Upvotes: 5

Related Questions