Reputation: 907
I have a Dask DataFrame constructed as follows:
import dask.dataframe as dd
df = dd.read_csv('matrix.txt', header=None)
type(df) //dask.dataframe.core.DataFrame
Is there way to save this DataFrame as a pickle?
For example,
df.to_pickle('matrix.pkl')
Upvotes: 3
Views: 5337
Reputation: 5986
From a quick inspection of the methods available in dask
that is not directly possible. It's still possible to do as the other answer, but I fear that due to the eventual distributed nature of a dask dataframe it might be not straightforward.
Anyhow, if I were you, I'd go through another solution and use parquet as a storage. It offers you basically the same advantages of pickle, and more.
df.to_parquet('my_file.parquet')
Although, if your plan is to use pickle as a 'suspend' method to later on resume computation, saving to parquet would not really help.
My advise would be by far to use parquet. Look at this post where different technologies to store a general pandas dataframe are compared. You'll see that they don't even discuss pickle (which has some issues like it might be incompatible between two python versions). The article is slightly old, and now pandas/dask can directly work with parquet without the need of explicitly using pyarrow
.
I guess that you're interested in reading time. There is always a tradeoff between file size and read time. Although in the article is shown that when you factor in multiple core operation you can get similar read performance with compressed parquet file (Parquet-snappy column)
Thus, I'll repeat myself. Go for parquet
file and you'll future-proof yourself. Unless that your use case is very different than columnar/dataframe oriented one.
Upvotes: 8
Reputation: 9081
You can try pickling it like you would do with any other object - import pickle
with open('filename.pickle', 'wb') as handle:
pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)
with open('filename.pickle', 'rb') as handle:
b = pickle.load(handle)
print(a == b)
Further, please check this on safety of pickling dask dataframes and in what situations in might break
Upvotes: 5