I have a pandas dataframe (pd.DataFrame) with the following structure:
In [175]: df.dtypes.value_counts()
Out[175]:
int64 876
float64 206
object 76
bool 9
dtype: int64
In [176]: df.shape
Out[176]: (9764, 1167)
I've stored the data to disk in the following three ways:
In [170]: df.to_csv('df.csv')
In [171]: df.to_pickle('df_v1.pkl')
In [172]: import pickle
In [173]: with open('df_v2.pkl', 'wb') as handle:
.....: pickle.dump(df, handle)
The sizes of the files on disk are as follows:
df.csv: 26.4 MB
df_v1.pkl: 90.5 MB
df_v2.pkl: 340.4 MB
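(For reference, a minimal sketch of how the on-disk sizes can be checked, assuming the three files sit in the current working directory:)
import os
# print each file's size on disk in megabytes
for name in ['df.csv', 'df_v1.pkl', 'df_v2.pkl']:
    print('{}: {:.1f} MB'.format(name, os.path.getsize(name) / 1e6))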
The csv is understandably small - it carries no pandas overhead (that is, it doesn't have to store DataFrame dtypes, etc.).
What I don't understand is why the pickles from the two different pickling methods differ so much in size! Also, is one preferred over the other? And what about backwards compatibility?
Upvotes: 5
Views: 2489
Looking at the source code for to_pickle, pandas chooses the most efficient protocol available when it pickles the DataFrame. By default (under Python 2), pickle.dump uses protocol 0, the ASCII protocol, which is the least efficient in terms of file size. That default exists to ensure compatibility and to make recovery easier, since the ASCII protocol is human-readable.
The equivalent for your code would be changing the pickle.dump line to:
pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)
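A self-contained sketch of what that looks like (the filename df_v3.pkl here is just illustrative):
import pickle
# the default protocol writes ASCII; HIGHEST_PROTOCOL selects the most compact binary format available
with open('df_v3.pkl', 'wb') as handle:
    pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)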
I'd just use the to_pickle method, as it results in cleaner code. There shouldn't be any backwards-compatibility issues unless you need compatibility with a very old version of Python; the more efficient pickle protocols were introduced in Python 2.3.
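For example, a round trip through to_pickle / read_pickle (the filename is illustrative):
import pandas as pd
# to_pickle chooses an efficient protocol for you
df.to_pickle('df_v1.pkl')
# read_pickle restores the DataFrame with its dtypes intact
df_restored = pd.read_pickle('df_v1.pkl')
assert df_restored.equals(df)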
Another thing to note is that pandas uses cPickle for improved performance, not pickle itself. This shouldn't affect the file size, but it's another potential difference between the two methods. In general, you should use cPickle whenever possible, and fall back to pickle only when what you want to do isn't supported by cPickle.
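If you do pickle by hand, the usual Python 2 idiom is to try cPickle first and fall back to pickle (a sketch):
try:
    import cPickle as pickle  # C implementation, much faster
except ImportError:
    import pickle  # pure-Python module (Python 3 folds the C implementation into this)
with open('df_v2.pkl', 'wb') as handle:
    pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)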
Upvotes: 4