I have a pandas dataframe (pd.DataFrame) with the following structure:
In [175]: df.dtypes.value_counts()
Out[175]:
int64 876
float64 206
object 76
bool 9
dtype: int64
In [176]: df.shape
Out[176]: (9764, 1167)
I've stored the data to disk in the following three ways:
In [170]: df.to_csv('df.csv')
In [171]: df.to_pickle('df_v1.pkl')
In [172]: import pickle
In [173]: with open('df_v2.pkl', 'wb') as handle:
.....: pickle.dump(df, handle)
The sizes of the files on disk are as follows:
df.csv: 26.4 MB
df_v1.pkl: 90.5 MB
df_v2.pkl: 340.4 MB
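(For reference, a minimal sketch of how the on-disk sizes can be checked, assuming the three files sit in the current working directory:)
import os
# print each file's size on disk in megabytes
for name in ['df.csv', 'df_v1.pkl', 'df_v2.pkl']:
    print('{}: {:.1f} MB'.format(name, os.path.getsize(name) / 1e6))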
The csv is understandably small - it carries no pandas overhead (that is, it doesn't have to store DataFrame dtypes, etc.).
What I don't understand is why the pickles from the two different pickling methods differ so much in size! Also, is one preferred over the other? And what about backwards compatibility?
Upvotes: 5
Views: 2489
Looking at the source code for to_pickle, pandas chooses the most efficient protocol available when it pickles the DataFrame. By default (under Python 2), pickle.dump uses protocol 0, the ASCII protocol, which is the least efficient in terms of file size. That default exists to ensure compatibility and to make recovery easier, since the ASCII protocol is human-readable.
The equivalent for your code would be changing the pickle.dump line to:
pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)
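A self-contained sketch of what that looks like (the filename df_v3.pkl here is just illustrative):
import pickle
# the default protocol writes ASCII; HIGHEST_PROTOCOL selects the most compact binary format available
with open('df_v3.pkl', 'wb') as handle:
    pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)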
I'd just use the to_pickle method, as it results in cleaner code. There shouldn't be any backwards-compatibility issues unless you need compatibility with a very old version of Python; the more efficient pickle protocols were introduced in Python 2.3.
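For example, a round trip through to_pickle / read_pickle (the filename is illustrative):
import pandas as pd
# to_pickle chooses an efficient protocol for you
df.to_pickle('df_v1.pkl')
# read_pickle restores the DataFrame with its dtypes intact
df_restored = pd.read_pickle('df_v1.pkl')
assert df_restored.equals(df)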
Another thing to note is that pandas uses cPickle for improved performance, not pickle itself. This shouldn't affect the file size, but it's another potential difference between the two methods. In general, you should use cPickle whenever possible, and fall back to pickle only when what you want to do isn't supported by cPickle.
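If you do pickle by hand, the usual Python 2 idiom is to try cPickle first and fall back to pickle (a sketch):
try:
    import cPickle as pickle  # C implementation, much faster
except ImportError:
    import pickle  # pure-Python module (Python 3 folds the C implementation into this)
with open('df_v2.pkl', 'wb') as handle:
    pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)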
Upvotes: 4