Reputation: 2161
I exported a DataFrame on OS X to pickle using to_pickle.
Loading it back on OS X (using read_pickle) returns the same DataFrame as expected, but loading it on a Linux system (Debian) using the same function returns different content.
From several posts it seems that pickle is guaranteed to be cross-platform when using binary mode (see: Is pickle file of python cross-platform?), but to_pickle and read_pickle don't accept any mode argument, and I couldn't tell from their documentation whether they are binary by default.
How can I know if they are?
How can I make sure that my pickle files will be identical across platforms?
Notes:
This is part of the .pickle file created using to_pickle:
945d 948c 055f 6461 7461 948c 1570 616e
6461 732e 636f 7265 2e69 6e74 6572 6e61
6c73 948c 0c42 6c6f 636b 4d61 6e61 6765
7294 9394 297d 9492 9428 5d94 288c 1370
Exporting it with a b prefix on the path (df.to_pickle(b'pickle_folder/df.pickle') as opposed to df.to_pickle('pickle_folder/df.pickle')) doesn't change its content.
The Python versions on both systems are identical (3.4.4).
EDIT
From their source code it seems that they're using the highest protocol and binary reading/writing. That answers my first question. I'm still looking for the reason why the files differ between platforms.
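One way to confirm this from the file itself, rather than from the source code, is sketched below using only the standard library. It assumes the file is an uncompressed pickle written with protocol 2 or later, in which case the first byte is the PROTO opcode (0x80) and the second is the protocol number. The helper names pickle_protocol and file_digest are made up for illustration; comparing the digest on both machines shows whether the file itself changed in transit or only the loaded result differs.

```python
import hashlib


def pickle_protocol(path):
    """Return the protocol number stored in the pickle header, or None."""
    with open(path, "rb") as fh:  # pickles must be read in binary mode
        header = fh.read(2)
    # Protocol 2 and later start with the PROTO opcode (0x80)
    # followed by one byte holding the protocol number.
    if len(header) == 2 and header[0] == 0x80:
        return header[1]
    return None  # protocol 0/1, or not a plain pickle file


def file_digest(path):
    """Return the SHA-256 hex digest of the file's raw bytes."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()
```

Running file_digest('pickle_folder/df.pickle') on OS X and on Debian and comparing the two strings would tell whether the bytes on disk are identical on both systems.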
Upvotes: 1
Views: 603
Reputation: 210882
I can't directly answer your question:
why they are different between platforms?
But as a workaround you can use the standard HDF5 format, which works on all platforms and has some nice features:
Conditional reads using the where='where clause' argument (the columns used in the clause must be indexed; check the data_columns argument). So you can keep a huge amount of data in HDF5 files and process it in chunks, efficiently reading (using indexes) only the chunks you need into memory. I.e. you don't need to read all the data from disk in order to filter it.
Compression (for example blosc).
Storing and reading to/from HDF5 files can be very fast, depending on the dtypes used. NOTE: working with strings (dtype: object) can be much slower compared to the Pickle format.
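A minimal sketch of these HDF5 features, assuming pandas with PyTables installed. The column names, the key "df" and the temporary path are made up for illustration; format="table" is what makes where queries and chunked reads possible.

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({"id": range(10), "value": [i * 0.5 for i in range(10)]})

path = os.path.join(tempfile.mkdtemp(), "df.h5")

# Write in "table" format so the file can be queried; "id" is declared
# as a data column so it is indexed and usable in a where clause.
df.to_hdf(
    path,
    key="df",
    format="table",
    data_columns=["id"],
    complevel=5,
    complib="blosc",
)

# Read back only the rows matching the condition.
subset = pd.read_hdf(path, key="df", where="id > 6")

# Process the stored table in chunks instead of loading it all at once.
total_rows = 0
for chunk in pd.read_hdf(path, key="df", chunksize=4):
    total_rows += len(chunk)
```

The same file should be readable with pd.read_hdf on the other platform, since HDF5 is a platform-independent format.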
Another standard option is to use a central database, which should be available on all platforms and lets you (pre-)filter and sort your data on the DB server side.
Upvotes: 1