Reputation: 3461
From the Python documentation:
By default, the pickle data format uses a relatively compact binary representation. If you need optimal size characteristics, you can efficiently compress pickled data.
I'm going to be serializing several gigabytes of data at the end of a process that runs for several hours, and I'd like the result to be as small as possible on disk. However, Python offers several different ways to compress data.
Is there one of these that's particularly efficient for pickled files? The data I'm pickling mostly consists of nested dictionaries and strings, so if there's a more efficient way to compress e.g. JSON, that would work too.
The time for compression and decompression isn't important, but the time this process takes to generate the data makes trial and error inconvenient.
Upvotes: 49
Views: 45432
Reputation: 185
In addition to the previous answers, there is also the compress_pickle module (documentation), which wraps the pickle module in combination with different compression protocols (e.g., gzip, lzma).
In my case, pickling objects of various types, compress_pickle with the gzip option consistently produced files about 10 times smaller than blosc, but was also three times slower.
Upvotes: 2
Reputation: 354
Just adding an alternative that easily provided me with the highest compression ratio and on top of that did it so fast I was sure I made a mistake somewhere (I didn't). The real bonus is that the decompression is also very fast, so any program that reads in lots of preprocessed data, for example, will benefit hugely from this. One potential caveat is that there is mention of "small arrays (<2GB)" somewhere here, but it looks like there are ways around that. Or, if you're lazy like me, breaking up your data instead is usually an option.
Some smart cookies came up with python-blosc. It's a "high performance compressor", according to their docs. I was led to it from an answer to this question.
Once installed via, e.g., pip install blosc or conda install python-blosc, you can compress pickled data pretty easily as follows:
import pickle

import blosc
import numpy as np

# Array dimensions must be integers; the float 1e7 raises a TypeError.
data = np.random.rand(3, 3, int(1e7))
pickled_data = pickle.dumps(data)  # returns data as a bytes object
compressed_pickle = blosc.compress(pickled_data)

with open("path/to/file/test.dat", "wb") as f:
    f.write(compressed_pickle)
And to read it:
with open("path/to/file/test.dat", "rb") as f:
    compressed_pickle = f.read()

depressed_pickle = blosc.decompress(compressed_pickle)
data = pickle.loads(depressed_pickle)  # turn bytes object back into data
I'm using Python 3.7, and without even looking at all the different compression options I got a compression ratio of about 12. Reading, decompressing, and loading the compressed pickle file took only a fraction of a second longer than loading the uncompressed pickle file.
I wrote this more as a reference for myself, but I hope someone else will find this useful.
Peace oot
Upvotes: 20
Reputation: 4170
I ran some tests using a pickled object, and lzma gave the best compression.
But your results can vary based on your data, so I'd recommend testing with some sample data of your own.
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 9/17/2019 10:05 PM 23869925 no_compression.pickle
-a---- 9/17/2019 10:06 PM 6050027 gzip_test.gz
-a---- 9/17/2019 10:06 PM 3083128 bz2_test.pbz2
-a---- 9/17/2019 10:07 PM 1295013 brotli_test.bt
-a---- 9/17/2019 10:06 PM 1077136 lzma_test.xz
Test file used (you'll need to pip install brotli or remove that algorithm):
import bz2
import gzip
import lzma
import pickle

import brotli


class SomeObject():
    a = 'some data'
    b = 123
    c = 'more data'

    def __init__(self, i):
        self.i = i


data = [SomeObject(i) for i in range(1, 1000000)]

with open('no_compression.pickle', 'wb') as f:
    pickle.dump(data, f)

with gzip.open("gzip_test.gz", "wb") as f:
    pickle.dump(data, f)

with bz2.BZ2File('bz2_test.pbz2', 'wb') as f:
    pickle.dump(data, f)

with lzma.open("lzma_test.xz", "wb") as f:
    pickle.dump(data, f)

# brotli has no file-like wrapper here, so compress the raw pickle bytes.
with open('no_compression.pickle', 'rb') as f:
    pdata = f.read()

with open('brotli_test.bt', 'wb') as b:
    b.write(brotli.compress(pdata))
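Since the question mentions that generating the real data takes hours, one option is to compare compressors in memory on a small, representative sample first. The following is only a rough sketch using the standard library; the helper name compare_compressors and the sample dictionary are made up for illustration, and the ranking you get will depend on your own data.

```python
import bz2
import gzip
import lzma
import pickle


def compare_compressors(obj):
    """Return {name: size in bytes} of one pickled object per compressor."""
    raw = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    candidates = {
        "none": lambda b: b,
        "gzip": lambda b: gzip.compress(b, compresslevel=9),
        "bz2": lambda b: bz2.compress(b, compresslevel=9),
        "lzma": lambda b: lzma.compress(b, preset=9),
    }
    return {name: len(fn(raw)) for name, fn in candidates.items()}


# Hypothetical sample: nested dictionaries and strings, as in the question.
sample = {
    f"key{i}": {"name": f"item {i}", "tags": ["a", "b", "c"]}
    for i in range(10_000)
}
sizes = compare_compressors(sample)

# Print the results from smallest to largest.
for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{name:>5}: {size:>10,d} bytes")
```

Running this on a slice of the real data should give a reasonable hint about which format to use for the full dump, without having to re-run the long job.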
Upvotes: 55
Reputation: 568
mgzip is a much faster solution. lzma is painfully slow, although it has about 25% better compression than mgzip.
import pickle

import mgzip

# pathname and data are your own output path and object.
with mgzip.open(pathname, 'wb') as f:
    pickle.dump(data, f)

For loading:

with mgzip.open(pathname, 'rb') as f:
    data = pickle.load(f)
Upvotes: 1
Reputation: 11671
I took the "efficiently compress pickled data" to mean that general-purpose compressors tend to work well. But Pickle is a protocol, not a format per se. It's possible to make pickle emit compressed bytestrings by implementing the __reduce__ method on your custom classes. Trying to compress those further wouldn't work well.
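To illustrate the __reduce__ idea, here is a minimal sketch. The class CompressedText and the helper _rebuild_text are hypothetical names invented for this example, and zlib is just one possible choice of compressor.

```python
import pickle
import zlib


def _rebuild_text(compressed):
    # Called by pickle at load time to reconstruct the object.
    return CompressedText(zlib.decompress(compressed).decode("utf-8"))


class CompressedText:
    """Hypothetical wrapper whose pickled form stores zlib-compressed bytes."""

    def __init__(self, text):
        self.text = text

    def __reduce__(self):
        # Tell pickle to store (callable, args); args holds the compressed payload.
        payload = zlib.compress(self.text.encode("utf-8"), 9)
        return (_rebuild_text, (payload,))


obj = CompressedText("nested dictionaries and strings " * 1000)
blob = pickle.dumps(obj)
restored = pickle.loads(blob)
print(len(pickle.dumps(obj.text)), "bytes as a plain str pickle")
print(len(blob), "bytes with __reduce__ compression")
```

The payload inside such a pickle is already compressed, which is why running another compressor over the resulting file would gain little.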
Of the standard library compressors, LZMA will tend to give you the best ratio on typical data streams, but it's also the slowest. You can probably do even better using ZPAQ (via pyzpaq, say), but that's even slower.
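If compression time really doesn't matter, a sketch along these lines asks the standard library's lzma module for its strongest preset. The function names dump_xz and load_xz are hypothetical, and whether the extreme preset helps noticeably on your data is something to check on a sample.

```python
import lzma
import pickle


def dump_xz(obj, path, preset=9 | lzma.PRESET_EXTREME):
    """Pickle obj to an .xz file. Higher presets are slower and use more memory."""
    with lzma.open(path, "wb", preset=preset) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_xz(path):
    """Load an object written by dump_xz."""
    with lzma.open(path, "rb") as f:
        return pickle.load(f)
```

Usage would be dump_xz(data, "result.pickle.xz") at the end of the long-running job and load_xz("result.pickle.xz") later; the file name here is only an example.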
Upvotes: 5