Reputation: 533
Big file compression with python gives a very nice example of how to use e.g. bz2 to compress a very large set of files (or a single big file) purely in Python.
pigz says you can do better by exploiting parallel compression. To my knowledge (and Google searching) I have so far not found a Python equivalent that does this in pure Python code.
Is there a parallel Python implementation of pigz, or an equivalent?
Upvotes: 16
Views: 11090
Reputation: 3622
mgzip is able to achieve this:
It uses a block-indexed GZIP file format to enable compression and decompression in parallel. The implementation uses the 'FEXTRA' field, defined in the official GZIP file format specification version 4.3, to record the index of compressed members, so it is fully compatible with normal GZIP implementations.
import mgzip

num_cpus = 0  # will use all available CPUs
with open('original_file.txt', 'rb') as original, mgzip.open(
    'gzipped_file.txt.gz', 'wb', thread=num_cpus, blocksize=2 * 10 ** 8
) as fw:
    fw.write(original.read())
I was able to speed up compression from 45 minutes to 5 minutes on a 72-CPU server.
Upvotes: 8
Reputation: 112339
You can use the flush() operation with Z_SYNC_FLUSH to complete the last deflate block and end it on a byte boundary. You can concatenate those to make a valid deflate stream, so long as the last one you concatenate is flushed with Z_FINISH (which is the default for flush()).

You would also want to compute the CRC-32 in parallel (whether for zip or gzip -- I think you really mean parallel gzip compression). Python does not provide an interface to zlib's crc32_combine() function. However, you can copy the code from zlib and convert it to Python. It will be fast enough that way, since it doesn't need to be run often. You can also pre-build the tables you need to make it faster, or even pre-build a matrix for a fixed block length.
Upvotes: 2
Reputation: 155363
I don't know of a pigz interface for Python off-hand, but it might not be that hard to write if you really need it. Python's zlib module allows compressing arbitrary chunks of bytes, and the pigz man page already describes the system for parallelizing the compression and the output format.

If you really need parallel compression, it should be possible to implement a pigz equivalent using zlib to compress chunks wrapped in multiprocessing.dummy.Pool.imap (multiprocessing.dummy is the thread-backed version of the multiprocessing API, so you wouldn't incur massive IPC costs sending chunks to and from the workers) to parallelize the compression. Since zlib is one of the few built-in modules that releases the GIL during CPU-bound work, you might actually gain a benefit from thread-based parallelism.

Note that in practice, when the compression level isn't turned up that high, I/O is often of similar (within an order of magnitude or so) cost to the actual zlib compression; if your data source isn't able to feed the threads faster than they compress, you won't gain much from parallelizing.
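A rough sketch of what that could look like, assuming thread-based chunk compression via multiprocessing.dummy.Pool.imap and keeping the CRC-32 serial for simplicity (the block size, function names, and gzip header constants below are my own choices for illustration, not anything from pigz):

import struct
import zlib
from multiprocessing.dummy import Pool  # thread pool; zlib releases the GIL

BLOCK = 8 * 1024 * 1024  # arbitrary block size for this sketch

def _deflate_block(data):
    # Independent raw-deflate run per block, ended on a byte boundary so the
    # outputs can be concatenated into one deflate stream.
    c = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)
    return c.compress(data) + c.flush(zlib.Z_SYNC_FLUSH), data

def parallel_gzip(src_path, dst_path, threads=4):
    def blocks(f):
        while True:
            b = f.read(BLOCK)
            if not b:
                return
            yield b

    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst, \
            Pool(threads) as pool:
        # Minimal gzip header: magic, deflate, no flags, mtime 0, OS unknown.
        dst.write(b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff')
        crc, size = 0, 0
        for compressed, raw in pool.imap(_deflate_block, blocks(src)):
            dst.write(compressed)        # imap yields results in input order
            crc = zlib.crc32(raw, crc)   # CRC kept serial; cheap next to deflate
            size += len(raw)
        # An empty Z_FINISH block terminates the deflate stream, then the gzip
        # trailer: CRC-32 and uncompressed size, both modulo 2**32.
        dst.write(zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS).flush(zlib.Z_FINISH))
        dst.write(struct.pack('<II', crc & 0xffffffff, size & 0xffffffff))

The output should be readable by ordinary gzip/gunzip; the answer above covers how the CRC-32 could also be parallelized with a crc32_combine() port, which this sketch skips.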
Upvotes: 6