I have about 700 Python source files ( .py ) of a few kilobytes of size (average file size is 12 kB, but there are many 1 kB files as well), and I'd like to create a compressed archive containing all of them. My requirements: The archive should be small. ( .zip files give me a compression ratio of 3.816, I need something smaller than that. A .rar file created with rar -s -m5 a gives me a compression ratio of 6.177, I'd prefer 7 or more.) The compression must be lossless, it must preserve the original file bit-by-bit. (So minification is out.) There must be a small library written in C which can list the archive and extract individual files. The decompression library must be fast, i.e. not much slower than zlib , preferably faster. If I want to extract a single file, I don't have to uncompress large, unrelated portions of the archive. (So compressed .tar files are out, and solid .rar files are out.) Since all .py files are small (only a few kilobytes in size), I don't need a streaming decompressor or seeking support within a file. If possible, decompression should be initialized from a context dictionary generated from the union of the .py files, to save more space. Which compression algorithm and C decompression library do you recommend? I know about the concept of code minification (e.g. removing comments and extra whitespace, renaming local variables to single letter), and I'll consider using this technique for some of my .py files, but in this question I'm not interested in it. (See a Python minifier here .) I know about the concept of bytecode compilation ( .pyc files), but in this question I'm not interested in it. (The reason I don't want to have bytecode in the archive is that bytecode is architecture- and version-dependent, so it's less portable. Also .pyc files tend to be a bit larger than minified .py files.) See my answers containing plan B and plan C. I'm still looking for plan A, which is smaller than ZIP (but it will be most probably larger than .tar.xz ), and it has smaller overhead than .tar.xz .

Reputation: 87401

Compression for Python source files

I have about 700 Python source files (.py) of a few kilobytes of size (average file size is 12 kB, but there are many 1 kB files as well), and I'd like to create a compressed archive containing all of them. My requirements:

The archive should be small. (.zip files give me a compression ratio of 3.816, I need something smaller than that. A .rar file created with rar -s -m5 a gives me a compression ratio of 6.177, I'd prefer 7 or more.)
The compression must be lossless, it must preserve the original file bit-by-bit. (So minification is out.)
There must be a small library written in C which can list the archive and extract individual files.
The decompression library must be fast, i.e. not much slower than zlib, preferably faster.
If I want to extract a single file, I don't have to uncompress large, unrelated portions of the archive. (So compressed .tar files are out, and solid .rar files are out.)
Since all .py files are small (only a few kilobytes in size), I don't need a streaming decompressor or seeking support within a file.
If possible, decompression should be initialized from a context dictionary generated from the union of the .py files, to save more space.

Which compression algorithm and C decompression library do you recommend?

I know about the concept of code minification (e.g. removing comments and extra whitespace, renaming local variables to single letter), and I'll consider using this technique for some of my .py files, but in this question I'm not interested in it. (See a Python minifier here.)

I know about the concept of bytecode compilation (.pyc files), but in this question I'm not interested in it. (The reason I don't want to have bytecode in the archive is that bytecode is architecture- and version-dependent, so it's less portable. Also .pyc files tend to be a bit larger than minified .py files.)

See my answers containing plan B and plan C. I'm still looking for plan A, which is smaller than ZIP (but it will be most probably larger than .tar.xz), and it has smaller overhead than .tar.xz.

Upvotes: 1

Answers (3)

pts

Reputation: 87401

FYI Plan B is just to use ZIP files. That's what I'm doing currently. Storing .py files in ZIP archives is very convenient for Python, because Python can load .py files from ZIP archives directly. But I need something smaller than a ZIP file, that's why I asked the question.

FYI Plan C is to use .tar.xz. Here is the analysis. The Linux kernel and Busybox 1.8.5 contain an .xz decompressor, which compiles to 18 kB of x86 code, which fulfills my requirement of a small decompression library. The .tar.xz with xz -6 -C crc32 gives a compression ratio of 6.648 over the .tar file. The overhead of the .xz decompressor of Busybox 1.8.5, compiled for x86 is 17840 bytes in code size (comparing the executable to the .tar.xz file). So this is plan C: when the executable starts, extract the whole archive into memory. (This takes about 0.35 second on my machine, the output is a 9MB memory block.) To read a file from the archive, use its in-memory uncompressed representation. This will be very fast. This backup plan is not a solution to my problem, because it involves a 0.35 second overhead at the beginning of execution, and it needs 9MB of extra memory.

Upvotes: 0

Daniel Roseman

Reputation: 600041

I know you've rejected .zip, but it might change your decision if you realise that Python is already capable of importing packages straight from zips, in the form of egg files. No extra code required, except for the setuptools configuration file.

Upvotes: 4

rid

Reputation: 63590

You should consider LZMA (also see the C SDK).

Upvotes: 2

Compression for Python source files

Answers (3)

Related Questions