Reputation: 11855
Suppose I have a 1 KB file called data.bin. If it's possible to construct a gzip of it, data.bin.gz, that is much larger, how do I do it?
How much larger could we theoretically get in the GZIP format?
Upvotes: 1
Views: 687
Reputation: 35998
Summary:
Take the gzip format (RFC1952 (metadata), RFC1951 (deflate format), additional notes for GNU gzip) and play with it as much as you like.
There are a whole bunch of places to exploit:
gzip -l takes the reported uncompressed size from the last chunk only (even if it's garbage) rather than adding up the values from all of them. So you can make it look like the archive is (absurdly) larger or smaller than the raw data.
These are the ones that are immediately apparent; you may be able to find yet other ways.
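For instance, the gzip -l behaviour above can be demonstrated with a short sketch (my own illustration, not from the answer), building a multi-member gzip file whose last member is tiny:

```python
import gzip

# Build a multi-member gzip: the concatenation of two members is a valid file.
big   = gzip.compress(b"A" * 100_000)   # first member: 100,000 bytes uncompressed
small = gzip.compress(b"hi")            # last member: 2 bytes uncompressed

with open("multi.gz", "wb") as f:
    f.write(big + small)

print(len(gzip.decompress(big + small)))  # 100002 on recent Pythons -- all members are restored
# `gzip -l multi.gz` would report the uncompressed size taken from the last
# member's trailer only (2 bytes here), per the answer above.
```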
The general layout of "deflate" format is (RFC1951):
A compressed data set consists of a series of blocks, corresponding to successive blocks of input data. The block sizes are arbitrary, except that non-compressible blocks are limited to 65,535 bytes.
<...>
Each block consists of two parts: a pair of Huffman code trees that describe the representation of the compressed data part, and a compressed data part. (The Huffman trees themselves are compressed using Huffman encoding.) The compressed data consists of a series of elements of two types: literal bytes (of strings that have not been detected as duplicated within the previous 32K input bytes), and pointers to duplicated strings, where a pointer is represented as a pair <length, backward distance>. The representation used in the "deflate" format limits distances to 32K bytes and lengths to 258 bytes, but does not limit the size of a block, except for uncompressible blocks, which are limited as noted above.
The 00 00 00 ff ff that Mark Adler suggests is essentially an empty, non-final block (RFC1951 section 3.2.3. for the 1st byte, 3.2.4. for the uncompressed block itself).
Btw, according to gzip overview at the official site and the source code, Mark is the author of the decompression part...
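A quick way to check that byte layout (a sketch of my own, not from the answer) is to hand zlib a raw deflate stream made of nothing but such blocks:

```python
import zlib

empty_nonfinal = b"\x00\x00\x00\xff\xff"   # BFINAL=0, BTYPE=00 (stored), LEN=0, NLEN=0xffff
final_empty    = b"\x01\x00\x00\xff\xff"   # the same block with BFINAL=1, terminating the stream

raw = empty_nonfinal * 1000 + final_empty  # a valid raw deflate stream
print(zlib.decompress(raw, wbits=-15))     # b'' -- 5005 bytes of deflate inflate to nothing
```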
Using non-empty uncompressed blocks (see prev. section for references), you can at most create one for each input byte, at a cost of 5 bytes of block header per 1 byte of data. The effect is thus limited to 6x.
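Here is a sketch of where the 6x comes from (my own code, using a minimal gzip header with no optional fields): every input byte goes into its own stored block, plus the 18 bytes of gzip header and trailer.

```python
import struct
import zlib
import gzip

def one_byte_stored_blocks(data: bytes) -> bytes:
    """Wrap each input byte in its own stored deflate block (~6x expansion)."""
    assert data, "need at least one byte so there is a final block"
    out = bytearray(b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03")  # minimal gzip header
    for i, b in enumerate(data):
        out.append(0x01 if i == len(data) - 1 else 0x00)  # BFINAL only on the last block, BTYPE=00
        out += struct.pack("<HH", 1, 0xFFFE)              # LEN=1, NLEN=~LEN
        out.append(b)
    out += struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)  # CRC32 + ISIZE trailer
    return bytes(out)

payload = bytes(1024)                       # a 1 KB data.bin stand-in
blob = one_byte_stored_blocks(payload)
assert gzip.decompress(blob) == payload
print(len(blob))                            # 6*1024 + 18 = 6162 bytes, i.e. roughly 6x
```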
In a nutshell: some inflation is achievable, but it's very hard and the achievable effect is limited. Don't waste your time on these tricks unless you have a very good reason.
Inside compressed blocks (section 3.2.5.), each chunk is [<encoded character (8-9 bits)> | <encoded chunk length (7-11 bits)><distance back to data (5-18 bits)>], with lengths starting at 3. A 7-9-bit code unambiguously resolves to a literal character or a specific range of lengths. Longer codes correspond to larger lengths/distances. No space/meaningless stuff is allowed between chunks.
So the maximum for raw byte chunks is 9/8 (1.125x), reached if all the raw bytes have values 144-255 (the literals that get 9-bit codes).
Playing with reference chunks isn't going to do any good for you: even a reference to a 3-byte sequence gives 25/24 (1.04x) at most.
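To see the 9/8 figure concretely, here is a sketch (my own, not from the answer) that hand-encodes a single fixed-Huffman block using only the 9-bit literal codes for byte values 144-255:

```python
import struct, zlib, gzip

class BitWriter:
    """Packs bits LSB-first into bytes, the way deflate does."""
    def __init__(self):
        self.out, self.buf, self.n = bytearray(), 0, 0
    def bits(self, value, count):               # numeric fields go LSB-first
        self.buf |= (value & ((1 << count) - 1)) << self.n
        self.n += count
        while self.n >= 8:
            self.out.append(self.buf & 0xFF)
            self.buf >>= 8
            self.n -= 8
    def code(self, value, count):               # Huffman codes go MSB-first
        for i in range(count - 1, -1, -1):
            self.bits((value >> i) & 1, 1)
    def flush(self):
        if self.n:
            self.out.append(self.buf & 0xFF)
        return bytes(self.out)

def fixed_huffman_gzip(data: bytes) -> bytes:
    """One fixed-Huffman block, every byte emitted as a literal (no matches)."""
    assert all(b >= 144 for b in data)          # literals 144..255 use 9-bit fixed codes
    w = BitWriter()
    w.bits(1, 1)                                # BFINAL = 1
    w.bits(1, 2)                                # BTYPE = 01 (fixed Huffman)
    for b in data:
        w.code(0b110010000 + (b - 144), 9)      # RFC1951 3.2.6: codes for literals 144..255
    w.code(0, 7)                                # end-of-block (symbol 256)
    header = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03"
    trailer = struct.pack("<II", zlib.crc32(data), len(data) & 0xFFFFFFFF)
    return header + w.flush() + trailer

data = bytes(range(144, 256)) * 9               # 1008 bytes, all >= 144
blob = fixed_huffman_gzip(data)
assert gzip.decompress(blob) == data
print((len(blob) - 18) / len(data))             # ~1.127, i.e. about 9/8
```

The slight excess over 1.125 is just the block header, the end-of-block code and the final byte padding.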
That's it for static Huffman tables. Looking through the docs on dynamic ones, they optimize the aforementioned encoding for the specific data. So they should allow getting the ratio for the given data closer to the achievable maximum, but that's it.
Upvotes: 2
Reputation: 112284
You can make it arbitrarily large. Take any gzip file and insert as many repetitions as you like of the five bytes 00 00 00 ff ff after the gzip header and before the deflate data.
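A minimal sketch of this (assuming the input has a plain 10-byte gzip header with FLG == 0, i.e. no extra field, file name or comment; files written by GNU gzip usually store the name, so a real tool would have to skip the optional header fields first):

```python
EMPTY_STORED_BLOCK = b"\x00\x00\x00\xff\xff"    # empty, non-final stored block

def pad_gzip(path_in: str, path_out: str, copies: int) -> None:
    with open(path_in, "rb") as f:
        raw = f.read()
    if raw[:3] != b"\x1f\x8b\x08" or raw[3] != 0:  # magic, CM=8 (deflate), FLG=0
        raise ValueError("expected a gzip member with FLG == 0")
    with open(path_out, "wb") as f:
        f.write(raw[:10] + EMPTY_STORED_BLOCK * copies + raw[10:])

# e.g. pad_gzip("data.bin.gz", "padded.gz", 1_000_000) grows the file by ~5 MB,
# yet gunzip / gzip.decompress() of the result still yields the original 1 KB.
```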
Upvotes: 2