Reputation: 161
By the pigeonhole principle, every lossless compression algorithm can be "defeated", i.e. for some inputs it produces output that is longer than the input. Is it possible to explicitly construct a file which, when fed to e.g. gzip or another lossless compression program, will lead to (much) larger output? (Or, better still, a file which inflates ad infinitum upon subsequent compressions?)
Upvotes: 6
Views: 1534
Reputation: 2026
Well, I'd assume eventually it'll max out since the bit patterns will repeat, but I just did:
touch file
gzip file -c > file.1
...
gzip file.9 -c > file.10
And got:
0 bytes: file
25 bytes: file.1
45 bytes: file.2
73 bytes: file.3
103 bytes: file.4
122 bytes: file.5
152 bytes: file.6
175 bytes: file.7
205 bytes: file.8
232 bytes: file.9
262 bytes: file.10
Here are 24,380 files graphically (this is really surprising to me, actually):
(Plot of the resulting file sizes: http://research.engineering.wustl.edu/~schultzm/images/filesize.png)
I was not expecting that kind of growth; I would have expected roughly linear growth, since each pass should just wrap the existing data in a header plus a dictionary of patterns. I intended to run through 1,000,000 files, but my system ran out of disk space well before that.
If you want to reproduce, here is the bash script to generate the files:
#!/bin/bash
touch file.0
for ((i=0; i < 20000; i++)); do
    gzip file.$i -c > file.$(($i+1))
done
wc -c file.* | awk '{print $2 "\t" $1}' | sed 's/file.//' | sort -n > filesizes.txt
The resulting filesizes.txt is a tab-delimited, sorted file for your favorite graphing utility. (You'll have to manually remove the "total" line that wc appends, or script it away.)
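If you'd rather script it away, one option (a small variation on the pipeline above; the grep -v simply drops the summary line that wc prints at the end) is:
wc -c file.* | grep -v ' total$' | awk '{print $2 "\t" $1}' | sed 's/file.//' | sort -n > filesizes.txt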
Upvotes: 8
Reputation: 655209
All these compression algorithms are looking for redundant data. If your file has little or no redundancy in it (like a sequence of abac…az, bcbd…bz, cdce…cz, etc.), it is very likely that the "deflated" output is rather an inflation.
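A quick way to see this (a rough sketch; /dev/urandom stands in for data with essentially no redundancy, a highly repetitive file is included for contrast, and exact byte counts will vary):
head -c 100000 /dev/urandom > random.bin    # ~incompressible input
gzip -c random.bin | wc -c                  # usually a bit larger than 100000
yes abcdefgh | head -c 100000 > repetitive.txt
gzip -c repetitive.txt | wc -c              # highly redundant input shrinks dramatically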
Upvotes: 0
Reputation: 53310
Random data, or data encrypted with a good cipher, would probably be best.
But any good packer should only add constant overhead once it decides that it can't compress the data (@Frank). For a fixed overhead, an empty file or a single character will give the greatest percentage overhead.
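To get a feel for that overhead (a hedged sketch: deflate can fall back to "stored" blocks for incompressible input, so the extra bytes stay tiny relative to the file size; exact numbers depend on the gzip version):
for n in 1000 100000 10000000; do
    head -c $n /dev/urandom > rand.$n
    echo "$n bytes -> $(gzip -c rand.$n | wc -c) bytes after gzip"
done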
For packers that include the filename (e.g. rar, zip, tar), you could of course just make the filename really long :-)
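For example, the zip format stores the filename both in the local file header and in the central directory, so a long name alone bloats the archive (a sketch using the Info-ZIP zip tool):
name=$(printf 'x%.0s' {1..200})    # a 200-character filename
: > "$name"                        # zero-byte file
zip -q longname.zip "$name"
wc -c < longname.zip               # several hundred bytes to archive zero bytes of data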
Upvotes: 3
Reputation: 61737
A text file with 1 byte in it (for example a single character like 'A') is stored in 1 byte on disk, but WinRAR rars it to 94 bytes and zip takes it to 141 bytes.
I know it's a sort of cheat answer, but it works. I think it's the biggest % difference between original size and 'compressed' size you are going to see.
Take a look at the formats used for zipping; they are reasonably simple, and the most basic way to make the 'compressed' file larger than the original is to avoid any repeating data.
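If you want to check the numbers on your own machine (they vary with the archiver and its version, so the 94/141 figures above aren't universal):
printf A > one.txt           # a 1-byte file
zip -q one.zip one.txt
wc -c one.txt one.zip        # compare the original and the "compressed" size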
Upvotes: 0
Reputation: 4416
Try to gzip the file that results from the following command:
echo a > file.txt
Compressing this 2-byte file results in a 31-byte gzipped file!
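You can verify it yourself; the exact count may differ by a few bytes across gzip versions, because the original filename gets stored in the header (gzip -n omits it):
echo a > file.txt
wc -c file.txt               # 2 bytes: "a" plus a newline
gzip -c file.txt | wc -c     # ~31 bytes
gzip -nc file.txt | wc -c    # a bit smaller without the stored filename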
Upvotes: 0