Avi Tal
Avi Tal

Reputation: 157

Interesting behavior of various compression software programs

I am trying to compress a 64 MByte text file which includes only 7 different letters, randomly distributed and which appear at approximately similar frequencies.

I noticed an interesting behavior of various compression software programs such as 7Zip and WinRar. Both applications achieve a compression ratio of about 35%.

When the file includes 8 different letters, also randomly distributed and appearing at approximately similar frequencies, the compression ratio is less than 0.3%!

Can anyone explain this?

Thanks.

Upvotes: 0

Views: 47

Answers (1)

Mark Adler
Mark Adler

Reputation: 112642

It can only be explained by a mistake in how you constructed the file with "8 different letters". Your construction of the file with seven different letters appears correct, as the compression ratio should be log2(7)/8, which is 0.351. For the same thing with eight letters, the compression ratio is log2(8)/8, which is 0.375.

Perhaps your file has a repeating pattern in the eight letters.

Update:

You are using rand() to generate your "random" distribution. Unfortunately, the classic implementation of rand() has very poor randomness in the low few bits, with repeating patterns. Your % 7 uses all of the bits from rand(), but % 8 uses only the low three bits. % 8 is equivalent to & 7.

Use random() instead, which generates random numbers for which the low bits, as well as any of the bits, have good random behavior.

Upvotes: 2

Related Questions