Reputation: 157
I am trying to compress a 64 MByte text file which includes only 7 different letters, randomly distributed and which appear at approximately similar frequencies.
I noticed an interesting behavior of various compression software programs such as 7Zip and WinRar. Both applications achieve a compression ratio of about 35%.
When the file includes 8 different letters, also randomly distributed and appearing at approximately similar frequencies, the compression ratio is less than 0.3%!
Can anyone explain this?
Thanks.
Upvotes: 0
Views: 47
Reputation: 112642
It can only be explained by a mistake in how you constructed the file with "8 different letters". Your construction of the file with seven different letters appears correct, as the compression ratio should be log2(7)/8, which is 0.351. For the same thing with eight letters, the compression ratio is log2(8)/8, which is 0.375.
Perhaps your file has a repeating pattern in the eight letters.
Update:
You are using rand()
to generate your "random" distribution. Unfortunately, the classic implementation of rand()
has very poor randomness in the low few bits, with repeating patterns. Your % 7
uses all of the bits from rand()
, but % 8
uses only the low three bits. % 8
is equivalent to & 7
.
Use random()
instead, which generates random numbers for which the low bits, as well as any of the bits, have good random behavior.
Upvotes: 2