Questions123

Reputation: 49

Compression Gzip/7Zip

When I calculate the entropy values of files compressed with Gzip, PKZIP, 7-Zip and WinRAR, I find that the compression rate of Gzip is higher than the others: the entropy value is higher (indicating less redundancy) and the file size is smaller. Even for small files, the overhead of Gzip is lower than that of the other tools. To be fair, this is not the case for all file formats; for xlsx, for example, 7-Zip and PKZIP give better results than Gzip and WinRAR. But still, I'm quite surprised, because 7-Zip is generally considered the better compressor in the sense that it reduces file size more, which does not match my results. Or did I do something completely wrong?

I did not base these results on just a few files: I compressed a whole range of files in different formats and calculated the differences in file size with Python, roughly as in the sketch below.
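The measurement looked something like this (a simplified sketch, not my exact script; Python's stdlib modules stand in for the external tools, and the file names are placeholders):

    import bz2
    import gzip
    import lzma

    def compressed_sizes(path):
        # gzip ~ the gzip tool's DEFLATE; lzma ~ 7-Zip's default LZMA(2).
        # The standard library has no codec standing in for WinRAR or PKZIP.
        with open(path, "rb") as f:
            data = f.read()
        return {
            "original": len(data),
            "gzip": len(gzip.compress(data)),
            "bzip2": len(bz2.compress(data)),
            "lzma": len(lzma.compress(data)),
        }

    for path in ["example.pdf", "example.xlsx"]:  # placeholder names
        print(path, compressed_sizes(path))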

What I also find quite interesting: when I look at PDF files, I would expect that PDF 1.5 or higher in particular could hardly be compressed further by a lossless algorithm, since such files are already heavily compressed internally. But I don't see much difference between PDFs below version 1.5 and those at 1.5 or above; both are compressed quite heavily by these tools.

By the way, I used the default algorithms and settings of these archivers.

Can someone explain how/why this is the case (maybe I'm doing something wrong), or do these results actually make sense (I can't find anything online that supports them)?

Upvotes: -1

Views: 1109

Answers (3)

greybeard

Reputation: 2516

The tools you compare are file archivers - with the exception of gzip.

File archivers handle more than one file and/or one or more hierarchies of files. Usually the file formats keep metadata about each individual file, and the tools allow operating on any single file with much less effort than handling the entire archive.

gzip has been used to compress and decompress archives produced by (then) non-compressing archivers such as tar (tape archiver, late 1970s) or pax: solid compression. The metadata can be as low as 18 bytes - for the entire file.
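A minimal sketch (Python, synthetic data; exact numbers will vary) of both points - the fixed wrapper and the benefit of solid compression over compressing many small files individually:

    import gzip

    # The fixed wrapper alone: 10-byte header + 8-byte trailer,
    # plus a 2-byte empty DEFLATE stream.
    print(len(gzip.compress(b"")))  # typically 20

    # 100 hypothetical small "files" of 50 bytes each
    chunks = [bytes([i]) * 50 for i in range(100)]

    # Solid: one stream, one wrapper, shared match history.
    solid = len(gzip.compress(b"".join(chunks)))

    # Per file: 100 wrappers, no history shared between files.
    individual = sum(len(gzip.compress(c)) for c in chunks)

    print("solid:", solid, "per-file:", individual)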

Upvotes: 0

D.W.

Reputation: 3615

We can't answer whether you've done anything wrong without seeing the specifics of your experimental methodology and your results, but here are some general remarks:

The compression rate is likely to depend on what types of files you are compressing. I think you will find it is common for one compression algorithm to do better than a second on some types of files, and worse on others.

Also, there is a tradeoff between compression rate and computation time. Some compressors "try harder" to compress, i.e., they are willing to spend more computation time in exchange for hopefully better compression.
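For instance, here is a minimal sketch of that tradeoff using Python's zlib (the input file name is a placeholder; exact numbers depend on the data):

    import time
    import zlib

    with open("example.bin", "rb") as f:  # placeholder input
        data = f.read()

    for level in (1, 6, 9):  # fast, default, "try harder"
        start = time.perf_counter()
        size = len(zlib.compress(data, level))
        elapsed = time.perf_counter() - start
        print(f"level {level}: {size} bytes in {elapsed:.3f} s")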

Finally, some compression algorithms are "just better" in that they're likely to perform better across the board on many types of files.

There might be a misunderstanding. Compression rate is defined as the size of the original file divided by the size of the compressed file. The entropy of the compressed file does not affect the compression rate.
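As a concrete illustration of that definition (a minimal sketch; the function name is mine):

    def compression_rate(original_size, compressed_size):
        # e.g. 1_000_000 bytes compressed to 250_000 bytes -> rate 4.0
        return original_size / compressed_size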

Upvotes: 1

Mark Adler

Reputation: 112502

"The entropy value is higher (indicating less redundancy) ...". The entropy is relative to a model of the data. If you are using zeroth-order entropy, that can only provide an indication that the data has been compressed (or encrypted), and appears to be random. If the result is close to the number of bits you are measuring, which I'm sure it is in this case, then it can't be used to compare the effectiveness of compression.

"... and the file size is smaller." That's the only way to compare the effectiveness of compression.

All of the tools you mention, except for gzip, have several different compression methods they can employ. For each (including gzip), there are levels of compression, i.e. how hard it works at it, that can be specified. If you're going to attempt to benchmark compression methods, you need to at least say what they were and what parameters were given to them.

Though you don't need to bother. There are many that have already been done for you. Google "compression benchmark".

Upvotes: 2
