Identify more Compressible Dataset by observing Input distribution

Question

This may be a repeat of the question here: Predict Huffman compression ratio without constructing the tree

So basically, I have the probabilistic distribution of two datasets with the same variables but different probabilities. Now, is there any way that by looking at the variable distribution, I can to some degree confidently say that the dataset, when passed through a Huffman Coding implementation would achieve a higher compression ratio than the other?

One of the solutions that I came across was to calculate the upper bound using conditional entropy and then compute the average code length. Is there any other approach that can I can probably explore before using the said method?

Thanks a lot.

Mark Adler · Accepted Answer

I don't know what "to some degree confidently" means, but you can get a lower bound on the compressed size of each set by computing the zero-order entropy as done in the linked question (the negative of the sum of the probabilities times the log of the probabilities). Then the lower entropy very likely produces a shorter Huffman coding than the higher entropy. It is not definite, as I am sure that one could come up with a counter-example.

You also need to send a description of the code itself if you want to decode it on the other end, which adds a wrinkle to the comparison. However if the data is much larger than the code description, then that will be lost in the noise.

Simply generating the code, the coded data, and the code description is very fast. The best solution is to do that, and compare the resulting number of bits directly.

Identify more Compressible Dataset by observing Input distribution

Answers (1)

Related Questions