James

Reputation: 2566

Possible to create a compression algorithm that uses an enormous (100GB?) pseudo-random look-up file?

Would it be possible/practical to create a compression algorithm that splits a file into chunks and then compares those chunks against an enormous (100GB? 200GB?) pseudo-random file?

The resulting "compressed" file would contain an ordered list of offsets and lengths. Everyone using the algorithm would need the same enormous file in order to compress/decompress files.
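To pin down the idea, here is a rough sketch of what I have in mind (the 16-byte chunk size and the function names are arbitrary; whether a chunk can actually be found in a random file is exactly what I'm unsure about):

    CHUNK = 16  # hypothetical chunk size in bytes

    def compress(data: bytes, dictionary: bytes):
        """Proposed scheme: for each chunk of the input, find it in the big
        pseudo-random dictionary and store (offset, length) instead."""
        out = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            offset = dictionary.find(chunk)  # likely -1 for a random dictionary
            if offset == -1:
                raise ValueError("chunk not found in dictionary")
            out.append((offset, len(chunk)))
        return out  # the "compressed" file: an ordered list of (offset, length)

    def decompress(pairs, dictionary: bytes) -> bytes:
        return b"".join(dictionary[off:off + length] for off, length in pairs)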

Would this work? I assume someone else has thought of this before and tried it but it's a tough one to Google.

Upvotes: 3

Views: 564

Answers (2)

usr

Reputation: 171178

Cyan is correct. Even more: you wouldn't need such a file at all. You can deterministically produce the same pseudo-random sequence on demand without ever storing it. Looked at that way, you can see that the random lookup file adds no value.
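As a minimal sketch of that point (the seed constant and the skip-ahead approach are just for illustration), both sides can regenerate any slice of the "lookup file" from a shared seed, so the enormous file itself carries no information:

    import random

    SEED = 0xC0FFEE  # any constant both sides agree on

    def dictionary_bytes(offset: int, length: int) -> bytes:
        """Reproduce bytes [offset, offset+length) of the pseudo-random
        'lookup file' without ever storing the file itself."""
        rng = random.Random(SEED)
        # Generating and discarding the prefix is fine for a demo; a PRNG
        # with seekable output would jump to the offset directly.
        stream = rng.randbytes(offset + length)
        return stream[offset:offset + length]

    # Compressor and decompressor both call dictionary_bytes(off, n) and get
    # identical data, so shipping a 100GB file gains nothing.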

Upvotes: 3

Cyan

Reputation: 13948

It's a common trick, used by many compression "claimers", who regularly announce "revolutionary" compression ratios, up to ridiculous levels.

The trick depends, obviously, on what's in the reference dictionary.

If such a dictionary is just "random", as suggested, then it is useless. Simple math will show that the offset will cost, on average, as much as the data it references.
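Rough numbers, assuming a 100 GB dictionary of uniformly random bytes: an offset into it takes about log2(10^11) ≈ 37 bits, while the longest match you can expect to find for an arbitrary chunk is only about log256(10^11) ≈ 4-5 bytes, i.e. roughly those same 37 bits of data:

    import math

    DICT_SIZE = 100 * 10**9  # 100 GB of uniformly random bytes (assumed)

    offset_bits = math.log2(DICT_SIZE)         # bits needed to store one offset
    expected_match = math.log(DICT_SIZE, 256)  # longest chunk likely to be found, in bytes

    print(f"offset costs      ~{offset_bits:.1f} bits")
    print(f"expected match is ~{expected_match:.1f} bytes "
          f"= ~{8 * expected_match:.1f} bits of data")
    # ~36.5 bits vs ~36.5 bits: on average the reference saves nothing,
    # and that's before counting the length field.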

But if the dictionary happens to contain large parts of the input file, or the entire file, then the input will be "magically" compressed to a reference, or a series of references.

Such tricks are called "hiding the entropy". Matt Mahoney wrote a simple program (barf) to demonstrate this technique, up to the point of reducing anything to 1 byte.
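As an illustration of the same idea (this is not Matt Mahoney's actual barf code, just the shape of the trick): copy the input into the "dictionary", and the "compressed" output can be a single byte, because the entropy now lives in the dictionary instead of the archive:

    import shutil

    def cheat_compress(src, archive, dictionary):
        """'Compress' by copying the whole input into the dictionary and
        emitting a one-byte token. The entropy has not gone away;
        it has merely been hidden in the dictionary."""
        shutil.copyfile(src, dictionary)
        with open(archive, "wb") as f:
            f.write(b"\x01")  # the entire 'compressed' file

    def cheat_decompress(archive, dictionary, dst):
        shutil.copyfile(dictionary, dst)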

The solution to this trickery is that a comparison exercise should always include the compressed data, the decompression program, and any external dictionary it uses. When all these elements are counted in the equation, it is no longer possible to "hide" entropy anywhere, and the cheat gets revealed.

Upvotes: 5
