James

Reputation: 2566

Possible to create a compression algorithm that uses an enormous (100GB?) pseudo-random look-up file?

Would it be possible/practical to create a compression algorithm that splits a file into chunks and then compares those chunks against an enormous (100GB? 200GB?) pseudo-random file?

The resulting "compressed" file would contain an ordered list of offsets and lengths. Everyone using the algorithm would need the same enormous file in order to compress/decompress files.
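To pin down the idea, here is a rough sketch of what I have in mind (the 16-byte chunk size and the function names are arbitrary; whether a chunk can actually be found in a random file is exactly what I'm unsure about):

    CHUNK = 16  # hypothetical chunk size in bytes

    def compress(data: bytes, dictionary: bytes):
        """Proposed scheme: for each chunk of the input, find it in the big
        pseudo-random dictionary and store (offset, length) instead."""
        out = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            offset = dictionary.find(chunk)  # likely -1 for a random dictionary
            if offset == -1:
                raise ValueError("chunk not found in dictionary")
            out.append((offset, len(chunk)))
        return out  # the "compressed" file: an ordered list of (offset, length)

    def decompress(pairs, dictionary: bytes) -> bytes:
        return b"".join(dictionary[off:off + length] for off, length in pairs)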

Would this work? I assume someone else has thought of this before and tried it but it's a tough one to Google.

Upvotes: 3

Views: 564

Answers (2)

usr

Reputation: 171178

Cyan is correct. Even more: you wouldn't need such a file at all. You can deterministically produce the same pseudo-random sequence on demand without ever storing it. Looked at that way, you can see that the random lookup file adds no value.
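As a minimal sketch of that point (the seed constant and the skip-ahead approach are just for illustration), both sides can regenerate any slice of the "lookup file" from a shared seed, so the enormous file itself carries no information:

    import random

    SEED = 0xC0FFEE  # any constant both sides agree on

    def dictionary_bytes(offset: int, length: int) -> bytes:
        """Reproduce bytes [offset, offset+length) of the pseudo-random
        'lookup file' without ever storing the file itself."""
        rng = random.Random(SEED)
        # Generating and discarding the prefix is fine for a demo; a PRNG
        # with seekable output would jump to the offset directly.
        stream = rng.randbytes(offset + length)
        return stream[offset:offset + length]

    # Compressor and decompressor both call dictionary_bytes(off, n) and get
    # identical data, so shipping a 100GB file gains nothing.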

Upvotes: 3

Cyan

Reputation: 13948

It's a common trick, used by many compression "claimers", who regularly announce "revolutionary" compression ratios, up to ridiculous levels.

The trick depends, obviously, on what's in the reference dictionary.

If such a dictionary is just "random", as suggested, then it is useless. Simple math will show that the offset will cost, on average, as much as the data it references.
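Rough numbers, assuming a 100 GB dictionary of uniformly random bytes: an offset into it takes about log2(10^11) ≈ 37 bits, while the longest match you can expect to find for an arbitrary chunk is only about log256(10^11) ≈ 4-5 bytes, i.e. roughly those same 37 bits of data:

    import math

    DICT_SIZE = 100 * 10**9  # 100 GB of uniformly random bytes (assumed)

    offset_bits = math.log2(DICT_SIZE)         # bits needed to store one offset
    expected_match = math.log(DICT_SIZE, 256)  # longest chunk likely to be found, in bytes

    print(f"offset costs      ~{offset_bits:.1f} bits")
    print(f"expected match is ~{expected_match:.1f} bytes "
          f"= ~{8 * expected_match:.1f} bits of data")
    # ~36.5 bits vs ~36.5 bits: on average the reference saves nothing,
    # and that's before counting the length field.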

But if the dictionary happens to contain large parts of the input file, or the entire file, then the input will be "magically" compressed to a reference, or a series of references.

Such tricks are called "hiding the entropy". Matt Mahoney wrote a simple program (barf) to demonstrate this technique, up to the point of reducing anything to 1 byte.
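As an illustration of the same idea (this is not Matt Mahoney's actual barf code, just the shape of the trick): copy the input into the "dictionary", and the "compressed" output can be a single byte, because the entropy now lives in the dictionary instead of the archive:

    import shutil

    def cheat_compress(src, archive, dictionary):
        """'Compress' by copying the whole input into the dictionary and
        emitting a one-byte token. The entropy has not gone away;
        it has merely been hidden in the dictionary."""
        shutil.copyfile(src, dictionary)
        with open(archive, "wb") as f:
            f.write(b"\x01")  # the entire 'compressed' file

    def cheat_decompress(archive, dictionary, dst):
        shutil.copyfile(dictionary, dst)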

The solution to this trickery is that a comparison exercise should always include the compressed data, the decompression program, and any external dictionary it uses. When all these elements are counted in the equation, it is no longer possible to "hide" entropy anywhere, and the cheat gets revealed.

Upvotes: 5
