Bounds on practical compression and Kolmogorov complexity

Question

I'm looking into using compression to measure how closely a document is related to a corpus of documents. In doing so I've found a strange result when using bzip2: len(compress(corpus)) > len(compress(corpus + new_document)). Should this happen with a practical compression algorithm, and is it theoretically possible in terms of the Kolmogorov complexity of the data? (The idea is to use a compression algorithm to approximate the complexity of the data.)
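For anyone who wants to reproduce the measurement, here is a minimal sketch using Python's standard bz2 module. The helper names compressed_len and added_cost are made up for illustration, and the sample bytes are placeholders rather than the real corpus.

```python
import bz2


def compressed_len(data: bytes) -> int:
    """Length in bytes of the bzip2-compressed form of `data`."""
    return len(bz2.compress(data))


def added_cost(corpus: bytes, new_document: bytes) -> int:
    """Change in compressed size when `new_document` is appended to `corpus`.

    A negative value is the "strange" case from the question:
    the corpus plus the new document compresses to fewer bytes than the corpus alone.
    """
    return compressed_len(corpus + new_document) - compressed_len(corpus)


if __name__ == "__main__":
    # Placeholder data (an assumption, not the original corpus).
    corpus = b"abc,def," * 3 + b"abc,"
    new_document = b"def,"
    print(added_cost(corpus, new_document))
```

Whether the printed value is negative depends on the data and on bzip2's block handling, so treat the number as an observation about one input, not a general property.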

Nick Fortescue · Accepted Answer

Yes, this can happen with a practical compression algorithm, and it is theoretically possible with Kolmogorov complexity too. The easiest way to explain why is with an example.

Suppose the following:

your document separator character is ,
the corpus is the documents abc,def,abc,def,abc,def,abc,
the new document is def (appended together with its separator, as def,)
your compression algorithm (or Kolmogorov description language) only allows repetition, written as a repeat count followed by | and then the repeated string (this is a variant of run-length encoding)

Then:

compress(corpus) = "3|abc,def,1|abc,"
compress(corpus+new_document) = "4|abc,def,"

So compress(corpus) is longer than compress(corpus+new_document). It's a bit contrived, but hopefully explains how the result could theoretically appear with a simple scheme. I'm not saying this is what happens with bzip2, just showing how it is theoretically possible.
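To make the toy scheme concrete, here is a rough sketch of one possible encoder for it. The function names and the greedy "pick the repeat that saves the most characters" rule are my own assumptions; the scheme above does not specify how units are chosen, and this is not how bzip2 works. It keeps the trailing separator on the leftover abc, token.

```python
def repeat_count(text: str, unit: str) -> int:
    """Count how many times `unit` occurs back to back at the start of `text`."""
    count = 0
    while text.startswith(unit, count * len(unit)):
        count += 1
    return count


def toy_compress(text: str) -> str:
    """Greedy toy encoder: emit tokens of the form "<count>|<unit>".

    At each position it tries every prefix of the remaining text as a unit
    and keeps the one that saves the most characters (ties go to the
    candidate that covers more input). Assumed rule, for illustration only.
    """
    out = []
    pos = 0
    while pos < len(text):
        rest = text[pos:]
        best = None  # (savings, covered, unit, count)
        for k in range(1, len(rest) + 1):
            unit = rest[:k]
            count = repeat_count(rest, unit)
            covered = k * count
            token_len = len(str(count)) + 1 + k  # digits + "|" + unit
            candidate = (covered - token_len, covered, unit, count)
            if best is None or candidate[:2] > best[:2]:
                best = candidate
        _, covered, unit, count = best
        out.append(f"{count}|{unit}")
        pos += covered
    return "".join(out)


if __name__ == "__main__":
    corpus = "abc,def,abc,def,abc,def,abc,"
    new_document = "def,"
    print(toy_compress(corpus))                 # 3|abc,def,1|abc,
    print(toy_compress(corpus + new_document))  # 4|abc,def,
```

With these inputs the corpus alone encodes to 16 characters and the corpus plus the new document to 10, because the extra document completes a fourth repetition of the unit instead of leaving a leftover fragment.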

Edit: It has been mentioned in another answer that run-length encoding is not Turing complete and so cannot be used for Kolmogorov complexity. While this is true, you can implement a run-length decoder in whatever Turing-complete description language you choose to use, with the same result, so the example still holds.
