Reputation: 137
I'm looking into using compression as a way to measure the relation of a document to a corpus of documents. In doing so I've found a strange result when using bzip2; len(compress(corpus)) > len(compress(corpus + new_document)). Should this be the case with a practical compression algorithm and is this theoretically possible when looking at the Kolmogorov complexity of data? (the idea is to use a compression algorithm to approximate the complexity of the data)
Upvotes: 3
Views: 595
Reputation: 44183
Yes, it should be the case with a practical compression algorithm, and is theoretically possible with Kolmogorov complexity. The easiest way to explain why is with an example.
Suppose the following:
,
|
(this is a variant of run-length encoding)Then:
So compress(corpus)
is longer than compress(corpus+new_document)
. It's a bit contrived, but hopefully explains how the result could theoretically appear with a simple scheme. I'm not saying this is what happens with bzip2, just showing how it is theoretically possible.
Edit It has been mentioned in another answer that run-length encoding is not Turing complete and so cannot be used for Kolmogorov complexity. While this is true, using a Turing language you can implement an encoding of runlength in whatever description language you choose to use, with the same result, so the example still holds valid.
Upvotes: 4
Reputation: 49794
Real-life compression algorithms have quirks like this but they only provide for a very crude approximation anyway.
And as for whether it can happen in theory, probably, but the difference isn't significant.
Let's assume you've got two strings, x and y, where x is a prefix of y. Let's say for example that
x = "asdfasdfasdfasdfasdfasdfasdfasdfasdf"
y = "asdfasdfasdfasdfasdfasdfasdfasdfasdf23452345234523452344523452452345234524345234"
Let's assume furthermore that D is the shortest description of y. (I.e. K(y) = |D|)
In this case, x can be described as |"the number described by D minus 46 characters"|, which is longer than D but only by a small constant and a logarithmic factor (the number of characters in the rest of the instructions basically).
There might even be a shorter description of x but we know that at worst, K(x) <= K(y)+log(|y|-|x|)
However, you have to bear in mind that theoretical Kolmogorov complexity is incomputable and constant differences mean nothing in this field.
(N.b.: The RLE example above isn't valid as RLE isn't a Turing complete language, therefore it can't be used as a description language for Kolmogorov complexity.)
Upvotes: 1