Reputation: 26622
I want to estimate Kolmogorov complexity of short strings (about one word long) by compressing them with LZMA and taking the compression ratio.
What's the most efficient way of doing it in Python3?
Upvotes: 3
Views: 3240
Reputation: 1207
Edit:
I am not sure if this is a good way to estimate the complexity of short strings, because to correctly compute the Kolmogorov (K-) complexity of a string we have to factor in the length of the program used to decompress the string. The length of the program (67k for xz 5.1.0 on my Debian laptop) will overwhelm short strings. Thus, the following program is closer to computing a K-complexity upper bound:
import lzma #For python 2.7 use backports.lzma
program_length = 67000
def lzma_compression_ratio(test_string):
bytes_in = bytes(test_string,'utf-8')
bytes_out = lzma.compress(bytes_in)
lbi = len(bytes_in)
lbo = len(bytes_out)+program_length
ratio = lbo/lbi
message = '%d bytes compressed to %d bytes, ratio %0.3f'%(lbi,lbo,ratio)
print(message)
return ratio
test_string = 'a man, a plan, a canal: panama'
lzma_compression_ratio(test_string)
for n in range(22,25):
test_string = 'a'*(2**n)
lzma_compression_ratio(test_string)
The output below shows that the compression ratio is over 2000 for a string of 30 a's, and falls below 0.01 for repeating strings of length 2^23. These are technically correct upper bounds on K-complexity, but clearly not useful for the short string. The program "print('a'*30)" has length 13, which gives a K-complexity upper bound of 0.43 (13/30) for the string 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'.
30 bytes compressed to 67024 bytes, ratio 2234.133
4194304 bytes compressed to 67395 bytes, ratio 0.016
8388608 bytes compressed to 68005 bytes, ratio 0.008
16777216 bytes compressed to 69225 bytes, ratio 0.004
Original answer:
@Superbest, this seems to work, but I don't know if it's the most efficient:
import lzma
def lzma_compression_ratio(test_string):
c = lzma.LZMACompressor()
bytes_in = bytes(test_string,'utf-8')
bytes_out = c.compress(bytes_in)
return len(bytes_out)/len(bytes_in)
test_string = 'a man, a plan, a canal: panama'
compression_ratio = lzma_compression_ratio(test_string)
print(compression_ratio)
Upvotes: 3