Superbest
Superbest

Reputation: 26622

Fastest way to measure compression ratio of a string in Python3

I want to estimate Kolmogorov complexity of short strings (about one word long) by compressing them with LZMA and taking the compression ratio.

What's the most efficient way of doing it in Python3?

Upvotes: 3

Views: 3240

Answers (1)

rjonnal
rjonnal

Reputation: 1207

Edit:

I am not sure if this is a good way to estimate the complexity of short strings, because to correctly compute the Kolmogorov (K-) complexity of a string we have to factor in the length of the program used to decompress the string. The length of the program (67k for xz 5.1.0 on my Debian laptop) will overwhelm short strings. Thus, the following program is closer to computing a K-complexity upper bound:

import lzma #For python 2.7 use backports.lzma

program_length = 67000

def lzma_compression_ratio(test_string):
    bytes_in = bytes(test_string,'utf-8')
    bytes_out = lzma.compress(bytes_in)
    lbi = len(bytes_in)
    lbo = len(bytes_out)+program_length
    ratio = lbo/lbi
    message = '%d bytes compressed to %d bytes, ratio %0.3f'%(lbi,lbo,ratio)
    print(message)
    return ratio

test_string = 'a man, a plan, a canal: panama'
lzma_compression_ratio(test_string)

for n in range(22,25):
    test_string = 'a'*(2**n)
    lzma_compression_ratio(test_string)

The output below shows that the compression ratio is over 2000 for a string of 30 a's, and falls below 0.01 for repeating strings of length 2^23. These are technically correct upper bounds on K-complexity, but clearly not useful for the short string. The program "print('a'*30)" has length 13, which gives a K-complexity upper bound of 0.43 (13/30) for the string 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'.

30 bytes compressed to 67024 bytes, ratio 2234.133
4194304 bytes compressed to 67395 bytes, ratio 0.016
8388608 bytes compressed to 68005 bytes, ratio 0.008
16777216 bytes compressed to 69225 bytes, ratio 0.004

Original answer:

@Superbest, this seems to work, but I don't know if it's the most efficient:

import lzma

def lzma_compression_ratio(test_string):
    c = lzma.LZMACompressor()
    bytes_in = bytes(test_string,'utf-8')
    bytes_out = c.compress(bytes_in)
    return len(bytes_out)/len(bytes_in)

test_string = 'a man, a plan, a canal: panama'
compression_ratio = lzma_compression_ratio(test_string)
print(compression_ratio)

Upvotes: 3

Related Questions