Reputation: 1339

String compression using repeated chars count

I'm going through Cracking the Code and there is this question where they ask to write a method for string compression so:

aabbccccaa

Would become:

a2b1c4a2

I came up with this:

''.join(y+str.count(y) for y in set(str))

But my output was:

a5c4b1

Could someone point me in the clean direction?

Sorry for bad edits, I'm on a cellphone

Upvotes: 2

Answers (2)

John

Reputation: 641

I thought I'd give a modification based on niemmi's answer that is forward and reversible up to 0x200 occurences of a character and is testable for correctness:

output = []
for k,g in groupby(txt):
    k = bytes(k,'ascii')
    g = sum(1 for x in g)
    if g > 0xff:
        output.append(b''.join(k,b'0xff'))
        by = bytes([0xff-g])
        output.append(b''.join(k,by))
    else:
        by = bytes([g])
        output.append(b''.join((k,by)))

compressed = b''.join(output)

decompressed = []
for char, count in zip(compressed[::2], compressed[1::2]):
    decompressed.append(chr(char)*count)

decompressed = ''.join(decompressed)

The compression can easily be extended up to any number of occurences beyond 0x200 using the above logic, but the decompression always stays the same for use with any number of occurences. The reason for using bytes is so that you can actually get compression if you were to use a file instead of a python datatype. However this is limited to ascii so you will need to preprocess like so:

import string
txt = open('to_compress.txt').read()
txt = "".join(filter(lambda x: x in string.printable, txt))

Furthermore, you will want to test that it works.

print("Same?", txt == decompressed)
>>> Same? True

You may want to install the pip package burrowswheeler, then you can just:

to_compress = burrowswheeler.transform(txt)
... compress decompress...
burrowswheeler.inverse(decompressed)

Upvotes: 0

niemmi

Reputation: 17263

You could use groupby to do the work for you:

>>> from itertools import groupby
>>> s = 'aabbccccaa'
>>> ''.join(k + str(sum(1 for x in g)) for k, g in groupby(s))
'a2b2c4a2'

Upvotes: 9

String compression using repeated chars count

Answers (2)

Related Questions