Reputation: 1339
I'm going through Cracking the Code and there is this question where they ask to write a method for string compression so:
aabbccccaa
Would become:
a2b1c4a2
I came up with this:
''.join(y+str.count(y) for y in set(str))
But my output was:
a5c4b1
Could someone point me in the clean direction?
Sorry for bad edits, I'm on a cellphone
Upvotes: 2
Views: 1506
Reputation: 641
I thought I'd give a modification based on niemmi's answer that is forward and reversible up to 0x200 occurences of a character and is testable for correctness:
output = []
for k,g in groupby(txt):
k = bytes(k,'ascii')
g = sum(1 for x in g)
if g > 0xff:
output.append(b''.join(k,b'0xff'))
by = bytes([0xff-g])
output.append(b''.join(k,by))
else:
by = bytes([g])
output.append(b''.join((k,by)))
compressed = b''.join(output)
decompressed = []
for char, count in zip(compressed[::2], compressed[1::2]):
decompressed.append(chr(char)*count)
decompressed = ''.join(decompressed)
The compression can easily be extended up to any number of occurences beyond 0x200 using the above logic, but the decompression always stays the same for use with any number of occurences. The reason for using bytes is so that you can actually get compression if you were to use a file instead of a python datatype. However this is limited to ascii so you will need to preprocess like so:
import string
txt = open('to_compress.txt').read()
txt = "".join(filter(lambda x: x in string.printable, txt))
Furthermore, you will want to test that it works.
print("Same?", txt == decompressed)
>>> Same? True
You may want to install the pip package burrowswheeler, then you can just:
to_compress = burrowswheeler.transform(txt)
... compress decompress...
burrowswheeler.inverse(decompressed)
Upvotes: 0