Reputation: 6698
I have a bunch of long strings (16200 characters each) that I want to compress. Each string only uses 12 different characters (currently _oOwWgGmdDsS, but those can change if needed).
I currently use a compression scheme I made myself: I write each character followed by the number of times it repeats before a different character appears. So if the uncompressed text looks like this:
ooooooWW_
Then the compressed becomes
o6W2_1
For the strings I currently have this reduced the size from about 128MB to 4MB. However, as you can see, for the W's there is no saving, and for the _ there's even a loss.
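For reference, the scheme described above is usually called run-length encoding. A minimal sketch of it in Python might look like the following; the function names `rle_encode` and `rle_decode` are made up for illustration, and the decoder assumes the alphabet never contains digits (true for the 12 characters listed above).

```python
import re
from itertools import groupby


def rle_encode(text: str) -> str:
    # Emit each character followed by the length of its run.
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))


def rle_decode(encoded: str) -> str:
    # Each token is one non-digit character followed by a decimal count.
    # Assumption: the alphabet itself contains no digits.
    pairs = re.findall(r"(\D)(\d+)", encoded)
    return "".join(ch * int(count) for ch, count in pairs)


print(rle_encode("ooooooWW_"))  # o6W2_1
print(rle_decode("o6W2_1"))     # ooooooWW_
```

Runs of length 1 or 2 are where this scheme stops paying off, which matches the W and _ cases above.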
So I was wondering, are there more sophisticated compression schemes I can use? The end result has to be plain text however, not binary data.
Note: It would also be great if a library implementing the scheme were available for both Python and Lua.
Upvotes: 2
Views: 5837
Reputation: 3352
This question seems to be implicitly asking for pointers on what compression is and how it works. Mark's answer works for longer strings, but I'd also suggest you read this guide on what zlib actually does.
Running Mark's code (edited to decompress the compressed text) in IPython with Python 3:
In [1]: import zlib
...: import base64
...: text = input('Text to compress > ')
...: compressed = base64.b64encode(zlib.compress(text.encode())).decode()
...: print('Compressed Text:', compressed)
...: decompressed = zlib.decompress(base64.b64decode(compressed)).decode()
...: print('Decompressed Text:', decompressed)
Text to compress > some text I wrote
Compressed Text: eJwrzs9NVShJrShR8FQoL8ovSQUAOSwGVA==
Decompressed Text: some text I wrote
You can see that the "compressed" text is actually roughly twice as many characters as the input for this tiny example. A longer example input (of say 200 chars) begins to show benefit.
This is because base64 encoding:
[...] causes an overhead of 33–36% (33% by the encoding itself; up to 3% more by the inserted line breaks).
So zlib has to shrink your data by more than that overhead before you see any benefit. Meanwhile, as Mark points out in this answer, it's hard to predict what compression ratio zlib will give you without knowing the data.
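To get a feel for where the break-even point lies, here is a rough sketch that compares input and output lengths for a short string and for a synthetic 16200-character string over the question's 12-character alphabet. The sample generator `make_sample`, its run lengths, and the helper names are assumptions made purely for illustration; real data will compress differently, so measure on your own strings.

```python
import base64
import random
import zlib

ALPHABET = "_oOwWgGmdDsS"  # the 12 characters from the question


def make_sample(length: int = 16200, seed: int = 0) -> str:
    # Hypothetical test data: random runs of 1 to 40 repeated characters.
    rng = random.Random(seed)
    parts = []
    total = 0
    while total < length:
        run = rng.randint(1, 40)
        parts.append(rng.choice(ALPHABET) * run)
        total += run
    return "".join(parts)[:length]


def compress_text(text: str) -> str:
    # zlib to binary, then base64 to get plain text again.
    return base64.b64encode(zlib.compress(text.encode())).decode()


def decompress_text(encoded: str) -> str:
    return zlib.decompress(base64.b64decode(encoded)).decode()


for text in ("some text I wrote", make_sample()):
    encoded = compress_text(text)
    assert decompress_text(encoded) == text  # round trip is lossless
    print(len(text), "->", len(encoded))
```

For the short string the output is longer than the input, while the long, repetitive string should come out considerably smaller even after the base64 step.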
Upvotes: 3
Reputation: 112284
Use zlib to compress to binary, and then base64 to expand the binary to plain text. Python has both built in. A little googling will turn up Lua bindings for zlib and base64 code.
Example:
import zlib
import base64
text = input('Text to compress > ')
compressed = base64.b64encode(zlib.compress(text.encode())).decode()
print('Compressed Text:', compressed)
text = input('Text to decompress > ')
decompressed = zlib.decompress(base64.b64decode(text.encode())).decode()
print('Decompressed Text:', decompressed)
Upvotes: 2