The Oddler

Reputation: 6698

Compressing simple text to text

I have a bunch of long strings (16200 characters each) that I want to compress. Each string uses only 12 different characters (currently _oOwWgGmdDsS, but those can change if needed).

I currently use a compression scheme I wrote myself: for each run of identical characters, I write the character followed by the length of the run. So if the uncompressed text looks like this:

ooooooWW_

Then the compressed becomes

o6W2_1

For the strings I currently have, this reduced the total size from about 128MB to 4MB. However, as you can see, a run of two (the W's) saves nothing, and a single character (the _) actually gets longer.
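
For reference, the scheme described above (a form of run-length encoding) can be sketched in Python roughly as follows. The function names rle_encode and rle_decode are made up for illustration, and the decoder assumes the alphabet never contains digits, which holds for the 12 characters listed.

```python
import re
from itertools import groupby


def rle_encode(text: str) -> str:
    """Write each run as the character followed by the run length."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))


def rle_decode(encoded: str) -> str:
    """Invert rle_encode; assumes the alphabet contains no digits."""
    pairs = re.findall(r"(\D)(\d+)", encoded)
    return "".join(ch * int(count) for ch, count in pairs)


sample = "ooooooWW_"
packed = rle_encode(sample)
print(packed)  # o6W2_1
assert rle_decode(packed) == sample
```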

So I was wondering: are there more sophisticated compression schemes I can use? The end result has to be plain text, however, not binary data.

Note: It would also be awesome if there exists a library for both Python and Lua for them.

Upvotes: 2

Views: 5837

Answers (2)

Alex Moore-Niemi

Reputation: 3352

This question seems to be implicitly asking for some pointers on what compression is and how it works. Mark's answer works for longer strings, but I'd also suggest you read this guide on what zlib actually does.

Running Mark's code (edited to decompress the compressed text) in IPython (Python 3):

In [1]: import zlib
   ...: import base64
   ...: text = input('Text to compress > ')
   ...: compressed = base64.b64encode(zlib.compress(text.encode())).decode()
   ...: print('Compressed Text:', compressed)
   ...: decompressed = zlib.decompress(base64.b64decode(compressed)).decode()
   ...: print('Decompressed Text:', decompressed)
Text to compress > some text I wrote
Compressed Text: eJwrzs9NVShJrShR8FQoL8ovSQUAOSwGVA==
Decompressed Text: some text I wrote

You can see that the "compressed" text (36 characters) is actually roughly twice as long as the input (17 characters) for this tiny example. A longer example input (of, say, 200 characters) begins to show a benefit.

This is because base64 encoding:

[...] causes an overhead of 33–36% (33% by the encoding itself; up to 3% more by the inserted line breaks).

So zlib has to shrink your data by more than that overhead before you see any benefit. Meanwhile, as Mark points out in this answer, it's hard to predict, independent of the data, what compression ratio zlib will give you.
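
To put rough numbers on this, the sketch below compares input and output lengths for the short example above and for a longer, highly repetitive string. The long string is made up for illustration (it is not the asker's real data), and b64_length and packed_length are hypothetical helper names. Standard padded base64 emits 4 characters for every 3 input bytes, rounded up, which is where the 33% comes from.

```python
import base64
import zlib


def b64_length(n_bytes: int) -> int:
    """Length of standard padded base64 output for n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


def packed_length(text: str) -> int:
    """Number of characters after zlib compression plus base64 encoding."""
    raw = zlib.compress(text.encode())
    encoded = base64.b64encode(raw)
    # b64encode adds no line breaks, so only the 4-for-3 expansion applies.
    assert len(encoded) == b64_length(len(raw))
    return len(encoded)


short = "some text I wrote"
# Made-up repetitive input, 16200 characters long.
long_text = "ooooooWW_" * 1800

for text in (short, long_text):
    print(len(text), "->", packed_length(text))
```

On the short input the output is longer than the input; on the long repetitive one it should be far shorter, though the exact ratio depends on the data.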

Upvotes: 3

Mark Adler

Reputation: 112284

Use zlib to compress to binary, and then base64 to expand the binary to plain text. Python has both built in. A little googling will turn up Lua bindings for zlib and base64 code.

Example:

import zlib
import base64
# Compress: text -> UTF-8 bytes -> zlib -> base64 -> plain text
text = input('Text to compress > ')
compressed = base64.b64encode(zlib.compress(text.encode())).decode()
print('Compressed Text:', compressed)
# Decompress: base64 text -> zlib bytes -> original text
text = input('Text to decompress > ')
decompressed = zlib.decompress(base64.b64decode(text.encode())).decode()
print('Decompressed Text:', decompressed)
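
If the strings come from a file or a variable rather than from input(), the same two steps can be wrapped in a pair of helpers. This is only a sketch: the names compress_text and decompress_text are invented here, the sample string is made up, and compression level 9 is an assumption that trades speed for size.

```python
import base64
import zlib


def compress_text(text: str) -> str:
    """zlib-compress text and return it as base64 (plain ASCII) text."""
    raw = zlib.compress(text.encode("utf-8"), 9)  # 9 = best compression
    return base64.b64encode(raw).decode("ascii")


def decompress_text(packed: str) -> str:
    """Invert compress_text."""
    raw = base64.b64decode(packed)
    return zlib.decompress(raw).decode("utf-8")


original = "ooooooWW_" * 1800  # made-up 16200-character sample
packed = compress_text(original)
assert decompress_text(packed) == original
print(len(original), "->", len(packed))
```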

Upvotes: 2
