SatheeshJM
SatheeshJM

Reputation: 3633

Text Compression - What algorithm to use

I need to compress some text data of the form

[70,165,531,0|70,166,562|"hi",167,578|70,171,593|71,179,593|73,188,609|"a",1,3|

The data contains a few thousand characters(10000 - 50000 approx).

I read upon the various compression algorithms, but cannot decide which one to use here.

The important thing here is : The compressed string should contain only alphanumberic characters(or a few special characters like +-/&%@$..) I mean most algorithms provide gibberish ascii characters as compressed data right? That must be avoided.

Can someone guide me on how to proceed here?

P.S The text contains numbers , ' and the | character predominantly. Other characters occur very very rarely.

Upvotes: 1

Views: 3307

Answers (2)

fvu
fvu

Reputation: 32953

Actually your requirement to limit the output character set to printable characters automatically costs you 25% of your compression gain, as out of 8 bits per by you'll end up using roughly 6.

But if that's what you really want, you can always base64 or the more space efficient base85 the output to reconvert the raw bytestream to printable characters.

Regarding the compression algorithm itself, stick to one of the better known ones like gzip or bzip2, for both well tested open source code exists.

Selecting "the best" algorithm is actually not that easy, here's an excerpt of the list of questions you have to ask yourself:

  1. do i need best speed on the encoding or decoding side (eg bzip is quite asymmetric)
  2. how important is memory efficiency both for the encoder and the decoder? Could be important for embedded applications
  3. is the size of the code important, also for embedded
  4. do I want pre existing well tested code for encoder or decorder or both only in C or also in another language
  5. and so on

The bottom line here is probably, take a representative sample of your data and run some tests with a couple of existing algorithms, and benchmark them on the criteria that are important for your use case.

Upvotes: 7

kratenko
kratenko

Reputation: 7592

Just one thought: You can solve your two problems independently. Use whatever algorithm gives you the best compression (just try out a few on your kind of data. bz2, zip, rar -- whatever you like, and check the size), and then to get rid of the "gibberish ascii" (that's actually just bytes there...), you can encode your compressed data with Base64.

If you really put much thought into it, you might find a better algorithm for your specific problem, since you only use a few different chars, but if you stumble upon one, I think it's worth a try.

Upvotes: 4

Related Questions