Reputation: 1642
I am just wondering if someone could suggest an algorithm that compresses Unicode text to 10-20 percent of its original size. I have read about the Lempel-Ziv compression algorithm, which reduces text to about 60% of its original size, but I've heard there are algorithms that can reach that level of compression.
Upvotes: 3
Views: 6158
Reputation: 1147
PAQ is the new reigning champion of text compression. There are a few different flavors, and information about them can be found here.
There are three flavors that I recommend:
You have to build them yourself from source; fortunately, someone made a GUI, FrontPAQ, that packages the two best binaries into one.
Once you have a functional binary, it's simple to use; the documentation can be found here.
Note: I am aware this is a very old question, but I wish to include relevant modern data. I came looking for an answer to the same question and found a newer, more powerful one.
Upvotes: 2
Reputation: 2200
LZ-like coders are not very good for text compression. The best one for direct use with Unicode would be LZMA, though, as it has position alignment options. (http://www.7-zip.org/sdk.html)
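For instance, Python's lzma module exposes those alignment parameters; a minimal sketch, assuming UTF-16LE input whose 2-byte structure is what the lp/pb options can exploit (the parameter values here are an illustrative assumption, not tuned):

    import lzma

    # UTF-16 text has a 2-byte structure, so hinting LZMA about that
    # alignment via the lp/pb "position alignment" parameters can help.
    data = ("sample text, repeated for effect. " * 500).encode("utf-16-le")

    filters = [{"id": lzma.FILTER_LZMA2, "preset": 9,
                "lp": 1,   # literal position bits: model 2-byte alignment
                "pb": 1}]  # position bits used for match positions
    packed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)
    print(len(data), "->", len(packed))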
But for the best compression, I'd suggest converting Unicode text to a bytewise format, e.g. UTF-8, and then using an algorithm with known good results on text, e.g. BWT (http://libbsc.com) or PPMd (http://compression.ru/ds/ppmdj1.rar).
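A rough sketch of that pipeline, using Python's bz2 (also BWT-based) as a stand-in for a stronger coder such as libbsc; the input file name is hypothetical:

    import bz2

    # Re-encode the text bytewise (UTF-8), then feed it to a BWT-based
    # compressor. bz2 stands in here for a stronger coder like libbsc.
    text = open("modeling_utf16.txt", encoding="utf-16").read()  # hypothetical file
    utf8_bytes = text.encode("utf-8")
    print(len(utf8_bytes), "->", len(bz2.compress(utf8_bytes, 9)))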
Some preprocessing can also be applied to improve text compression results (see http://xwrt.sourceforge.net/). And there are compressors with even better ratios than the suggested ones (mostly PAQ derivatives), but they're also much slower.
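To give the flavor of such preprocessing, here is a toy word-replacement transform in the spirit of XWRT (an assumption for illustration, not the real XWRT format; the input file name is hypothetical):

    import zlib
    from collections import Counter

    def preprocess(text, top=200):
        # Map the most frequent longer words to single private-use
        # characters so the backend compressor sees a more regular stream.
        words = text.split(" ")
        common = [w for w, _ in Counter(words).most_common(top) if len(w) > 3]
        table = {w: chr(0xE000 + i) for i, w in enumerate(common)}
        return " ".join(table.get(w, w) for w in words)

    text = open("sample.txt", encoding="utf-8").read()  # hypothetical input
    plain = zlib.compress(text.encode("utf-8"), 9)
    pre = zlib.compress(preprocess(text).encode("utf-8"), 9)
    print(len(plain), "vs", len(pre))  # the transformed stream is often smaller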
Here I tested various representations of a Russian translation of Witten's "Modeling for text compression":
    file                  size    7z     rar4   paq8px69
    modeling_win1251.txt  156091  50227  42906  36254
    modeling_utf16.txt    312184  52523  50311  38497
    modeling_utf8.txt     238883  53793  44231  37681
    modeling_bocu.txt     165313  53073  44624  38768
    modeling_scsu.txt     156261  50499  42984  36485
It shows that longer input doesn't necessarily mean better overall compression, and that SCSU, although useful, isn't really the best representation of Unicode text (the win1251 codepage is just as compact here).
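This kind of experiment is easy to reproduce; a sketch, with the caveat that Python's lzma stands in for 7z/rar4/paq8 here, so absolute numbers will differ (the source file name is hypothetical):

    import lzma

    # Encode the same text in several representations and compare how
    # well each one compresses.
    text = open("modeling.txt", encoding="utf-8").read()  # hypothetical source
    for enc in ("cp1251", "utf-16-le", "utf-8"):
        data = text.encode(enc, errors="replace")
        print(enc, len(data), "->", len(lzma.compress(data, preset=9)))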
Upvotes: 3
Reputation: 6246
If you are considering only text compression, then the very first algorithm to look at is Huffman coding, which uses entropy-based encoding.
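A minimal sketch of Huffman code construction (illustrative only: it builds the code table but does no actual bit packing):

    import heapq
    from collections import Counter

    # Nodes are [frequency, [symbol, code], ...]; the two least frequent
    # nodes are merged repeatedly, prefixing '0'/'1' to the codes below them.
    def huffman_codes(text):
        heap = [[freq, [sym, ""]] for sym, freq in Counter(text).items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            lo, hi = heapq.heappop(heap), heapq.heappop(heap)
            for pair in lo[1:]:
                pair[1] = "0" + pair[1]   # left branch
            for pair in hi[1:]:
                pair[1] = "1" + pair[1]   # right branch
            heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        return dict(heap[0][1:])

    message = "this is an example of huffman encoding"
    codes = huffman_codes(message)
    bits = sum(len(codes[ch]) for ch in message)
    print(bits, "bits vs", 8 * len(message), "bits uncompressed")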
Then there is LZW compression, which uses dictionary encoding: previously seen sequences of letters are assigned codes to reduce the size of the file.
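A textbook LZW sketch: the dictionary starts with the distinct characters of the input and grows with every new sequence seen, emitting one integer code per longest known sequence.

    def lzw_compress(text):
        # Initial dictionary: one entry per distinct input character.
        dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
        w, codes = "", []
        for ch in text:
            wc = w + ch
            if wc in dictionary:
                w = wc
            else:
                codes.append(dictionary[w])
                dictionary[wc] = len(dictionary)  # grow the dictionary
                w = ch
        if w:
            codes.append(dictionary[w])
        return codes

    print(lzw_compress("TOBEORNOTTOBEORTOBEORNOT"))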
I think the above two are sufficient for encoding text data efficiently, and both are easy to implement.
Note: Do not expect good compression on all files. If the data is random with no pattern, then no compression algorithm can give you any compression at all. The achievable compression ratio depends on the symbols appearing in the file, not only on the algorithm used.
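A quick check of that last point (illustrative, not a proof): random bytes have no pattern to exploit, while repetitive text shrinks dramatically.

    import os, zlib

    random_data = os.urandom(10000)
    text_data = b"the quick brown fox jumps over the lazy dog " * 250
    print(len(zlib.compress(random_data, 9)))  # about 10000, no gain
    print(len(zlib.compress(text_data, 9)))    # a small fraction of 11000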
Upvotes: 6