Reputation: 1463
I have a huge file made up of the characters "0", "1", "2", "3". No spaces, not anything else. Just these 4 characters. I have used bzip2 to compress it and the file decreased size from X to 0.05*X. I'd like to know what were the strings/patterns that bzip2 found to achieve that compressed version of the file (e.g. 0123213232, 0123121212222112, etc.). Is there a simple way to extract that information either from the actual bz2 file or by running bzip2 with some special command-line option?
If you know the answer for some other existing compression program, I'll also be interested in hearing about that.
Thanks for any help.
Best, Surikator.
Upvotes: 2
Views: 294
Reputation: 3176
Bzip2 uses Burrows-Wheeler transform to turn repeated byte sequences into sequences of the same byte in a reversible way. Then it uses the move-to-front algorithm to transform repeated bytes to zero sequences. After that it uses huffmann coding to assign shorter symbols to more frequent bytes (probably the zeros). You can find more details on the wikipedia page.
Upvotes: 3
Reputation: 20038
It is also called block-sorting. If you do not like reading Wikipedia, then read Mathematical foundations of computer science 1999: http://books.google.ee/books?id=OcJjpqAi15EC&pg=PA34&lpg=PA34&dq=mathematica+Burrows%E2%80%93Wheeler+transform&source=bl&ots=KaOOIPJcKC&sig=5PzHG9UQeg3opr1FUMq8mPAxfn4&hl=et&ei=Y6vPTLfVFsqCOozvvPcE&sa=X&oi=book_result&ct=result&resnum=1&ved=0CBMQ6AEwAA#v=onepage&q&f=false
For a input of: "this is an example of a huffman tree"
. Binary tree like this is built:
It is then used to build coding table:
Char ' ' nr(32) | binary:00100000 | new binary:111
Char 'a' nr(97) | binary:01100001 | new binary:001
Char 'e' nr(101) | binary:01100101 | new binary:000
Char 'f' nr(102) | binary:01100110 | new binary:1101
Char 'h' nr(104) | binary:01101000 | new binary:1100
Char 'i' nr(105) | binary:01101001 | new binary:1001
Char 'l' nr(108) | binary:01101100 | new binary:01101
Char 'm' nr(109) | binary:01101101 | new binary:1000
Char 'n' nr(110) | binary:01101110 | new binary:1011
Char 'o' nr(111) | binary:01101111 | new binary:01100
Char 'p' nr(112) | binary:01110000 | new binary:01111
Char 'r' nr(114) | binary:01110010 | new binary:01110
Char 's' nr(115) | binary:01110011 | new binary:1010
Char 't' nr(116) | binary:01110100 | new binary:0101
Char 'u' nr(117) | binary:01110101 | new binary:01001
Char 'x' nr(120) | binary:01111000 | new binary:01000
New binary can only be read, if you have the same tree, so that is also backed in output. Also length of the data is store, because sum of new binary's is not full byte number.
You can just read
Upvotes: 1
Reputation: 2420
bzip2 doesn't have an option for this, and it doesn't work exactly like I think you think it works. Regardless, you can find code for the various pieces in the algorithm. As @stribika mentioned, it uses the Burrows-Wheeler and move to front algorithms before pumping it through a Huffman encoder. Google should get you some results for a Burrow's Wheeler transform in the language of your choice.
However, based on what you're looking for, I think you want more of a dictionary style encoder. You might be interested in an LZW algorithm:
http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
It will build up a dictionary of strings like you showed.
Upvotes: 1