Reputation: 1463

How can I get the patterns spotted by bzip2? (or any other compression algorithm)

I have a huge file made up of the characters "0", "1", "2", "3". No spaces, not anything else. Just these 4 characters. I have used bzip2 to compress it and the file decreased size from X to 0.05*X. I'd like to know what were the strings/patterns that bzip2 found to achieve that compressed version of the file (e.g. 0123213232, 0123121212222112, etc.). Is there a simple way to extract that information either from the actual bz2 file or by running bzip2 with some special command-line option?

If you know the answer for some other existing compression program, I'll also be interested in hearing about that.

Thanks for any help.

Best, Surikator.

Upvotes: 2

Answers (3)

stribika

Reputation: 3176

Bzip2 uses Burrows-Wheeler transform to turn repeated byte sequences into sequences of the same byte in a reversible way. Then it uses the move-to-front algorithm to transform repeated bytes to zero sequences. After that it uses huffmann coding to assign shorter symbols to more frequent bytes (probably the zeros). You can find more details on the wikipedia page.

Upvotes: 3

Margus

Reputation: 20038

Burrows-Wheeler transform

It is also called block-sorting. If you do not like reading Wikipedia, then read Mathematical foundations of computer science 1999: http://books.google.ee/books?id=OcJjpqAi15EC&pg=PA34&lpg=PA34&dq=mathematica+Burrows%E2%80%93Wheeler+transform&source=bl&ots=KaOOIPJcKC&sig=5PzHG9UQeg3opr1FUMq8mPAxfn4&hl=et&ei=Y6vPTLfVFsqCOozvvPcE&sa=X&oi=book_result&ct=result&resnum=1&ved=0CBMQ6AEwAA#v=onepage&q&f=false

Huffman coding

For a input of: "this is an example of a huffman tree". Binary tree like this is built:

alt text

It is then used to build coding table:

 Char ' ' nr(32)    | binary:00100000 | new binary:111
 Char 'a' nr(97)    | binary:01100001 | new binary:001
 Char 'e' nr(101)   | binary:01100101 | new binary:000
 Char 'f' nr(102)   | binary:01100110 | new binary:1101
 Char 'h' nr(104)   | binary:01101000 | new binary:1100
 Char 'i' nr(105)   | binary:01101001 | new binary:1001
 Char 'l' nr(108)   | binary:01101100 | new binary:01101
 Char 'm' nr(109)   | binary:01101101 | new binary:1000
 Char 'n' nr(110)   | binary:01101110 | new binary:1011
 Char 'o' nr(111)   | binary:01101111 | new binary:01100
 Char 'p' nr(112)   | binary:01110000 | new binary:01111
 Char 'r' nr(114)   | binary:01110010 | new binary:01110
 Char 's' nr(115)   | binary:01110011 | new binary:1010
 Char 't' nr(116)   | binary:01110100 | new binary:0101
 Char 'u' nr(117)   | binary:01110101 | new binary:01001
 Char 'x' nr(120)   | binary:01111000 | new binary:01000

New binary can only be read, if you have the same tree, so that is also backed in output. Also length of the data is store, because sum of new binary's is not full byte number.

Open software

You can just read

manual pages: http://www.bzip.org/1.0.5/bzip2-manual-1.0.5.html
or source : http://www.bzip.org/1.0.6/bzip2-1.0.6.tar.gz

Upvotes: 1

xscott

Reputation: 2420

bzip2 doesn't have an option for this, and it doesn't work exactly like I think you think it works. Regardless, you can find code for the various pieces in the algorithm. As @stribika mentioned, it uses the Burrows-Wheeler and move to front algorithms before pumping it through a Huffman encoder. Google should get you some results for a Burrow's Wheeler transform in the language of your choice.

However, based on what you're looking for, I think you want more of a dictionary style encoder. You might be interested in an LZW algorithm:

http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch

It will build up a dictionary of strings like you showed.