Reputation: 133
According to RFC1951:
3.2.7. Compression with dynamic Huffman codes (BTYPE=10)
The Huffman codes for the two alphabets appear in the block
immediately after the header bits and before the actual
compressed data, first the literal/length code and then the
distance code. Each code is defined by a sequence of code
lengths, as discussed in Paragraph 3.2.2, above. For even
greater compactness, the code length sequences themselves are
compressed using a Huffman code.
I am really sorry but mostly what i understood is not from RFC1951 but from answers here. The statement above does not give a clear picture to a beginner or any visual context of what the data looks like.
I read that they are two trees from zlib's doc algorithm.txt:
Literals or match lengths are compressed with one Huffman tree, and
match distances are compressed with another tree. The trees are stored
in a compact form at the start of each block.
As we all can see that the later points out clearly than the RFC1951, Which brings the question.
What comes after (HCLEN+4)*3 bits?
I know that you add 4 to the decimal value of the 4 bits HCLEN then extract the exact total of that number from bit stream where each extract is 3 bit number (0 to 7) and Thanks M. Adler for clearing out how bits are read from right to left.
=====================================
bitstream data
_____________________________________
|17|57|????????
Where does the first tree end?
My goal is to identify why 2 inputs sharing a prefix aren't sharing a few bytes somewhere in the compressed stream. I know its because of the Block header but since dynamic blocks construct trees based on input data then something has to be identical in theory because it's encoding not hashing, so the headers themselves should have something similar like this:
raw block data at (19, 3)
Dynamic Huffman tree: length codes: 13, distances codes: 29, code_lengths_length: 16
lengths: [4, 0, 6, 4, 5, 4, 3, 3, 5, 6, 5, 2, 0, 0, 0, 0, 3, 5, 5]
raw block data at (19, 3)
Dynamic Huffman tree: length codes: 9, distances codes: 29, code_lengths_length: 16
lengths: [4, 0, 6, 5, 3, 3, 3, 3, 5, 6, 5, 3, 0, 0, 0, 0, 3, 5, 5]
raw block data at (20, 3)
Dynamic Huffman tree: length codes: 9, distances codes: 30, code_lengths_length: 16
lengths: [3, 0, 5, 0, 6, 3, 4, 5, 6, 6, 6, 2, 0, 0, 0, 0, 2, 5, 5]
deflate-blocktype 2 dyna huff beginning at (16, 0)
raw block data at (16, 3)
Dynamic Huffman tree: length codes: 18, distances codes: 30, code_lengths_length: 16
lengths: [4, 0, 6, 6, 4, 4, 3, 3, 5, 5, 5, 2, 0, 0, 0, 0, 3, 5, 5]
raw block data at (17, 3)
Dynamic Huffman tree: length codes: 18, distances codes: 30, code_lengths_length: 16
lengths: [4, 0, 6, 5, 4, 4, 3, 3, 5, 6, 5, 2, 0, 0, 0, 0, 3, 5, 5]
raw block data at (17, 3)
Dynamic Huffman tree: length codes: 18, distances codes: 30, code_lengths_length: 16
lengths: [4, 0, 5, 5, 4, 3, 3, 3, 4, 5, 5, 3, 0, 0, 0, 0, 3, 5, 5]
4C 9D D5 CE 35 3D 19 40 CF 49
6C 9C D7 B2 F3 4A 15 84 EF A9
4C 9C D7 B2 F3 BA 0D 85 EF 33
94 9D D7 B2 EB 4A 11 86 EF A9
94 9D D7 92 EB BA 11 45 DF 5D
Upvotes: 0
Views: 71
Reputation: 112502
What comes after (HCLEN+4)*3 bits?
What RFC 1951 says comes after it:
HLIT + 257 code lengths for the literal/length alphabet,
encoded using the code length Huffman code
Where does the first tree end?
It's not a tree, and I think you mean the second Huffman code description. The first Huffman code description was the (HCLEN+4)*3
bits.
The second code description ends when HLIT + 257
lengths from the Huffman codes using the first code description have been read.
RFC 1951 goes on to say:
HDIST + 1 code lengths for the distance alphabet,
encoded using the code length Huffman code
So the third code description ends when HDIST + 1
lengths from the Huffman codes using the first code description have been read.
There may be fewer Huffman codes than lengths, since some of the codes represent repeated zero lengths or repeats of the last length.
Also, as noted in RFC 1951:
The code length repeat codes can cross from HLIT + 257 to the
HDIST + 1 code lengths. In other words, all code lengths form
a single sequence of HLIT + HDIST + 258 values.
So it is really a sequence of Huffman codes from the first code description, plus associated extra bits as defined for repeat codes, that describe HLIT + HDIST + 258
lengths. Once you have that many lengths, the header has ended and the first length/literal code of the data follows.
Upvotes: 0