hipstersmoothie
hipstersmoothie

Reputation: 57

LZW Compression with Entire unicode library

I am trying to do this problem:

Assume we have an initial alphabet of the entire Unicode character set, instead of just all the possible byte values. Recall that unicode characters are unsigned 2-byte values, so this means that each 2 bytes of uncompressed data will be treated as one symbol, and we'll have an alphabet with over 60,000 symbols. (Treating symbols as 2-byte Unicodes, rather than a byte at a time, makes for better compression in the case of internationalized text.) And, note, there's nothing that limits the number of bits per code to at most 16. As you generalize the LZW algorithm for this very large alphabet, don't worry if you have some pretty long codes.

With this, give the compressed version of this four-symbol sequence, using our project assumptions, including an EOD code, and grouping into 4-byte ints. (These three symbols are Unicode values, represented numerically.) Write your answer as 3 8-digit hex values, space separated, using capital hex digits, not lowercase.

32767 32768 32767 32768

The problem I am having is that I don't know the entire range of the alphabet, so when doing LZW compression I don't know what byte value the new codes will have. Stemming from that problem I also don't know the the EOD code will be.

Also, it seems to me that it will only take two integers the compressed data.

Upvotes: 1

Views: 1322

Answers (1)

Alexey Frunze
Alexey Frunze

Reputation: 62058

The problem statement is ill-formed.

In Unicode, as we know it today, code points (those numbers that represent characters, composable parts of characters and other useful but more sneaky things) cannot be all numbered from 0 to 65535 to fit into 16 bits. There are more than 100 thousand of Chinese, Japanese and Korean characters in Unicode. Clearly, you'd need 17+ bits just for those. So, Unicode clearly cannot be the correct option here.

OTOH, there exist a sort of "abridged" version of Unicode, Universal Character Set, whose UCS-2 encoding uses 16-bit code points and can technically be used for at most 65536 characters and the like. Those characters with codes greater than 65535 are, well, unlucky, you can't have them with UCS-2.

So, if it's really UCS-2, you can download its specification (ISO/IEC 10646, I believe) and figure out exactly which codes out of those 64K are used and thus should form your initial LZW alphabet.

Upvotes: 2

Related Questions