How do pack and unpack guess the character encoding when converting to and from UTF-8?

Question

Suppose I want to convert "\xBD" to UTF-8.

If I use pack & unpack, I'll get ½:

puts "\xBD".unpack('C*').pack('U*')    #=> ½

as "\xBD" is ½ in ISO-8859-1.

BUT "\xBD" is œ in ISO-8859-15.

My question is: why did pack use ISO-8859-1 instead of ISO-8859-15 to convert the character to UTF-8? Is there some way to configure that character encoding?

I know I can use Iconv in Ruby 1.8.7, and String#encode in 1.9.2, but I'm curious about pack because I use it in some code.

Kelvin · Accepted Answer

This actually has nothing to do with how \xBD is represented in ISO-8859-x. The critical part is the pack into UTF-8.

The pack receives [189]. Code point 189 (U+00BD) is defined in Unicode, which UTF-8 encodes, as ½. Don't think of this as pack "preferring" ISO-8859-1 over some other ISO-8859 variant: pack never consults any ISO-8859 table. The first 256 Unicode code points were laid out to match ISO-8859-1, which is why the result looks like an ISO-8859-1 conversion.
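If the goal is to choose the source encoding explicitly, String#encode (mentioned in the question, Ruby 1.9+) accepts it as an argument. The following is only a sketch: to_utf8 is a hypothetical helper name, and ISO-8859-15 is used as an example of an encoding in which byte 0xBD is œ.

```ruby
# Hypothetical helper: transcode raw bytes to UTF-8, naming the source encoding.
def to_utf8(bytes, source_encoding)
  # String#encode(dst, src) interprets the bytes as `src` before converting.
  bytes.encode('UTF-8', source_encoding)
end

latin1 = to_utf8("\xBD", 'ISO-8859-1')    # U+00BD, the fraction one half
latin9 = to_utf8("\xBD", 'ISO-8859-15')   # U+0153, the oe ligature

puts latin1.unpack('U*').inspect   # [189]
puts latin9.unpack('U*').inspect   # [339]
```

The same input byte ends up as two different code points because the source encoding is stated up front, which is something pack('U*') has no parameter for.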

Since you're trying to learn more about pack/unpack, let me explain more:

When you unpack with the C directive, Ruby treats the string as raw bytes (ASCII-8BIT) and extracts each byte as an unsigned 8-bit integer. In this case \xBD becomes 0xBD, a.k.a. 189. This is a really basic conversion; no character encoding is involved.
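A small sketch of the C directive working on bytes rather than characters (byte_values is a hypothetical helper name, not part of Ruby):

```ruby
# Hypothetical helper: list the byte values of a string via the C directive.
def byte_values(str)
  str.unpack('C*')   # one unsigned 8-bit integer per byte
end

puts byte_values("\xBD").inspect     # a single byte: [189]
puts byte_values("\u00BD").inspect   # ½ stored as UTF-8 is two bytes: [194, 189]
```

The second call shows that the result depends only on the bytes in the string, not on which character they are meant to represent.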

When you pack with the U directive, Ruby treats each integer in the array as a Unicode code point and writes out its UTF-8 encoding. There is no source encoding to configure at this step.
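The U directive can be tried on its own, with no input string involved at all. Again this is just a sketch, and from_code_points is a hypothetical helper name:

```ruby
# Hypothetical helper: build a UTF-8 string from Unicode code points.
def from_code_points(code_points)
  code_points.pack('U*')
end

half = from_code_points([189])

puts half.encoding               # UTF-8
puts half.bytes.to_a.inspect     # [194, 189]
puts half.unpack('U*').inspect   # [189]
```

Unpacking with U reverses the step and returns the code points, so [189] round-trips through pack and unpack unchanged.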

pack/unpack have very specific behavior depending on the directives you provide them. I suggest reading the documentation on ruby-doc.org. Some of the directives still don't make sense to me, so don't be discouraged.
