Reputation: 5545
Suppose I want to convert "\xBD" to UTF-8. If I use pack and unpack, I'll get ½:
puts "\xBD".unpack('C*').pack('U*') #=> ½
That matches ISO-8859-1, where "\xBD" is ½. BUT "\xBD" is œ in ISO-8859-9.
My question is: why did pack use ISO-8859-1 instead of ISO-8859-9 to convert the char to UTF-8? Is there some way to configure that character encoding?
I know I can use Iconv in Ruby 1.8.7, and String#encode in 1.9.2, but I'm curious about pack because I use it in some code.
Upvotes: 2
Views: 2302
Reputation: 20902
This actually has nothing to do with how \xBD is represented in ISO-8859-x. The critical part is the pack into UTF-8.
pack receives [189]. Code point 189 (U+00BD) is defined in Unicode as ½, and the U directive simply encodes that code point as UTF-8. pack is not consulting ISO-8859-1 or any other ISO-8859-x table here. The result looks like a Latin-1 conversion only because the first 256 Unicode code points were defined to match ISO-8859-1.
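To make the point concrete, here is a minimal sketch (assuming Ruby 1.9 or later, where strings carry an encoding tag). The variable names are illustrative only, and ISO-8859-15 is picked purely as a contrasting charset in which byte 0xBD maps to œ. It suggests that the unpack/pack round trip ignores the source encoding, while String#encode honours it:

```ruby
# The same single byte, tagged with two different source encodings.
latin1  = "\xBD".dup.force_encoding("ISO-8859-1")
latin15 = "\xBD".dup.force_encoding("ISO-8859-15")

# unpack('C*') only sees the raw byte and pack('U*') only sees the
# integer 189, so the encoding tag has no effect on the result.
via_pack = [latin1, latin15].map { |s| s.unpack('C*').pack('U*') }
p via_pack

# String#encode transcodes from the tagged source encoding instead.
via_encode = [latin1, latin15].map { |s| s.encode("UTF-8") }
p via_encode
```

Both entries of via_pack should be ½, whereas via_encode should contain ½ followed by œ. So if the source charset matters, String#encode (or Iconv on 1.8.7) is the place to say so, not pack.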
Since you're trying to learn more about pack/unpack, let me explain more:
When you unpack with the C directive, Ruby treats the string as raw bytes (ASCII-8BIT) and extracts each byte as an unsigned 8-bit integer. In this case \xBD becomes 0xBD, a.k.a. 189. This is a really basic conversion.
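As a small illustration of the C directive (a sketch with made-up variable names, not taken from the original code), each byte comes back as an integer between 0 and 255, one per byte and not one per character:

```ruby
raw = "\xBD"

# 'C' reads one unsigned 8-bit integer; '*' repeats it for every byte.
bytes = raw.unpack('C*')
p bytes                     # [189]
p bytes.first == 0xBD       # true

# A string that is already UTF-8 yields one integer per byte.
# U+00BD (½) is stored as the two bytes 0xC2 0xBD.
utf8_bytes = "\u00BD".unpack('C*')
p utf8_bytes                # [194, 189]
```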
When you pack with the U directive, Ruby treats each integer in the array as a Unicode code point and emits its UTF-8 encoding. No source character set is involved.
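The U directive can be sketched the same way (again with illustrative variable names, assuming Ruby 1.9 or later for the encoding methods). The integer goes in as a code point and comes out as UTF-8 bytes, and unpack('U*') reverses the step:

```ruby
# 189 is taken as code point U+00BD and encoded as UTF-8.
half = [189].pack('U*')
p half.encoding == Encoding::UTF_8   # true
p half.unpack('C*')                  # [194, 189]

# unpack('U*') decodes UTF-8 back into code points.
p half.unpack('U*')                  # [189]

# To get œ from pack, you would pass its own code point, U+0153.
oe = [0x153].pack('U*')
p oe.unpack('C*')                    # [197, 147]
```

In other words, getting a different character out of pack means passing a different integer in; there is no charset setting to change.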
pack and unpack behave very specifically depending on the directives you provide. I suggest reading the documentation on ruby-doc.org. Some of the directives still don't make sense to me, so don't be discouraged.
Upvotes: 4