Bulwersator

Reputation: 1132

Ruby, pack encoding (ASCII-8BIT that cannot be converted to UTF-8)

puts "C3A9".lines.to_a.pack('H*').encoding

results in

ASCII-8BIT

but I want this text in UTF-8. However,

"C3A9".lines.to_a.pack('H*').encode("UTF-8")

results in

`encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)

Why? How can I convert it to UTF-8?

Upvotes: 1

Views: 7538

Answers (2)

mu is too short

Reputation: 434665

You're going about this the wrong way. If you have URI-encoded data like this:

%C5%BBaba

Then you should use URI.unescape to decode it:

1.9.2-head :004 > URI.unescape('%C5%BBaba')
 => "Żaba"

If that doesn't work then force the encoding to UTF-8:

1.9.2-head :004 > URI.unescape('%C5%BBaba').force_encoding('utf-8')
 => "Żaba"
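For readers on a newer Ruby: URI.unescape was removed in Ruby 3.0, so the call above only works on older versions. A minimal sketch of the same idea, assuming URI.decode_www_form_component is an acceptable substitute (note that it also turns "+" into a space, which URI.unescape did not):

```ruby
require 'uri'

# Percent-decode with a method that still exists in current Ruby.
# The result is tagged as UTF-8 by default.
decoded = URI.decode_www_form_component('%C5%BBaba')
puts decoded

# Fallback from the answer: if the string comes back with the wrong
# encoding label, relabel it. force_encoding does not change any bytes.
decoded = decoded.force_encoding('UTF-8') unless decoded.encoding == Encoding::UTF_8
puts decoded.valid_encoding?
```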

Upvotes: 6

Linuxios

Reputation: 35803

ASCII-8BIT is a pretend encoding native to Ruby. It is aliased as BINARY, and that is exactly what it is: not a character encoding, but a way of saying that a string holds binary data and should not be processed as text. Because pack and unpack are designed to operate on binary data, you should never assume that what is returned is printable under any encoding unless the ENTIRE pack string is made up of character directives. If you clarify what the overall goal is, maybe we could give you a better solution.
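To make the difference concrete, here is a small sketch using the bytes from the question. It assumes the hex string really does contain valid UTF-8 bytes; encode tries to transcode byte by byte and fails, while force_encoding only changes the label on the string:

```ruby
# pack('H*') returns raw bytes tagged as binary data.
bytes = ['C3A9'].pack('H*')
is_binary = bytes.encoding == Encoding::BINARY # BINARY is an alias of ASCII-8BIT

# encode attempts a real conversion. Bytes >= 0x80 have no defined
# mapping from binary to UTF-8, so this raises.
begin
  bytes.encode('UTF-8')
  transcode_failed = false
rescue Encoding::UndefinedConversionError
  transcode_failed = true
end

# force_encoding leaves the bytes alone and only relabels them.
# dup keeps the original binary string untouched.
relabelled = bytes.dup.force_encoding('UTF-8')
puts relabelled
puts relabelled.valid_encoding?
```

Checking valid_encoding? afterwards is a cheap way to confirm that the bytes really were UTF-8 and not something else.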


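If you isolate a hex code into a variable, say code, which is a string of hexadecimal digits without the percent sign:

utf_char = [code.to_i(16)].pack("U")

Note that pack("U") treats the number as a Unicode code point, not as raw UTF-8 bytes, so code must hold the code point (for example "17B" for Ż, not the bytes "C5BB"). Combine the resulting characters with the rest of the string to build your final string.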
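A short sketch of both routes, for comparison. The values of code and byte_hex are hypothetical examples, not taken from the question; the first is a Unicode code point in hex, the second is the UTF-8 byte sequence for the same character:

```ruby
# Route 1: the hex string is a Unicode code point (U+017B is "Ż").
code = '17B'
utf_char = [code.to_i(16)].pack('U') # pack('U') returns a UTF-8 string
word = utf_char + 'aba'
puts word

# Route 2: the hex string is the UTF-8 byte sequence, as found in
# percent-encoded data with the percent signs stripped.
byte_hex = 'C5BB'
from_bytes = [byte_hex].pack('H*').force_encoding('UTF-8')
puts from_bytes == utf_char
```

Feeding UTF-8 bytes into pack("U") would silently produce a different character, so it is worth knowing which kind of hex you have.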

Upvotes: 4
