Tim Reynolds

Reputation: 734

Ruby: Convert encoded character to actual UTF-8 character

Ruby will not play nice with UTF-8 strings. I am passing data in an XML file, and although the XML document is declared as UTF-8, Ruby treats the two bytes of each encoded character as individual characters.

I have started encoding the input strings in the '\uXXXX' format, but I cannot figure out how to convert this to an actual UTF-8 character. I have searched all over this site and Google to no avail, and my frustration is pretty high right now. I am using Ruby 1.8.6.

Basically, I want to convert the string '\u03a3' -> "Σ".

What I had is:

data.gsub /\\u([a-zA-Z0-9]{4})/,  $1.hex.to_i.chr

This of course gives a "931 out of char range" error.

Thank you, Tim

Upvotes: 4

Views: 8937

Answers (4)

3limin4t0r

Reputation: 21110

You can pass an encoding to Integer#chr (Ruby 1.9 and later):

chr([encoding]) → string

Returns a string containing the character represented by the int's value according to encoding.

65.chr    #=> "A"
230.chr   #=> "\xE6"
255.chr(Encoding::UTF_8)   #=> "\u00FF"

So instead of using .chr, use .chr(Encoding::UTF_8).
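As a rough sketch of how this fits the question, assuming Ruby 1.9 or later (the encoding argument is not available in 1.8.6), the substitution can be done inside a gsub block. The helper name decode_escapes is made up for illustration:

```ruby
# Hypothetical helper, not part of any library.
# Requires Ruby 1.9+ because Integer#chr only accepts an encoding there.
def decode_escapes(text)
  # Match a literal backslash, "u", and exactly four hex digits.
  text.gsub(/\\u([0-9a-fA-F]{4})/) do
    # $1 holds the hex digits of the current match; convert them
    # to an integer code point, then to a UTF-8 character.
    $1.hex.chr(Encoding::UTF_8)
  end
end

puts decode_escapes('\u03a3')
```

The block form is used because $1 is only meaningful while a match is being processed; passing $1.hex.chr(...) as a plain second argument would evaluate it before gsub runs.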

Upvotes: 1

webtu

Reputation: 71

Try this:

[0x50].pack("U")

where 0x50 is the Unicode code point of the character, written in hex. The "U" directive packs it as UTF-8.
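Applied to the '\uXXXX' strings from the question, this might look like the sketch below. The method name unescape_unicode is hypothetical, and the code only handles four-digit escapes; characters written as surrogate pairs are not covered. Since pack("U") does not depend on the 1.9 encoding API, this approach should also be usable on 1.8.6.

```ruby
# Hypothetical helper, not part of any library.
# Replaces each literal \uXXXX escape with the UTF-8 character it names.
def unescape_unicode(text)
  text.gsub(/\\u([0-9a-fA-F]{4})/) do
    # $1 is read inside the block, once per match.
    # pack("U") encodes the integer code point as UTF-8.
    [$1.hex].pack("U")
  end
end

puts unescape_unicode('\u03a3')
```

The block matters here: in the original attempt, $1 was evaluated as an ordinary argument before gsub had matched anything.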

Upvotes: 7

Jonathan Lonowski

Reputation: 123453

Ruby (at least, 1.8.6) doesn't have full Unicode support. Integer#chr only accepts values from 0 to 255; anything above the ASCII range is shown in octal notation ('\377').

To demonstrate:

irb(main):001:0> 255.chr
=> "\377"
irb(main):002:0> 256.chr
RangeError: 256 out of char range
        from (irb):2:in `chr'
        from (irb):2

You might try upgrading to Ruby 1.9. The chr docs don't explicitly state ASCII, so support may have expanded -- though the examples stop at 255.

Or, you might try investigating ruby-unicode. I've never tried it myself, so I don't know how well it works.

Otherwise, I don't think you can do quite what you want in Ruby, currently.

Upvotes: 1

hrnt

Reputation: 10142

Does something break because Ruby strings treat UTF-8 encoded code points as two characters? If not, then you should not worry too much about it. If something does break, then please add a comment to let us know. It is probably better to solve that problem instead of looking for a workaround.

If you need to do conversions, look at the Iconv library.

In any case, the XML character reference &#x3A3; could be a better alternative to \u03a3. \uXXXX is used in JSON, but not in XML. If you want to parse the \uXXXX format, look at how a JSON library does it.
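One way to lean on a JSON library directly, as a minimal sketch: wrap the text in a JSON array and let the parser decode the escapes. The helper name decode_via_json is hypothetical, and the sketch assumes the json library is available and that the text contains no unescaped double quotes, control characters, or stray backslashes, any of which would make the wrapped string invalid JSON.

```ruby
require "json"

# Hypothetical helper, not part of the json library.
# Assumes text is already valid as the inside of a JSON string.
def decode_via_json(text)
  # Build the JSON document ["<text>"] and take the single
  # decoded element back out of the parsed array.
  JSON.parse(%(["#{text}"])).first
end

puts decode_via_json('\u03a3')
```

Wrapping in an array avoids relying on the parser accepting a bare string as a top-level document, which not every version of the library does by default.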

Upvotes: 3
