Reputation: 734
Ruby will not play nice with UTF-8 strings. I am passing data in an XML file and although the XML document is specified as UTF-8 it treats the ascii encoding (two bytes per character) as individual characters.
I have started encoding the input strings in the '\uXXXX' format, but I cannot figure out how to convert this to an actual UTF-8 character. I have been searching all over this site and Google to no avail, and my frustration is pretty high right now. I am using Ruby 1.8.6.
Basically, I want to convert the string '\u03a3' -> "Σ".
What I had is:
data.gsub /\\u([a-zA-Z0-9]{4})/, $1.hex.to_i.chr
Which of course gives "931 out of char range" error.
Thank you Tim
Upvotes: 4
Views: 8937
Reputation: 21110
You can pass an encoding to Integer#chr:
chr([encoding]) → string
Returns a string containing the character represented by the int's value according to encoding.
65.chr #=> "A"
230.chr #=> "\xE6"
255.chr(Encoding::UTF_8) #=> "\u00FF"
So instead of using .chr, use .chr(Encoding::UTF_8).
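Applied to the substitution from the question, this might look like the sketch below. It assumes Ruby 1.9 or later (the Encoding constant does not exist in 1.8.6), and `data` is a made-up sample input. Note that the replacement is given as a block: in the original call, `$1` is evaluated before gsub runs, so it never sees the match.

```ruby
# Assumes Ruby 1.9+; `data` is a hypothetical sample with literal \uXXXX escapes.
# Single quotes keep the backslash, so '\u03a3' is six characters long.
data = 'Sum: \u03a3, alpha: \u03b1'

# The block runs once per match, so $1 holds that match's four hex digits.
decoded = data.gsub(/\\u([0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }

puts decoded
```

The character class is narrowed to hex digits here, since `[a-zA-Z0-9]` would also accept letters that `hex` cannot interpret.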
Upvotes: 1
Reputation: 71
Try this:
[0x50].pack("U")
where 0x50 is the Unicode code point of the character, written in hex.
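As a rough sketch, the same call can decode every '\uXXXX' escape in a string. `Array#pack` with the "U" directive does not depend on the Encoding class, so this approach should also be usable on Ruby 1.8. The `data` string is a hypothetical sample.

```ruby
# `data` is a hypothetical sample with a literal \u03a3 escape.
data = 'Sigma: \u03a3'

# pack("U") turns an integer code point into its UTF-8 byte sequence.
decoded = data.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }

# "U*" packs several code points at once.
pair = [0x03a3, 0x03b1].pack("U*")

puts decoded
puts pair
```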
Upvotes: 7
Reputation: 123453
Ruby (at least, 1.8.6) doesn't have full Unicode support. Integer#chr only supports ASCII characters and otherwise only up to 255 in octal notation ('\377').
To demonstrate:
irb(main):001:0> 255.chr
=> "\377"
irb(main):002:0> 256.chr
RangeError: 256 out of char range
from (irb):2:in `chr'
from (irb):2
You might try upgrading to Ruby 1.9. The chr docs don't explicitly state ASCII, so support may have expanded -- though the examples stop at 255.
Or, you might try investigating ruby-unicode. I've never tried it myself, so I don't know how well it'll help.
Otherwise, I don't think you can do quite what you want in Ruby, currently.
Upvotes: 1
Reputation: 10142
Does something break because Ruby strings treat a UTF-8 encoded code point as two characters? If not, then you should not worry too much about it. If something does break, then please add a comment to let us know. It is probably better to solve that problem than to look for a workaround.
If you need to do conversions, look at the Iconv library.
In any case, Σ could be a better alternative to \u03a3. \uXXXX is used in JSON, but not in XML. If you want to parse the \uXXXX format, look at how some JSON library does it.
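One way to lean on a JSON library directly, rather than copying its approach, is to wrap the text in a JSON string literal and let the parser decode the escapes. This is only a sketch: it assumes the `json` library is available (bundled with Ruby 1.9 and later, a separate gem on 1.8), and that the hypothetical `raw` input contains no double quotes or other backslash sequences that would make the wrapped text invalid JSON.

```ruby
require 'json'

# Hypothetical input containing a JSON-style \uXXXX escape.
raw = 'Sigma: \u03a3'

# Wrap the text as a one-element JSON array so the parser sees a string literal.
# Assumption: raw has no unescaped double quotes or stray backslashes.
decoded = JSON.parse(%(["#{raw}"])).first

puts decoded
```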
Upvotes: 3