Reputation: 3240
I am just starting to learn Ruby (to eventually move to RoR), but I was just told that Ruby does not support Unicode. Is it true? How do Ruby programmers go about supporting Unicode?
Upvotes: 32
Views: 17944
Reputation: 74945
What you heard is outdated and applies (only partially) to Ruby 1.8 or before. The latest stable version of Ruby (1.9) supports no fewer than 95 different character encodings (counted on my system just now). This includes pretty much all known Unicode Transformation Formats, including UTF-8.
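If you want to repeat that count yourself, here is a minimal sketch using the built-in Encoding class. The number you get depends on your Ruby version and build, so it will not necessarily be 95, and the variable names are only illustrative.

```ruby
# Count the encodings this Ruby build knows about (the number varies by version).
encoding_count = Encoding.list.size
puts "Encodings supported: #{encoding_count}"

# Pick out the Unicode Transformation Formats by name.
utf_names = Encoding.list.map(&:name).grep(/\AUTF-/)
puts "UTF encodings: #{utf_names.join(', ')}"
```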
The previous stable version of Ruby (1.8) has partial support for UTF-8.
If you use Rails, it takes care of default UTF-8 encoding for you. If all you need is UTF-8 encoding awareness, Rails will work for you no matter if you run Ruby 1.9 or Ruby 1.8. If you have very specific character encoding requirements, you should aim for Ruby 1.9.
If you're really interested, here is a series of articles describing the encoding issues in Ruby 1.8 and how they were worked around, and eventually solved in Ruby 1.9. Rails still includes workarounds for many common flaws in Ruby 1.8.
Upvotes: 31
Reputation: 1242
This is quite an old question. The current stable version of Ruby is 2.0.1. Yes, it handles most of the Unicode you can throw at it, but please be aware that it breaks fairly easily.
Take a look at this code sample and results (inspired by this):
# 'ë' below is 'e' + U+0308 (combining diaeresis); 'ﬄ' is the single ligature character U+FB04
["noël", "😸😾", "baﬄe"].each do |str|
  puts "Result for '#{str}'"
  puts "  Size: #{str.size}"
  puts "  Reverse: [#{str.reverse}]"
  puts "  Uppercase: [#{str.upcase}]"
end

Result for 'noël'
  Size: 5 <= bad size (4 visible characters)
  Reverse: [l̈eon] <= accent is shifted
  Uppercase: [NOËL]
Result for '😸😾'
  Size: 2
  Reverse: [😾😸]
  Uppercase: [😸😾]
Result for 'baﬄe'
  Size: 4
  Reverse: [eﬄab] <= doesn't really make sense
  Uppercase: [BAﬄE] <= should be "BAFFLE"
The point is: modern Ruby handles the basics, but more advanced string features shouldn't be counted on.
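The odd sizes in that output are consistent with a combining accent in the first string and, judging by the reported size of 4, a single "ffl" ligature character (U+FB04) in the third. As a rough sketch of one way around this, assuming Ruby 2.5 or later (String#unicode_normalize arrived in 2.2 and String#grapheme_clusters in 2.5, so neither existed in 2.0), you can normalize or work on grapheme clusters first. The variable names are illustrative.

```ruby
# Assumes Ruby 2.5+; these methods are not available in Ruby 2.0.
decomposed = "noe\u0308l"  # 'e' followed by U+0308 COMBINING DIAERESIS
ligature   = "ba\uFB04e"   # contains U+FB04 LATIN SMALL LIGATURE FFL

# NFC composes 'e' + U+0308 into the single code point U+00EB.
composed = decomposed.unicode_normalize(:nfc)
puts "Raw size: #{decomposed.size}, NFC size: #{composed.size}"

# Reversing grapheme clusters keeps the accent attached to its base letter.
reversed = decomposed.grapheme_clusters.reverse.join
puts "Reversed: #{reversed}"

# NFKC expands the compatibility ligature into plain 'f', 'f', 'l'.
expanded = ligature.unicode_normalize(:nfkc)
puts "Uppercase: #{expanded.upcase}"
```

Whether NFC or NFKC is appropriate depends on the application: NFKC discards compatibility distinctions such as ligatures, which is not always what you want.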
Upvotes: 5
Reputation: 13035
Adding the following line at the top of my file solved it.
# encoding: utf-8
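For context, a small sketch of what the magic comment affects. It has to be the first line of the file (or the second, after a shebang). Since Ruby 2.0 the default source encoding is already UTF-8, so the comment mostly matters on Ruby 1.9. The variable names are only illustrative.

```ruby
# encoding: utf-8

# __ENCODING__ reports the source encoding of the current file.
source_encoding = __ENCODING__
puts "Source encoding: #{source_encoding}"

# A literal containing non-ASCII characters is tagged with the source encoding.
greeting = "noël"
puts "Literal encoding: #{greeting.encoding}"
puts "Valid? #{greeting.valid_encoding?}"
```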
Upvotes: 15
Reputation: 81520
In this answer to a different question, one person said they had trouble with Iconv when handling Unicode data in Ruby 1.9, but I can't vouch for its accuracy.
Upvotes: 0
Reputation: 369458
That's not true. What is true is that Ruby does not support only Unicode; it supports a whole slew of other encodings as well.
This is in contrast to systems such as Java, .NET or Python, which follow the "One Encoding To Rule Them All" model. Ruby has what one of the designers of Ruby's m17n system calls a "CSI" model (Code Set Independent), which means that instead of all strings just having one and the same encoding, every string is tagged with its own encoding.
This has some significant advantages both for ease of use and performance: if your input and output encodings are the same, you never need to transcode. With the One True Encoding model, you need to transcode twice in the worst case, from the input encoding into the internal encoding and then to the output encoding. That worst case unfortunately happens pretty often, because most of these environments chose an internal encoding that nobody actually uses. In Ruby, you need to transcode at most once.
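To make the per-string tagging concrete, here is a minimal sketch using String#encoding and String#encode. The sample text and variable names are just for illustration.

```ruby
# Every string carries its own encoding tag.
utf8 = "h\u00E9llo"                  # a \u escape always yields a UTF-8 string
puts "#{utf8.encoding}: #{utf8.bytesize} bytes"

# Transcoding is an explicit, one-step operation that returns a new string.
latin1 = utf8.encode("ISO-8859-1")
puts "#{latin1.encoding}: #{latin1.bytesize} bytes"

# The same text has different bytes under each encoding.
puts "UTF-8 bytes:   #{utf8.bytes.inspect}"
puts "Latin-1 bytes: #{latin1.bytes.inspect}"

# Transcoding back recovers the original string.
round_trip = latin1.encode("UTF-8")
puts "Round trip equal? #{round_trip == utf8}"
```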
The basic problem with the OTE model is that whatever encoding you choose as the One True Encoding, it will be a completely arbitrary choice, since there simply isn't a single encoding that everybody, or even a majority, uses.
In Java, for example, they chose UCS-2 as the One True Encoding. Then, a couple of years later, it turned out that UCS-2 was actually not enough to encode all characters, so they had to make a backwards-incompatible change to Java, to switch to UTF-16 as the One True Encoding. Except by that time, a significant portion of the world had moved on from UTF-16 to UTF-8. If Java had been invented a couple of years earlier, they would probably have chosen ASCII as the One True Encoding. If it had been invented in another country, it might be Shift-JIS. If it had been invented by another company, it might be EBCDIC. It's really completely arbitrary, and such an important choice shouldn't be.
Upvotes: 14