Reputation: 94304
I am trying to figure out a 'proper' way of sorting UTF-8 strings in Ruby on Rails.
In my application, I have a select box that is populated with countries. As my application is localized, each existing locale has a countries.yml file that relates a country's id to the localized name for that country. I can't sort the strings manually in the yml file because I need the ID to be consistent across all locales.
What I have done is create an ascii_name method which uses the unidecode gem to convert accented and non-Latin characters to their ASCII equivalent (for instance, "Afeganistão" would become "Afeganistao"), and then sort on that:
    require 'unidecode'

    class Country
      # ASCII approximation of the localized name, used only as a sort key
      def ascii_name
        Unidecoder.decode(name).gsub("[?]", "").gsub(/`/, "'").strip
      end
    end

    Country.all.sort_by(&:ascii_name)
However, there are obvious issues with this:
Does anyone know of a better way that I could sort my strings?
Upvotes: 15
Views: 6871
Reputation: 138
http://github.com/grosser/sort_alphabetical
This gem should help. It adds sort_alphabetical and sort_alphabetical_by methods to Enumerable.
Upvotes: 9
Reputation: 19238
Ruby performs string comparisons based on the byte values of characters:
%w[à a e].sort
# => ["a", "e", "à"]
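To see where that order comes from, it can help to look at the bytes being compared. This is a minimal sketch using only core String methods, assuming the source file is UTF-8 (the default in modern Ruby):

```ruby
# Inspect the UTF-8 bytes that String#<=> compares.
words = %w[à a e]

bytes_by_word = words.map { |w| [w, w.bytes] }.to_h
bytes_by_word.each { |word, bytes| puts "#{word}: #{bytes.inspect}" }

# "à" is encoded as two bytes, the first of which (195) is greater than
# the single byte of "e" (101), so it ends up after every ASCII letter.
sorted = words.sort
p sorted
```

Any character outside ASCII has a lead byte above 127 in UTF-8, which is why accented letters land at the end of a plain sort.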
To properly collate strings according to locale, the ffi-icu gem could be used:
require "ffi-icu"
ICU::Collation.collate("it_IT", %w[à a e])
# => ["a", "à", "e"]
ICU::Collation.collate("de", %w[a s x ß])
# => ["a", "s", "ß", "x"]
As an alternative:
collator = ICU::Collation::Collator.new("it_IT")
%w[à a e].sort { |a, b| collator.compare(a, b) }
# => %w[a à e]
Update: To test how strings should collate according to locale rules, the ICU project provides this nice tool.
Upvotes: 11
Reputation: 8615
The only solution I have found thus far is to use ActiveSupport::Inflector.transliterate(string) to replace the unicode characters with ASCII ones and sort:
    Country.all.sort_by do |country|
      ActiveSupport::Inflector.transliterate country.name
    end
Now the only problem is that this treats "ä" as equal to "a" (DIN 5007-1), so I end up with "Ägypten" before "Albanien", while I would expect it to be the other way around. Thankfully, the way the transliteration replaces characters is configurable.
See documentation: http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate
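As a rough illustration of how the replacement table affects the order, here is a stdlib-only sketch that does not use ActiveSupport at all. The table UMLAUT_EXPANSIONS and the helper expanded_sort_key are hypothetical names, and the "umlaut becomes vowel + e" rule (in the style of DIN 5007-2) is an assumption chosen for the example, not something the answer prescribes:

```ruby
# Hypothetical replacement table: umlauts expand to vowel + "e".
UMLAUT_EXPANSIONS = {
  "ä" => "ae", "ö" => "oe", "ü" => "ue",
  "Ä" => "Ae", "Ö" => "Oe", "Ü" => "Ue",
  "ß" => "ss"
}.freeze

UMLAUT_PATTERN = Regexp.union(UMLAUT_EXPANSIONS.keys)

# Hypothetical helper: builds a key that is used only for comparison.
def expanded_sort_key(name)
  name.gsub(UMLAUT_PATTERN, UMLAUT_EXPANSIONS).downcase
end

names = ["Oman", "Österreich", "Norwegen"]
sorted_names = names.sort_by { |n| expanded_sort_key(n) }
p sorted_names
```

With this table "Österreich" is compared as "oesterreich" and sorts before "Oman"; mapping "Ö" to a plain "O" instead would put it after "Oman". The original names are left untouched, since the key is only used for ordering.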
Upvotes: 4
Reputation: 12685
The only working solution I have found so far (at least for Ruby 1.8, because Ruby 1.9 should handle Unicode better) is the Unicode library by Yoshida Masato, which provides a Unicode.strcmp method.
EDIT: Sorry, this solution uses NFD decomposition as well with all its limitations.
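To make the NFD limitation concrete, here is a small sketch. It does not use the Unicode gem; it swaps in core Ruby's String#unicode_normalize (Ruby 2.2 or later), and nfd_sort_key is a hypothetical helper name:

```ruby
# Hypothetical helper: decompose to NFD, then drop combining marks (\p{Mn}).
def nfd_sort_key(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
end

words = %w[zebra Ägypten Albanien]
sorted_words = words.sort_by { |w| nfd_sort_key(w) }
p sorted_words

# Limitation: characters without a canonical decomposition, such as "ß",
# pass through unchanged and still compare by their raw bytes.
eszett_key = nfd_sort_key("Straße")
p eszett_key
```

The result treats "Ä" exactly like "A" for every locale, so it cannot express language-specific rules, which is the kind of limitation the edit above refers to.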
Upvotes: 1
Reputation: 7856
What you are trying to do is a very messy proposition. There is no way to do transparent transliteration on all Unicode characters, because the meaning of digraphs changes from locale to locale, and strings can grow HUGE (if, say, you replace 10 Chinese symbols with their phonetic equivalents). Don't go there.
Why do you want transliterated names in the first place? For URLs? Browsers handle Unicode URLs decently now, so you are inventing a huge problem out of thin air. If you need IDs, preprocess your lists to include a stable numeric ID per country and use that as an identifier. Or save the English name of the country as the identifier (you can download locale-aware ISO country lists for free).
If you truly want good transliteration for Unicode (and this is not what you want in this case), see the IBM ICU libraries; there is a dormant gem for them.
Upvotes: 0
Reputation: 115422
Have you tried accessing the mb_chars method for each of your country strings? mb_chars is a proxy that ActiveSupport adds that defines Unicode-safe versions of all the String methods. If the comparator is Unicode-aware then the sorting should work correctly.
Upvotes: -3
Reputation: 14515
There are a couple of ways to go. You may want to convert the UTF-8 strings to hex strings and then sort them:
    # one fixed-width hex group per code point, so the keys compare consistently
    s.unpack('U*').map { |cp| format('%06x', cp) }.join
or you may use the library iconv. Read up on it and use it as appropriate (from dzone):
    # Add this to environment.rb.
    # Call to_iso on any UTF-8 string to get an ISO-8859-1 string back.
    # Example: "Cédez le passage aux français".to_iso
    class String
      require 'iconv' # this line is not needed in Rails!

      def to_iso
        Iconv.conv('ISO-8859-1', 'utf-8', self)
      end
    end
Upvotes: 1