Daniel Vandersluis

Reputation: 94304

Sorting UTF-8 strings in RoR

I am trying to figure out a 'proper' way of sorting UTF-8 strings in Ruby on Rails.

In my application, I have a select box that is populated with countries. As my application is localized, each existing locale has a countries.yml file that relates a country's id to the localized name for that country. I can't sort the strings manually in the yml file because I need the ID to be consistent across all locales.

What I have done is create an ascii_name method which uses the unidecode gem to convert accented and non-Latin characters to their ASCII equivalents (for instance, "Afeganistão" would become "Afeganistao"), and then sort on that:

require 'unidecode'

class Country
  def ascii_name
    Unidecoder.decode(name).gsub("[?]", "").gsub(/`/, "'").strip
  end
end

Country.all.sort_by(&:ascii_name)

However, there are obvious issues with this:

Does anyone know of a better way that I could sort my strings?

Upvotes: 15

Views: 6871

Answers (7)

İ. Emre Kutlu

Reputation: 138

http://github.com/grosser/sort_alphabetical

This gem should help. It adds sort_alphabetical and sort_alphabetical_by methods to Enumerable.

Upvotes: 9

toro2k

Reputation: 19238

Ruby performs string comparisons based on the byte values of characters:

%w[à a e].sort
# => ["a", "e", "à"]
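To see where that order comes from, it can help to look at the bytes being compared. This is only a small sketch using core String methods; the utf8_bytes helper is a name made up for illustration:

```ruby
# Hypothetical helper: list the UTF-8 bytes that String#<=> ends up comparing.
def utf8_bytes(str)
  str.encode("UTF-8").bytes
end

words = %w[à a e]
words.each { |w| puts "#{w}: #{utf8_bytes(w).inspect}" }

# "à" is encoded as two bytes, the first of which is larger than any
# ASCII byte, so it lands after every unaccented Latin letter.
p words.sort
```

Any accented letter therefore ends up after "z", regardless of locale.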

To properly collate strings according to locale, the ffi-icu gem could be used:

require "ffi-icu"

ICU::Collation.collate("it_IT", %w[à a e])
# => ["a", "à", "e"]

ICU::Collation.collate("de", %w[a s x ß])
# => ["a", "s", "ß", "x"]

As an alternative:

collator = ICU::Collation::Collator.new("it_IT")
%w[à a e].sort { |a, b| collator.compare(a, b) }
# => ["a", "à", "e"]

Update: To test how strings should collate according to locale rules, the ICU project provides this nice tool.

Upvotes: 11

Kostas

Reputation: 8615

The only solution I have found thus far is to use ActiveSupport::Inflector.transliterate(string) to replace the unicode characters with ASCII ones and sort:

Country.all.sort_by do |country|
  ActiveSupport::Inflector.transliterate country.name
end

Now the only problem is that this equates "ä" with "a" (DIN 5007-1), so I end up with "Ägypten" before "Albanien", while I would expect it to be the other way around. Thankfully, the replacement characters used by the transliteration are configurable.

See documentation: http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate
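To illustrate how the choice of replacements changes the resulting order, here is a rough sketch in plain Ruby. It does not use ActiveSupport at all: the two tables and the sort_key helper are made up for this example, and the mappings reflect my understanding of the two DIN 5007 variants, so treat them as assumptions.

```ruby
# Hypothetical replacement tables; neither is part of ActiveSupport.
DIN_5007_1 = { "Ä" => "A", "ä" => "a", "Ö" => "O", "ö" => "o",
               "Ü" => "U", "ü" => "u", "ß" => "ss" }.freeze
DIN_5007_2 = { "Ä" => "Ae", "ä" => "ae", "Ö" => "Oe", "ö" => "oe",
               "Ü" => "Ue", "ü" => "ue", "ß" => "ss" }.freeze

# Build an ASCII sort key by applying one of the tables above.
def sort_key(name, table)
  name.gsub(/[ÄäÖöÜüß]/, table).downcase
end

names = %w[Albanien Ägypten Afghanistan]
p names.sort_by { |n| sort_key(n, DIN_5007_1) }
p names.sort_by { |n| sort_key(n, DIN_5007_2) }
```

With the first table "Ägypten" sorts between "Afghanistan" and "Albanien"; with the second it moves to the front. Neither table puts "Albanien" before "Ägypten", so getting that order would need a different rule than simple replacement.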

Upvotes: 4

skalee

Reputation: 12685

The only working solution I have found so far (at least for Ruby 1.8, since Ruby 1.9 should handle Unicode better) is the Unicode library by Yoshida Masato, which provides a Unicode.strcmp method.

EDIT: Sorry, this solution uses NFD decomposition as well with all its limitations.
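For anyone unsure what NFD decomposition does to a string, here is a small sketch using String#unicode_normalize from the standard library (assuming Ruby 2.2 or later). It does not use the Unicode gem, and strip_marks is a hypothetical helper name:

```ruby
# Hypothetical helper: decompose to NFD, then drop the combining marks.
def strip_marks(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# "ä" decomposes into the base letter "a" plus a combining diaeresis.
p "ä".unicode_normalize(:nfd).codepoints

p strip_marks("Afeganistão")

# "Ł" has no canonical decomposition, so it passes through unchanged.
p strip_marks("Łódź")
```

Two limitations show up here: letters without a decomposition are left as they are, and words that differ only by accents collapse to the same key.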

Upvotes: 1

Julik

Reputation: 7856

What you are trying to do is a very messy proposition. There is no way to do transparent transliteration on all Unicode characters because the meaning of digraphs changes from locale to locale, and strings can grow HUGE (if, say, you replace 10 Chinese symbols with their phonetic equivalents). Don't go there.

Why do you want transliterated names in the first place? For URLs? Browsers handle Unicode URLs decently now, so you are inventing a huge problem out of thin air. If you need IDs, preprocess your lists to include a stable numeric ID per country and use that as an identifier. Or save the English name of the country as the identifier (you can download locale-aware ISO country lists for free).

If you truly want good transliteration for Unicode (and this is not what you want in this case), see the IBM ICU libraries; there is a dormant gem for them.

Upvotes: 0

John Topley

Reputation: 115422

Have you tried accessing the mb_chars method for each of your country strings? mb_chars is a proxy added by ActiveSupport that defines Unicode-safe versions of all the String methods. If the comparator is Unicode-aware, then the sorting should work correctly.

Upvotes: -3

Ryan Oberoi

Reputation: 14515

There are a couple of ways to go. You may want to convert the UTF-8 strings to hex strings and then sort them:

# Encode each code point as six zero-padded hex digits so keys compare position by position
s.unpack('U*').map { |cp| '%06x' % cp }.join
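One thing worth checking before relying on this: a key built from code points orders strings the same way a plain sort does, because UTF-8 byte order follows code point order. A quick sketch (codepoint_key is a made-up helper name):

```ruby
# Hypothetical helper: use the array of Unicode code points as the sort key.
def codepoint_key(str)
  str.codepoints
end

words = %w[à a e]

# Both lines print the same order; the accented letter still comes last.
p words.sort_by { |w| codepoint_key(w) }
p words.sort
```

So this gives a stable, well-defined order, but not the alphabetical order a reader in a given locale would expect.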

or you may use the library iconv. Read up on it and use it as appropriate (from dzone):

#add this to environment.rb
#call to_iso on any UTF8 string to get a ISO string back
#example : "Cédez le passage aux français".to_iso

class String
  require 'iconv' #this line is not needed in rails !
  def to_iso
    Iconv.conv('ISO-8859-1', 'utf-8', self)
  end
end
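Note that Iconv was removed from the standard library in Ruby 2.0. On newer Rubies, String#encode can do the same conversion. The sketch below is a stand-in for the helper above rather than the original dzone code, and it uses a plain method instead of reopening String:

```ruby
# Hypothetical stand-in for to_iso, using String#encode instead of Iconv
# (assumes Ruby 1.9 or later).
def to_iso(str)
  str.encode("ISO-8859-1")
end

iso = to_iso("Cédez le passage aux français")
puts iso.encoding   # the target encoding
puts iso.bytesize   # one byte per character in ISO-8859-1
```

String#encode raises Encoding::UndefinedConversionError for characters that ISO-8859-1 cannot represent, so this only works for Western European text. Sorting the converted strings is still a byte-order sort, so accented letters continue to sort after "z".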

Upvotes: 1
