Sean Mackesey

Reputation: 10939

Replacing accented characters in Ruby 1.9.3, without Rails

I would like to use Ruby 1.9.3 to replace accented UTF-8 characters with their ASCII equivalents. For example,

Acsády  -->  Acsady

The traditional way to do this is with the Iconv library, which is part of Ruby's standard library. You can do something like this:

require 'iconv'
str = 'Acsády'
Iconv.iconv('ascii//TRANSLIT', 'utf8', str).first  # Iconv.iconv returns an array of strings

This yields

Acsa'dy

One then has to delete the apostrophes. While this method still works in Ruby 1.9.3, it produces a warning that Iconv is deprecated and that String#encode should be used instead. However, String#encode does not offer exactly the same functionality. Undefined characters raise an exception by default, but you can handle them either by setting :undef => :replace (which replaces undefined characters with the default '?' character) or by setting the :fallback option to a hash that maps undefined characters in the source encoding to strings in the target encoding. Are there standard :fallback hashes available in the standard library or through some gem, so that I don't have to write my own hash to handle all possible accent marks?
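For reference, the two String#encode options described above can be sketched as follows. The helper names and the one-entry fallback hash are illustrative assumptions, not a standard table; the string is written with a \u escape so the example does not depend on the source file's encoding.

```ruby
# Replace every character that has no ASCII equivalent with '?'.
def to_ascii_replacing(str)
  str.encode('ASCII', undef: :replace)
end

# Map undefined characters through a caller-supplied hash.
# Characters missing from the hash still raise Encoding::UndefinedConversionError.
def to_ascii_with_fallback(str, fallback)
  str.encode('ASCII', fallback: fallback)
end

name = "Acs\u00E1dy" # "Acsády"
puts to_ascii_replacing(name)                             # prints Acs?dy
puts to_ascii_with_fallback(name, { "\u00E1" => 'a' })    # prints Acsady
```

The second form is the one that needs a complete table of accented characters, which is what the question is asking for.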

@raina77ow: Thanks for the response. That's exactly what I was looking for. However, after looking at the thread you linked to, I realized that a better solution may be to simply match unaccented characters to their accented equivalents, the way databases do with a character set collation. Does Ruby have anything equivalent to collations?
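As far as I know, Ruby's core library has no database-style collations. A rough approximation of accent-insensitive matching is sketched below, assuming Ruby 2.2 or later, where String#unicode_normalize is available (so it does not apply to 1.9.3 itself). The helper names accent_fold and same_ignoring_accents? are hypothetical.

```ruby
# Decompose each character into base letter + combining marks (NFD),
# then drop the combining marks (Unicode category Mn).
def accent_fold(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Compare two strings while ignoring accents, roughly like an
# accent-insensitive collation would.
def same_ignoring_accents?(a, b)
  accent_fold(a) == accent_fold(b)
end

puts same_ignoring_accents?("Acs\u00E1dy", 'Acsady') # prints true
```

Letters that have no canonical decomposition, such as Ø or Æ, pass through unchanged, so this is not a full substitute for a real collation.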

Upvotes: 2

Views: 2402

Answers (3)

user1142217

The following code will work for a pretty wide variety of European languages, including Greek, which is hard to get right and is not handled by the previous answers.

# Code generated by code at https://stackoverflow.com/a/68338690/1142217
# See notes there on how to add characters to the list.
def remove_accents(s)
  return s.unicode_normalize(:nfc).tr("ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿΆΈΊΌΐάέήίΰϊϋόύώỏἀἁἂἃἄἅἆἈἉἊἌἍἎἐἑἒἓἔἕἘἙἜἝἠἡἢἣἤἥἦἧἨἩἫἬἭἮἯἰἱἲἳἴἵἶἷἸἹἼἽἾὀὁὂὃὄὅὈὉὊὋὌὍὐὑὓὔὕὖὗὙὝὠὡὢὣὤὥὦὧὨὩὫὬὭὮὯὰὲὴὶὸὺὼᾐᾑᾓᾔᾕᾖᾗᾠᾤᾦᾧᾰᾱᾳᾴᾶᾷᾸᾹῂῃῄῆῇῐῑῒῖῗῘῙῠῡῢῥῦῨῩῬῳῴῶῷῸ","AAAAAAÆCEEEEIIIINOOOOOOUUUUYaaaaaaæceeeeiiiinoooooouuuuyyΑΕΙΟιαεηιυιυουωoαααααααΑΑΑΑΑΑεεεεεεΕΕΕΕηηηηηηηηΗΗΗΗΗΗΗιιιιιιιιΙΙΙΙΙοοοοοοΟΟΟΟΟΟυυυυυυυΥΥωωωωωωωωΩΩΩΩΩΩΩαεηιουωηηηηηηηωωωωααααααΑΑηηηηηιιιιιΙΙυυυρυΥΥΡωωωωΟ")
end

It was generated by the following long, slow program, which shells out to the Linux command-line utility "unicode". If you come across characters that are missing from this list, add them to the longer program, re-run it, and the generated code will handle those characters. For example, I think the list is missing some characters that occur in Czech, such as a c with a wedge (caron) on it, as well as Latin-script vowels with macrons. If the new characters carry accents whose names aren't in names_of_accents below, the program will not strip those accents until you add their names to that list.

$stderr.print %q{
This program generates ruby code to strip accents from characters in Latin and Greek scripts.
Progress will be printed to stderr, the final result to stdout.
}

all_characters = %q{
         ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿ
         ΆΈΊΌΐάέήίϊόύώỏἀἁἃἄἅἈἐἑἒἔἕἘἙἜἡἢἣἤἥἦἨἩἫἬἮἰἱἲἴἵἶἸὀὁὂὃὄὅὊὍὐὑὓὔὕὖὗὝὡὢὣὤὥὧὨὩὰὲὴὶὸὺὼᾐᾗᾳᾴᾶῂῆῇῖῥῦῳῶῷῸᾤᾷἂἷ
         ὌᾖὉἧἷἂῃἌὬὉἷὉἷῃὦἌἠἳᾔἉᾦἠἳᾔὠᾓὫἝὈἭἼϋὯῴἆῒῄΰῢἆὙὮᾧὮᾕὋἍἹῬἽᾕἓἯἾᾠἎῗἾῗἯἊὭἍᾑᾰῐῠᾱῑῡᾸῘῨᾹῙῩ
}.gsub(/\s/,'')
# The first line is a list of accented Latin characters. The second and third lines are polytonic Greek.
# The Greek on this list includes every character occurring in the Project Gutenberg editions of Homer, except for some that seem to be
# mistakes (smooth rho, phi and theta in symbol font). Duplications and characters out of order in this list have no effect at run time.
# Also includes vowels with macron and vrachy, which occur in Project Perseus texts sometimes.

# The following code shells out to the Linux command-line utility called "unicode", which is installed as the Debian package
# of the same name.
# Documentation: https://github.com/garabik/unicode/blob/master/README

names_of_accents = %q{
  acute grave circ and rough smooth ypogegrammeni diar with macron vrachy tilde ring above diaeresis cedilla stroke
  tonos dialytika hook perispomeni dasia varia psili oxia
}.split(/\s+/).select { |x| x.length>0}.sort.uniq
# The longer "circumflex" will first be shortened to "circ" in later code.

def char_to_name(c)
  return `unicode --string "#{c}" --format "{name}"`.downcase
end

def name_to_char(name)
  list = `unicode "#{name}" --format "{pchar}" --max 0` # returns a string of possibilities, not just exact matches
  # Usually, but not always, the unaccented character is the first on the list.
  list.chars.each { |c|
    return c if char_to_name(c) == name
  }
  raise "Unable to convert name #{name} to a character, list=#{list}."
end

regex = "( (#{names_of_accents.join("|")}))+"
from = ''
to = ''
all_characters.chars.sort.uniq.each { |c|
  name = char_to_name(c).gsub(/circumflex/,'circ')
  name.gsub!(/#{regex}/,'')
  without_accent = name_to_char(name)
  from = from+c.unicode_normalize(:nfc)
  to = to+without_accent.unicode_normalize(:nfc)
  $stderr.print c
}
$stderr.print "\n"
print %Q{
# Code generated by code at https://stackoverflow.com/a/68338690/1142217
# See notes there on how to add characters to the list.
def remove_accents(s)
  return s.unicode_normalize(:nfc).tr("#{from}","#{to}")
end
}
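To make the name-stripping step concrete, here is a small sketch that does not call the "unicode" utility. It applies the same kind of regex to two official Unicode character names, using a shortened, illustrative subset of names_of_accents; the helper name strip_accent_words is hypothetical.

```ruby
# Remove the trailing accent description from a lowercase Unicode character name.
# "circumflex" is shortened to "circ" first, as in the generator above;
# otherwise " circ" would match inside " circumflex" and leave "umflex" behind.
def strip_accent_words(name, accent_words)
  pattern = /( (#{accent_words.join('|')}))+/
  name.gsub(/circumflex/, 'circ').gsub(pattern, '')
end

words = %w[acute and circ oxia psili with]
puts strip_accent_words('greek small letter alpha with psili and oxia', words)
# prints: greek small letter alpha
puts strip_accent_words('latin small letter e with circumflex', words)
# prints: latin small letter e
```

The stripped name is then what name_to_char looks up to find the unaccented character.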

Upvotes: 0

Joe Lalgee

Reputation: 972

I use this:

def convert_to_ascii(s)
  undefined = ''
  fallback = { 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A',
               'Å' => 'A', 'Æ' => 'AE', 'Ç' => 'C', 'È' => 'E', 'É' => 'E',
               'Ê' => 'E', 'Ë' => 'E', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I',
               'Ï' => 'I', 'Ñ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O',
               'Õ' => 'O', 'Ö' => 'O', 'Ø' => 'O', 'Ù' => 'U', 'Ú' => 'U',
               'Û' => 'U', 'Ü' => 'U', 'Ý' => 'Y', 'à' => 'a', 'á' => 'a',
               'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'æ' => 'ae',
               'ç' => 'c', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
               'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ñ' => 'n',
               'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
               'ø' => 'o', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
               'ý' => 'y', 'ÿ' => 'y' }
  s.encode('ASCII',
           fallback: lambda { |c| fallback.fetch(c, undefined) })
end

You can check for other symbols you might want to provide a fallback for here.

Upvotes: 3

raina77ow

Reputation: 106385

I suppose what you're looking for is similar to this question. If so, you can use one of the Ruby ports of Text::Unidecode, such as this gem (or this fork of it, which looks ready to be used in 1.9).

Upvotes: 0
