Gus Shortz
Gus Shortz

Reputation: 1771

Ruby method to remove accents from UTF-8 international characters

I am trying to create a 'normalized' copy of a string, to help reduce duplicate names in a database. The names contain many international characters (ie. accented letters), and I want to create a copy with the accents removed.

I did come across the method below, but cannot get it to work. I can't seem to find what the Unicode Hacks plugin is.

  # Utility method that retursn an ASCIIfied, downcased, and sanitized string.
  # It relies on the Unicode Hacks plugin by means of String#chars. We assume
  # $KCODE is 'u' in environment.rb. By now we support a wide range of latin
  # accented letters, based on the Unicode Character Palette bundled inMacs.
  def self.normalize(str)
     n = str.chars.downcase.strip.to_s
     n.gsub!(/[à áâãäåÄÄ?]/u,    'a')
     n.gsub!(/æ/u,                  'ae')
     n.gsub!(/[ÄÄ?]/u,                'd')
     n.gsub!(/[çÄ?ÄÄ?Ä?]/u,          'c')
     n.gsub!(/[èéêëÄ?Ä?Ä?Ä?Ä?]/u, 'e')
     n.gsub!(/Æ?/u,                   'f')
     n.gsub!(/[ÄÄ?Ä¡Ä£]/u,            'g')
     n.gsub!(/[ĥħ]/,                'h')
     n.gsub!(/[ììíîïīĩĭ]/u,     'i')
     n.gsub!(/[įıijĵ]/u,           'j')
     n.gsub!(/[ķĸ]/u,               'k')
     n.gsub!(/[Å?ľĺļÅ?]/u,         'l')
     n.gsub!(/[ñÅ?Å?Å?Å?Å?]/u,       'n')
     n.gsub!(/[òóôõöøÅÅ?ÅÅ]/u,  'o')
     n.gsub!(/Å?/u,                  'oe')
     n.gsub!(/Ä?/u,                   'q')
     n.gsub!(/[Å?Å?Å?]/u,             'r')
     n.gsub!(/[Å?Å¡Å?ÅÈ?]/u,          's')
     n.gsub!(/[ťţŧÈ?]/u,           't')
     n.gsub!(/[ùúûüūůűŭũų]/u,'u')
     n.gsub!(/ŵ/u,                   'w')
     n.gsub!(/[ýÿŷ]/u,             'y')
     n.gsub!(/[žżź]/u,             'z')
     n.gsub!(/\s+/,                   ' ')
     n.gsub!(/[^\sa-z0-9_-]/,          '')
     n
  end

Do I need to 'require' a particular library/gem? Or maybe someone could recommend another way to go about this.

I am not using Rails, nor do I plan on doing so.

Upvotes: 88

Views: 43317

Answers (6)

Gustavo Marques
Gustavo Marques

Reputation: 21

You can remove the accents normalizing the string and keeping only the base characters, like this:

def normalize_string(string)
  string.unicode_normalize(:nfkd).encode('ASCII', replace: '')
end

You can found more details about the normalization on this link: https://www.unicode.org/reports/tr15/

Basically, .unicode_normalize(:nfkd) will normalize the string by decomposing complex characters like: á ã à in base + diacritic form. This will separate accents and non-ASCII characters from the base character of the string. After that, with .encode('ASCII', replace: '') you transform the string into ASCII, replacing everything that is not ASCII in it with ''.

Upvotes: 2

devnoname120
devnoname120

Reputation: 109

Solution:

DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

def removeaccents(str)
  str
    .unicode_normalize(:nfd)
    .tr(DIACRITICS, '')
    .unicode_normalize(:nfc)
end

Example (before/after):

ÀÁÂÃÄÅàáâãäåĀāĂ㥹ạảÇçĆćĈĉĊċČčĎďÈÉÊËèéêểệễëĒēĔĕĖėĘęĚěẹĜĝĞğĠġĢģĤĥÌÍÎÏìíîïĨĩĪīĬĭĮįİıịỉĴĵĶķĸĹĺĻļĽľÑñŃńŅņŇňÒÓÔÕÖòóôộỗổõöŌōŎŏŐőọỏơởợỡŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųụưủửữựŴŵÝýÿŶŷŸŹźŻżŽžứừửựữốồộỗổờóợỏỡếềễểệẩẫấầậỳỹýỷỵặẵẳằắ
AAAAAAaaaaaaAaAaAaaaCcCcCcCcCcDdEEEEeeeeeeeEeEeEeEeEeeGgGgGgGgHhIIIIiiiiIiIiIiIiIıiiJjKkĸLlLlLlNnNnNnNnOOOOOooooooooOoOoOoooooooRrRrRrSsSsSsSsſTtTtUUUUuuuuUuUuUuUuUuUuuuuuuuWwYyyYyYZzZzZzuuuuuooooooooooeeeeeaaaaayyyyyaaaaa

Explanations:

  • Decompose the single-codepoint characters into their constituting codepoints characters (where applicable).
  • Remove the diacritical mark codepoints (Unicode 15.0.0 reference) found in the following blocks:
    • Combining Diacritical Marks Supplement (U+1DC0 → U+1DFF)
    • Combining Diacritical Marks (U+0300 → U+036F)
    • Combining Half Marks (U+FE20 → U+FE2F)
  • Recompose the characters.

Caveats:

  • While these diacritics are predominantly used for text, some of them can also be used with symbols. These symbols will see these diacritics removed when they shouldn't be.
  • Obscure codepoints such as subtending marks are not removed. Despite their naming, they are not treated as combining marks by the unicode reference but as format characters. An example is the arabic hamza above ◌ٔ (U+0654) that probably doesn't even get properly displayed in your browser.
  • Not a caveat per se but worth nothing: diacritics that are preceded by a space or a breaking space are also removed. They are displayed as standalone characters in some text-rendering software so it may be undesired.

Upvotes: 9

Navid Khan
Navid Khan

Reputation: 1169

If you are using rails:

"L'Oréal".parameterize(separator: ' ')

Upvotes: 6

AlexGuti
AlexGuti

Reputation: 3243

The parameterize method could be a nice and simple solution to remove special characters in order to use the string as human readable identifier:

> "Françoise Isaïe".parameterize
=> "francoise-isaie"

Upvotes: 39

user2398029
user2398029

Reputation: 6937

I generally use I18n to handle this:

1.9.3p392 :001 > require "i18n"
 => true
1.9.3p392 :002 > I18n.transliterate("Hé les mecs!")
 => "He les mecs!"

Upvotes: 258

Gus Shortz
Gus Shortz

Reputation: 1771

So far the following is the only way I've been able to accomplish what I need:

str.tr(
"ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
"AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")

But using this feels very 'hackish', and I would love to find a better way.

Upvotes: 20

Related Questions