Reputation: 1771
I am trying to create a 'normalized' copy of a string, to help reduce duplicate names in a database. The names contain many international characters (ie. accented letters), and I want to create a copy with the accents removed.
I did come across the method below, but cannot get it to work. I can't seem to find what the Unicode Hacks plugin is.
# Utility method that retursn an ASCIIfied, downcased, and sanitized string.
# It relies on the Unicode Hacks plugin by means of String#chars. We assume
# $KCODE is 'u' in environment.rb. By now we support a wide range of latin
# accented letters, based on the Unicode Character Palette bundled inMacs.
def self.normalize(str)
n = str.chars.downcase.strip.to_s
n.gsub!(/[à áâãäåÄÄ?]/u, 'a')
n.gsub!(/æ/u, 'ae')
n.gsub!(/[ÄÄ?]/u, 'd')
n.gsub!(/[çÄ?ÄÄ?Ä?]/u, 'c')
n.gsub!(/[èéêëÄ?Ä?Ä?Ä?Ä?]/u, 'e')
n.gsub!(/Æ?/u, 'f')
n.gsub!(/[ÄÄ?Ä¡Ä£]/u, 'g')
n.gsub!(/[ĥħ]/, 'h')
n.gsub!(/[ììÃîïīĩÄ]/u, 'i')
n.gsub!(/[įıijĵ]/u, 'j')
n.gsub!(/[ķĸ]/u, 'k')
n.gsub!(/[Å?ľĺļÅ?]/u, 'l')
n.gsub!(/[ñÅ?Å?Å?Å?Å?]/u, 'n')
n.gsub!(/[òóôõöøÅÅ?ÅÅ]/u, 'o')
n.gsub!(/Å?/u, 'oe')
n.gsub!(/Ä?/u, 'q')
n.gsub!(/[Å?Å?Å?]/u, 'r')
n.gsub!(/[Å?Å¡Å?ÅÈ?]/u, 's')
n.gsub!(/[ťţŧÈ?]/u, 't')
n.gsub!(/[ùúûüūůűÅũų]/u,'u')
n.gsub!(/ŵ/u, 'w')
n.gsub!(/[ýÿŷ]/u, 'y')
n.gsub!(/[žżź]/u, 'z')
n.gsub!(/\s+/, ' ')
n.gsub!(/[^\sa-z0-9_-]/, '')
n
end
Do I need to 'require' a particular library/gem? Or maybe someone could recommend another way to go about this.
I am not using Rails, nor do I plan on doing so.
Upvotes: 88
Views: 43317
Reputation: 21
You can remove the accents normalizing the string and keeping only the base characters, like this:
def normalize_string(string)
string.unicode_normalize(:nfkd).encode('ASCII', replace: '')
end
You can found more details about the normalization on this link: https://www.unicode.org/reports/tr15/
Basically, .unicode_normalize(:nfkd)
will normalize the string by decomposing complex characters like: á ã à in base + diacritic form.
This will separate accents and non-ASCII characters from the base character of the string. After that, with .encode('ASCII', replace: '')
you transform the string into ASCII, replacing everything that is not ASCII in it with ''
.
Upvotes: 2
Reputation: 109
Solution:
DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')
def removeaccents(str)
str
.unicode_normalize(:nfd)
.tr(DIACRITICS, '')
.unicode_normalize(:nfc)
end
Example (before/after):
ÀÁÂÃÄÅàáâãäåĀāĂ㥹ạảÇçĆćĈĉĊċČčĎďÈÉÊËèéêểệễëĒēĔĕĖėĘęĚěẹĜĝĞğĠġĢģĤĥÌÍÎÏìíîïĨĩĪīĬĭĮįİıịỉĴĵĶķĸĹĺĻļĽľÑñŃńŅņŇňÒÓÔÕÖòóôộỗổõöŌōŎŏŐőọỏơởợỡŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųụưủửữựŴŵÝýÿŶŷŸŹźŻżŽžứừửựữốồộỗổờóợỏỡếềễểệẩẫấầậỳỹýỷỵặẵẳằắ
AAAAAAaaaaaaAaAaAaaaCcCcCcCcCcDdEEEEeeeeeeeEeEeEeEeEeeGgGgGgGgHhIIIIiiiiIiIiIiIiIıiiJjKkĸLlLlLlNnNnNnNnOOOOOooooooooOoOoOoooooooRrRrRrSsSsSsSsſTtTtUUUUuuuuUuUuUuUuUuUuuuuuuuWwYyyYyYZzZzZzuuuuuooooooooooeeeeeaaaaayyyyyaaaaa
Explanations:
Combining Diacritical Marks Supplement
(U+1DC0 → U+1DFF)Combining Diacritical Marks
(U+0300 → U+036F)Combining Half Marks
(U+FE20 → U+FE2F)Caveats:
◌ٔ
(U+0654) that probably doesn't even get properly displayed in your browser.Upvotes: 9
Reputation: 3243
The parameterize method could be a nice and simple solution to remove special characters in order to use the string as human readable identifier:
> "Françoise Isaïe".parameterize
=> "francoise-isaie"
Upvotes: 39
Reputation: 6937
I generally use I18n to handle this:
1.9.3p392 :001 > require "i18n"
=> true
1.9.3p392 :002 > I18n.transliterate("Hé les mecs!")
=> "He les mecs!"
Upvotes: 258
Reputation: 1771
So far the following is the only way I've been able to accomplish what I need:
str.tr(
"ÀÁÂÃÄÅàáâãäåĀāĂ㥹ÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňʼnŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
"AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")
But using this feels very 'hackish', and I would love to find a better way.
Upvotes: 20