Replace accented character in Ruby

Question

I had a script or Ruby, and when I try to replace accented charcater gsub doesn't work with me :

my floder name is "Réé Ab"

name = File.basename(Dir.getwd)
name.downcase!
name.gsub!(/[àáâãäå]/,'a')
name.gsub!(/æ/,'ae')
name.gsub!(/ç/, 'c')
name.gsub!(/[èéêë]/,'e')
name.gsub!(/[ìíîï]/,'i')
name.gsub!(/[ýÿ]/,'y')
name.gsub!(/[òóôõö]/,'o')
name.gsub!(/[ùúûü]/,'u')

the output "réé ab", why the accented characters stil there ?

cremno · Accepted Answer

The é in your name are actually two different Unicode codepoints: U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).

p 'é'.each_codepoint.map{|e|"U+#{e.to_s(16).upcase.rjust(4,'0')}"} * ' ' # => "U+0065 U+0301"

However the é in your regex is only one: U+00E9 (LATIN SMALL LETTER E WITH ACUTE). Wikipedia has an article about Unicode equivalence. The official Unicode FAQ also contains explanations and information about this topic.

How to normalize Unicode strings in Ruby depends on its version. It has Unicode normalization support since 2.2. You don't have to require a library or install a gem like in previous versions (here's an overview). To normalize name simpy call String#unicode_normalize with :nfc or :nfkc as argument to compose é (U+0065 and U+0301) to é (U+00E9):

name = File.basename(Dir.getwd)
name.unicode_normalize! # thankfully :nfc is the default
name.downcase!

Of course, you could also use decomposed characters in your regular expressions but that probably won't work on other file systems and then you would also have to normalize: NFD or NFKD to decompose.

I also like to or even should point out that converting é to e or ü to u causes information loss. For example, the German word Müll (trash) would be converted to Mull (mull / forest humus).

Replace accented character in Ruby

Answers (1)

Related Questions