Reputation: 1009
I have a Ruby script, and when I try to replace accented characters, gsub
doesn't work for me.
My folder name is "Réé Ab":
name = File.basename(Dir.getwd)
name.downcase!
name.gsub!(/[àáâãäå]/,'a')
name.gsub!(/æ/,'ae')
name.gsub!(/ç/, 'c')
name.gsub!(/[èéêë]/,'e')
name.gsub!(/[ìíîï]/,'i')
name.gsub!(/[ýÿ]/,'y')
name.gsub!(/[òóôõö]/,'o')
name.gsub!(/[ùúûü]/,'u')
The output is "réé ab".
Why are the accented characters still there?
Upvotes: 2
Views: 1380
Reputation: 4927
The é in your name is actually two different Unicode codepoints: U+0065 (LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).
p 'é'.each_codepoint.map{|e|"U+#{e.to_s(16).upcase.rjust(4,'0')}"} * ' ' # => "U+0065 U+0301"
However, the é in your regex is only one codepoint: U+00E9 (LATIN SMALL LETTER E WITH ACUTE). Because the two codepoint sequences differ, the character class in your gsub! call never matches. Wikipedia has an article about Unicode equivalence. The official Unicode FAQ also contains explanations and information about this topic.
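To make the mismatch concrete, here is a small sketch that builds both forms from explicit escapes, so the result does not depend on how an editor stores the literal. The variable names are only illustrative.

```ruby
# Both strings render as "é" but are built from different codepoints.
decomposed  = "e\u0301" # U+0065 + U+0301, the form described for the folder name
precomposed = "\u00E9"  # U+00E9, the form described for the regex

same     = decomposed == precomposed               # different codepoint sequences
lengths  = [decomposed.length, precomposed.length] # length counts codepoints
replaced = decomposed.gsub(/\u00E9/, 'e')          # pattern only knows U+00E9

p same                   # => false
p lengths                # => [2, 1]
p replaced == decomposed # => true, nothing was replaced
```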
How to normalize Unicode strings in Ruby depends on the version. Ruby has had built-in Unicode normalization support since 2.2, so you don't have to require a library or install a gem as in previous versions (here's an overview). To normalize name, simply call String#unicode_normalize with :nfc or :nfkc as the argument to compose é (U+0065 and U+0301) into é (U+00E9):
name = File.basename(Dir.getwd)
name.unicode_normalize! # thankfully :nfc is the default
name.downcase!
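Combined with the replacements from the question, the whole thing might look like the sketch below. The helper name ascii_fold is made up for illustration, the input is written with escapes to guarantee the decomposed form, and Ruby 2.2 or later is assumed.

```ruby
# Hypothetical helper; `ascii_fold` is not a Ruby built-in.
def ascii_fold(name)
  # Compose "e" + U+0301 into U+00E9 first so the character classes can match.
  folded = name.unicode_normalize(:nfc).downcase
  folded.gsub(/[àáâãäå]/, 'a')
        .gsub(/æ/, 'ae')
        .gsub(/ç/, 'c')
        .gsub(/[èéêë]/, 'e')
        .gsub(/[ìíîï]/, 'i')
        .gsub(/[ýÿ]/, 'y')
        .gsub(/[òóôõö]/, 'o')
        .gsub(/[ùúûü]/, 'u')
end

input = "Re\u0301e\u0301 Ab" # decomposed form of the folder name
puts ascii_fold(input)       # prints "ree ab"
```

In the original script, File.basename(Dir.getwd) would take the place of input.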
Of course, you could also use decomposed characters in your regular expressions, but that probably won't work on other file systems, which may hand you precomposed names. You would then still have to normalize, this time with NFD or NFKD to decompose.
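As a rough sketch of the decomposed route: instead of listing decomposed characters one by one, this variant normalizes to NFD and then deletes every combining mark with the \p{Mn} property. That is a different technique from the gsub list in the question, and strip_marks is a hypothetical name. Letters with no decomposition, such as æ, pass through unchanged and would still need their own replacement.

```ruby
# Hypothetical helper; `strip_marks` is not a Ruby built-in.
def strip_marks(text)
  # NFD splits U+00E9 into "e" + U+0301; \p{Mn} matches nonspacing marks.
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

mixed = "R\u00E9e\u0301 Ab"   # one precomposed and one decomposed é
puts strip_marks(mixed).downcase # prints "ree ab"
```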
I should also point out that converting é to e or ü to u causes information loss. For example, the German word Müll (trash) would be converted to Mull (mull / forest humus).
Upvotes: 6