Reputation: 16751
So I have a weird issue. I have files on S3 and I want to match words against the filename, using a regular expression. The regular expression should clearly match, but it does not:
irb(main):048:0> p.original_filename
=> "Küche.png"
irb(main):049:0> "Küche.png"
=> "Küche.png"
irb(main):013:0> reg = /(\A|\W|\d|_)#{Regexp.quote("Küche")}(\W|\z|\d|_)/i
=> /(\A|\W|\d|_)Küche(\W|\z|\d|_)/i
irb(main):014:0> reg.match?("Küche.png")
=> true
irb(main):015:0> reg.match?(p.original_filename)
=> false
irb(main):050:0> p.original_filename == "Küche.png"
=> false
So I inspected further, already assuming that there is an encoding issue:
irb(main):017:0> p.original_filename.encoding
=> #<Encoding:UTF-8>
irb(main):018:0> "Küche.png".encoding
=> #<Encoding:UTF-8>
That is weird. But let's see what characters and bytes are behind the representation:
irb(main):025:0> "Küche.png".chars
=> ["K", "ü", "c", "h", "e", ".", "p", "n", "g"]
irb(main):026:0> p.original_filename.chars
=> ["K", "u", "̈", "c", "h", "e", ".", "p", "n", "g"]
irb(main):032:0> p.original_filename.bytes
=> [75, 117, 204, 136, 99, 104, 101, 46, 112, 110, 103]
irb(main):033:0> "Küche.png".bytes
=> [75, 195, 188, 99, 104, 101, 46, 112, 110, 103]
So here is the issue. My question: How can I normalize the filename so that it matches the regexp /(\A|\W|\d|_)#{Regexp.quote("Küche")}(\W|\z|\d|_)/i? I tried force_encoding and encode without success, since the bytes are different.
Note: It is not an option to only use ASCII characters for the filenames. This has to work with umlauts.
Upvotes: 3
Views: 384
Reputation: 114188
In Unicode, certain characters can be represented as a base character and a combining character:
"\u0075\u0308" # LATIN SMALL LETTER U (U+0075) + COMBINING DIAERESIS (U+0308)
#=> "ü"
or as a single precomposed character:
"\u00fc" # LATIN SMALL LETTER U WITH DIAERESIS (U+00FC)
#=> "ü"
If you type the character on your keyboard, it will usually generate the latter.
The process of converting a given string into one of these representations is called normalization.
In Ruby, there's String#unicode_normalize, which defaults to the latter, precomposed format (normalization form NFC):
a = "\u0075\u0308"
b = "\u00fc"
a.codepoints #=> [117, 776]
b.codepoints #=> [252]
a.unicode_normalize.codepoints #=> [252]
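To tie this back to the filename from the question, here is a small sketch using the two spellings of "Küche.png". The variable names are made up for illustration; the decomposed string is built from the bytes shown in the question (117, 204, 136 is "u" followed by U+0308).

```ruby
# Two spellings of the same visible text (variable names are illustrative only).
decomposed  = "Ku\u0308che.png"  # "u" + COMBINING DIAERESIS, like p.original_filename
precomposed = "K\u00fcche.png"   # single precomposed "ü", like the typed literal

# Compared code point by code point, the strings differ.
same_raw = (decomposed == precomposed)

# Normalizing to NFC (the default form) composes "u" + U+0308 into U+00FC.
same_nfc = (decomposed.unicode_normalize == precomposed)

# Normalizing to NFD goes the other way and decomposes U+00FC.
same_nfd = (precomposed.unicode_normalize(:nfd) == decomposed)

# unicode_normalized? reports whether a string is already in the given form (NFC by default).
already_nfc = decomposed.unicode_normalized?

p same_raw, same_nfc, same_nfd, already_nfc
```

This is also why force_encoding and encode did not help: both strings are already valid UTF-8, they just use different code points.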
Applied to your example, you'd use:
reg.match?(p.original_filename.unicode_normalize)
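If the search word can also come from an external source, it may be safer to normalize both sides. The following is only a sketch under that assumption: word_regexp and filename_matches? are hypothetical helper names, and the pattern is the one from the question.

```ruby
# Hypothetical helper: builds the question's pattern from an NFC-normalized word.
def word_regexp(word)
  /(\A|\W|\d|_)#{Regexp.quote(word.unicode_normalize(:nfc))}(\W|\z|\d|_)/i
end

# Hypothetical helper: normalizes the filename to NFC before matching.
def filename_matches?(filename, word)
  word_regexp(word).match?(filename.unicode_normalize(:nfc))
end

# Decomposed filename (as stored on S3 in the question), precomposed word.
decomposed_name = filename_matches?("Ku\u0308che.png", "K\u00fcche")

# The reverse: precomposed filename, decomposed word.
decomposed_word = filename_matches?("K\u00fcche.png", "Ku\u0308che")

# A plain "u" without the diaeresis is a different word and should not match.
plain_u = filename_matches?("Kuche.png", "K\u00fcche")

p decomposed_name, decomposed_word, plain_u
```

Normalizing once when the filename is stored, rather than on every match, would be another option if you control that step.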
Upvotes: 6