Ole Spaarmann

Reputation: 16751

How to match umlauts (äöü) in regular expressions when they have different byte representations?

So I have a weird issue. I have files on S3 and I want to match words against the filename using a regular expression. The regular expression should clearly match, but it does not:

irb(main):048:0> p.original_filename
=> "Küche.png"
irb(main):049:0> "Küche.png"
=> "Küche.png"
irb(main):013:0> reg = /(\A|\W|\d|_)#{Regexp.quote("Küche")}(\W|\z|\d|_)/i
=> /(\A|\W|\d|_)Küche(\W|\z|\d|_)/i
irb(main):014:0> reg.match?("Küche.png")
=> true
irb(main):015:0> reg.match?(p.original_filename)
=> false
irb(main):050:0> p.original_filename == "Küche.png"
=> false

So I inspected further, already suspecting an encoding issue:

irb(main):017:0> p.original_filename.encoding
=> #<Encoding:UTF-8>
irb(main):018:0> "Küche.png".encoding
=> #<Encoding:UTF-8>

That is weird. But let's see what characters and bytes are behind the representation:

irb(main):025:0> "Küche.png".chars
=> ["K", "ü", "c", "h", "e", ".", "p", "n", "g"]
irb(main):026:0> p.original_filename.chars
=> ["K", "u", "̈", "c", "h", "e", ".", "p", "n", "g"]
irb(main):032:0> p.original_filename.bytes
=> [75, 117, 204, 136, 99, 104, 101, 46, 112, 110, 103]
irb(main):033:0> "Küche.png".bytes
=> [75, 195, 188, 99, 104, 101, 46, 112, 110, 103]

So here is the issue: the two strings look identical but consist of different characters and bytes. My question: How can I normalize the filename so that it matches the regexp /(\A|\W|\d|_)#{Regexp.quote("Küche")}(\W|\z|\d|_)/i? I tried force_encoding and encode without success, since the bytes are different.

Note: It is not an option to only use ASCII characters for the filenames. This has to work with umlauts.

Upvotes: 3

Views: 384

Answers (1)

Stefan

Reputation: 114188

In Unicode, certain characters can be represented as a base character and a combining character:

"\u0075\u0308" # LATIN SMALL LETTER U (U+0075) + COMBINING DIAERESIS (U+0308)
#=> "ü"

or as a single precomposed character:

"\u00fc"       # LATIN SMALL LETTER U WITH DIAERESIS (U+00FC)
#=> "ü"

Typing such a character on a keyboard usually produces the latter.

The process of converting a given string into one of these representations is called normalization.

In Ruby, there's String#unicode_normalize, which defaults to the precomposed format (normalization form NFC):

a = "\u0075\u0308"
b = "\u00fc"

a.codepoints                   #=> [117, 776]
b.codepoints                   #=> [252]
a.unicode_normalize.codepoints #=> [252]
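
To make the two forms explicit, here is a small sketch that passes the form as an argument (:nfc for composed, :nfd for decomposed) and uses String#unicode_normalized? to check a string. The constant names are only for illustration:

```ruby
# Illustrative constants (names are not from the question).
COMPOSED   = "\u00fc"        # LATIN SMALL LETTER U WITH DIAERESIS
DECOMPOSED = "\u0075\u0308"  # u + COMBINING DIAERESIS

# The strings render the same but are not equal.
puts COMPOSED == DECOMPOSED                                 # false

# :nfc composes, :nfd decomposes.
puts DECOMPOSED.unicode_normalize(:nfc).codepoints.inspect  # [252]
puts COMPOSED.unicode_normalize(:nfd).codepoints.inspect    # [117, 776]

# unicode_normalized? checks against NFC unless another form is given.
puts DECOMPOSED.unicode_normalized?                         # false

# After normalizing to the same form, the strings compare equal.
puts DECOMPOSED.unicode_normalize(:nfc) == COMPOSED         # true
```

Which form you pick matters less than using the same one on both sides of a comparison.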

Applied to your example, you'd use:

reg.match?(p.original_filename.unicode_normalize)
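
If the search word can also arrive in decomposed form (for example, from user input or another filename), it may be safer to normalize both sides. A minimal sketch, assuming the regexp from the question; the helper names word_regexp and filename_matches? are hypothetical:

```ruby
# Hypothetical helper: builds the question's regexp from an
# NFC-normalized search word.
def word_regexp(word)
  quoted = Regexp.quote(word.unicode_normalize(:nfc))
  /(\A|\W|\d|_)#{quoted}(\W|\z|\d|_)/i
end

# Hypothetical helper: normalizes the filename to NFC before matching,
# so composed and decomposed umlauts are treated alike.
def filename_matches?(filename, word)
  word_regexp(word).match?(filename.unicode_normalize(:nfc))
end

decomposed_name = "Ku\u0308che.png"  # same characters as p.original_filename in the question
puts filename_matches?(decomposed_name, "K\u00fcche")  # true
puts filename_matches?("Kuchen.png", "K\u00fcche")     # false
```

Note that unicode_normalize only works on strings in a Unicode encoding such as UTF-8, which is the case for the filename in the question.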

Upvotes: 6
