Reputation: 11
I made a strange discovery. If I execute the following SQL-Command:
SELECT 'Konzessionäre' REGEXP '[[:<:]]Konzession[[:>:]]'
it gives me the result - as expected - 0
But if I do the same together with the BINARY operator:
SELECT BINARY 'Konzessionäre' REGEXP '[[:<:]]Konzession[[:>:]]'
the result ist 1, so I think there is a MySQL problem with the regexp word boundary detection and german umlauts (like here the "ä") in conjunction with the BINARY Operator. As another example I can do this query:
SELECT BINARY 'Konzessionsäre' REGEXP '[[:<:]]Konzession[[:>:]]'
So here the result is 0 - as I would expect. So how can I solve this? Is this probably a bug in MySQL?
Thanks
Upvotes: 1
Views: 574
Reputation: 142296
MySQL's REGEXP works with bytes, not characters. So, in CHARACTER SET utf8
, ä
is 2 bytes. It is unclear what the definition of "word boundary" in such a situation.
Recent versions of MariaDB have a better regexp engine.
Upvotes: 1
Reputation: 562378
By casting your string as BINARY
you have stripped its associated character set property. So it's unclear how the word-boundary pattern should match. I'd guess it matches only ASCII values A-Z, a-z, 0-9, and also _.
When casting the string as BINARY
, MySQL knows nothing about any other higher character values that also should be considered alphanumeric, because which characters should be alphanumeric depends on the character set.
I guess you are using BINARY
to make this a case-sensitive regular expression search. Apparently, this has the unintended consequence of spoiling the word-boundary pattern-match.
You should not use BINARY
in this comparison. You could do a secondary comparison to check for case-sensitive matching, but not with word boundaries.
SELECT (BINARY 'Konzessionäre' REGEXP 'Konzession') AND ('Konzessionäre' REGEXP '[[:<:]]Konzession[[:>:]]')
Upvotes: 1