Is there a way to use preg_match (e.g. perhaps via a flag) to do diacritic-insensitive matches?
For example, say I'd like it to match:
I know I can do a regex like this: caf[eé]. This regex will work as long as I don't come across any other diacritic variations of e, like: ê è ë ē ĕ ě ẽ ė ẹ ę ẻ.
Of course, I could just list all of those diacritic variations in my regex, such as caf[eêéèëēĕěẽėẹęẻ]. As long as I don't miss anything, I'll be good. But I would need to do this for every letter of the alphabet, which is tedious and error-prone.
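For reference, the explicit character class described above could be sketched like this. The sample subjects are made up for illustration, and the /u modifier is an assumption on my part (it tells PCRE to treat the pattern and subject as UTF-8, so each accented letter is handled as one character rather than as separate bytes):

```php
<?php
// Explicit character class listing the variations of "e" from the question.
// The /u modifier makes PCRE treat pattern and subject as UTF-8.
$pattern = '/caf[eêéèëēĕěẽėẹęẻ]/u';

// Hypothetical sample subjects, only for illustration.
foreach (['cafe', 'café', 'cafè', 'cafë', 'cafx'] as $subject) {
    $matched = preg_match($pattern, $subject) === 1;
    echo $subject, ' => ', $matched ? 'match' : 'no match', PHP_EOL;
}
```

This only covers the letter e; every other letter would need its own hand-written class, which is exactly the maintenance burden described above.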
It is not an option for me to find and replace the diacritic letters in the subject with their non-diacritic counterparts. I need to preserve the subject as-is.
The ideal solution for me is for the regex itself to be diacritic-insensitive. With the example above, I want my regex to simply be: cafe. Is this possible?
Upvotes: 2
If you're open to matching a letter from any language (which includes characters with diacritics), then you could use \p{L} or \p{Letter}, as shown here: https://regex101.com/r/UBGQI6/3
According to regular-expressions.info:
- \p{L} or \p{Letter}: any kind of letter from any language.
- \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
- \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
- \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
- \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
- \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
- \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
The only catch is that \p{L} matches any letter at all, so you can't restrict the match to the variants of one particular letter (such as È for E), and you can't limit your search to English letters either.
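A minimal sketch of how that might look with preg_match, and of the catch just mentioned. The ^ and $ anchors and the sample subjects are my own additions for illustration; the /u modifier is there so that PCRE reads the subject as UTF-8:

```php
<?php
// \p{L} matches any single letter, so it accepts accented variants of "e"
// but also every other letter.
$pattern = '/^caf\p{L}$/u';

// Hypothetical sample subjects, only for illustration.
$results = [];
foreach (['cafe', 'café', 'cafê', 'cafz', 'caf1'] as $subject) {
    $results[$subject] = preg_match($pattern, $subject) === 1;
}
var_dump($results);
```

Here cafe, café and cafê all match, but so does cafz, because z is a letter too. Only caf1 is rejected.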
Upvotes: 1