StackOverflowNewbie

Reputation: 40643

REGEX - how to do diacritic-insensitive in preg_match?

Is there a way to use preg_match (e.g. perhaps via a flag) to do diacritic-insensitive matches?

For example, say I'd like it to match both cafe and café.

I know I can do a regex like this: caf[eé]. This regex will work as long as I don't come across any other diacritic variations of e, like: ê è ë ē ĕ ě ẽ ė ẹ ę ẻ.

Of course, I could just list all of those diacritic variations in my regex, such as caf[eêéèëēĕěẽėẹęẻ]. As long as I don't miss anything, I'll be fine. But I would need to do this for every letter of the alphabet, which is tedious and error-prone.

It is not an option for me to find and replace the diacritic letters in the subject with their non-diacritic counterparts. I need to preserve the subject as-is.

The ideal solution for me is to have the regex itself be diacritic-insensitive. With the example above, I want my regex to simply be: cafe. Is this possible?

Upvotes: 2

Views: 548

Answers (1)

Robo Mop

Reputation: 3553

If you're open to matching a letter from any language (which includes letters with diacritics), then you could use \p{L} or \p{Letter}, as shown here: https://regex101.com/r/UBGQI6/3. In PHP, the pattern needs the u modifier so that the pattern and subject are treated as UTF-8.

According to regular-expressions.info,

\p{L} or \p{Letter}: any kind of letter from any language.

  • \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
  • \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
  • \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
  • \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
  • \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
  • \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
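To make the categories above concrete, here is a minimal PHP sketch that reports which of them a single character falls into. The helper name letterCategories is hypothetical (it is not part of PHP or of the answer), and the code assumes PHP 7+ with a PCRE build that has Unicode property support.

```php
<?php
// Hypothetical helper, for illustration only: returns the short names of
// the letter categories (from the list above) that a single character matches.
function letterCategories(string $char): array
{
    $found = [];
    foreach (['L', 'Ll', 'Lu', 'Lt', 'Lm', 'Lo'] as $category) {
        // The "u" modifier makes PCRE treat pattern and subject as UTF-8.
        if (preg_match('/^\p{' . $category . '}$/u', $char) === 1) {
            $found[] = $category;
        }
    }
    return $found;
}

// "\u{...}" escapes are used so the exact code points are unambiguous.
print_r(letterCategories("\u{00E9}")); // é (lowercase e with acute)
print_r(letterCategories("\u{00C9}")); // É (uppercase E with acute)
print_r(letterCategories("\u{4E2D}")); // 中 (an ideograph, no case)
```

A digit such as 5 matches none of these categories, so the helper returns an empty array for it.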

The only catch is that \p{L} matches any letter at all, so you can't restrict the match to the variants of one particular letter (for example, only E, È, É and so on), and you can't limit it to English letters either.
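A minimal sketch of that catch, using the cafe example from the question. The pattern and the subject list are illustrative assumptions, not taken from the regex101 link, and the accented letters are assumed to be precomposed (a single code point each).

```php
<?php
// Illustrative pattern: "caf" followed by exactly one letter of any kind.
// The "u" modifier is needed for UTF-8 subjects.
$pattern = '/^caf\p{L}$/u';

$subjects = [
    'cafe',
    "caf\u{00E9}", // café
    "caf\u{00E8}", // cafè
    'cafz',        // not a variant of "cafe", but "z" is still a letter
];

$results = [];
foreach ($subjects as $subject) {
    // preg_match returns 1 on a match, 0 on no match.
    $results[$subject] = preg_match($pattern, $subject) === 1;
}

var_dump($results);
```

All four subjects match, including cafz, which is why \p{L} on its own is closer to "any letter" than to "any accented form of e".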

Upvotes: 1
