preg_match_all: Include non-ASCII characters

Question

I use preg_match_all to search for words in a paragraph, but it does not find words containing Cyrillic characters, etc. How can I alter the following to handle all types of characters (English, Cyrillic, accented characters, etc.)?

preg_match_all( '/\b' . $testWord .'\b/i', $content, $matches, PREG_OFFSET_CAPTURE );

I have tried adding the u modifier to the end of the regex, and that seems to work, but I am asking here whether that is best practice or whether there is a better way to write this regex:

preg_match_all( '/\b' . $testWord .'\b/iu', $content, $matches, PREG_OFFSET_CAPTURE );

Thank you

user3942918 · Accepted Answer

Unfortunately, even with the u modifier, the word-boundary shorthand \b can act up (i.e. fail to match where you'd expect it to). You'll want to replace each \b with a negative lookaround that checks for \pL (any letter) or \pM (any combining mark).

Like so:

preg_match_all(
    '/(?
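
A minimal sketch of the lookaround approach described in this answer: a negative lookbehind and a negative lookahead, each rejecting an adjacent \pL or \pM character, take the place of the two \b anchors. The helper name findWholeWord and the sample strings are hypothetical, and the preg_quote() call is an added assumption (it treats $testWord as literal text rather than as a regex fragment).

```php
<?php
// Hypothetical helper, for illustration only: returns every whole-word,
// case-insensitive match of $testWord in $content.
function findWholeWord(string $testWord, string $content): array
{
    // (?<![\pL\pM]) : the match must not be preceded by a letter or a combining mark.
    // (?![\pL\pM])  : the match must not be followed by a letter or a combining mark.
    // preg_quote() is an assumption: it escapes regex metacharacters in the word
    // (and the '/' delimiter) so the word is matched literally.
    $pattern = '/(?<![\pL\pM])' . preg_quote($testWord, '/') . '(?![\pL\pM])/iu';

    preg_match_all($pattern, $content, $matches, PREG_OFFSET_CAPTURE);

    // Each entry is [matched text, offset]; offsets are counted in bytes, not characters.
    return $matches[0];
}

// Sample data (hypothetical): the word appears twice on its own
// and once as the prefix of a longer word, which should be skipped.
$hits = findWholeWord('привет', 'Привет, мир! приветствие привет');

foreach ($hits as [$text, $offset]) {
    echo $text, ' @ byte ', $offset, PHP_EOL;
}
```

Because PREG_OFFSET_CAPTURE reports byte offsets, multibyte text needs care if the offsets are later passed to character-based functions such as mb_substr().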
