Matt M.

Reputation: 822

preg_match_all: Include non-ASCII characters

I use preg_match_all to search for whole words in a paragraph, but it does not find words written in Cyrillic or other non-ASCII characters. How can I alter the call below so that it handles all kinds of characters (English, Cyrillic, accented characters, etc.)?

preg_match_all( '/\b' . $testWord .'\b/i', $content, $matches, PREG_OFFSET_CAPTURE );

I have tried adding the u modifier to the end of the regex, and that seems to work, but I am asking here to find out whether that is best practice or whether there is a better way to write this regex:

preg_match_all( '/\b' . $testWord .'\b/iu', $content, $matches, PREG_OFFSET_CAPTURE );

Thank you

Upvotes: 0

Views: 323

Answers (1)

user3942918

Reputation: 26385

Unfortunately, even with the u modifier, the word boundary shorthand \b can act up (i.e. not match where you'd expect it to). You'll want to replace each \b with a negative lookaround that checks for \pL (any letter) or \pM (any combining mark, such as an accent).

Like so:

preg_match_all(
    '/(?<![\pL\pM])' . $testWord .'(?![\pL\pM])/iu',
    $content,
    $matches,
    PREG_OFFSET_CAPTURE
);
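For illustration, here is a minimal sketch of the same lookaround pattern wrapped in a helper and run against a short Cyrillic sample. The function name findWholeWord and the sample text are hypothetical, not from the question. The sketch also assumes $testWord is a literal word rather than a regex fragment, so it is passed through preg_quote() first; skip that step if you intend $testWord to carry regex syntax. The source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper (not part of the original question or answer).
function findWholeWord(string $testWord, string $content): array
{
    // Assumption: $testWord is literal text, so escape regex metacharacters.
    // The second argument also escapes the '/' delimiter.
    $quoted = preg_quote($testWord, '/');

    // Reject a match if a letter (\pL) or combining mark (\pM)
    // sits directly before or after it.
    $pattern = '/(?<![\pL\pM])' . $quoted . '(?![\pL\pM])/iu';

    preg_match_all($pattern, $content, $matches, PREG_OFFSET_CAPTURE);

    // Each entry is [matched text, offset]; offsets are counted in bytes,
    // not characters, even with the u modifier.
    return $matches[0];
}

// Made-up sample text: "мир" appears as a whole word twice
// (once capitalised) and once inside the longer word "мирный".
$content = 'Привет, мир! Мир велик, а мирный договор подписан.';
$hits = findWholeWord('мир', $content);

foreach ($hits as [$text, $offset]) {
    echo $text, ' at byte offset ', $offset, PHP_EOL;
}
```

Because the offsets are in bytes, pair them with substr() rather than mb_substr() if you need to slice $content around a match.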

Upvotes: 2
