Reputation: 822
I use preg_match_all to search for words in a paragraph, but it does not match words containing Cyrillic characters and the like. How can I alter it to handle all types of characters (English, Cyrillic, accented characters, etc.)?
preg_match_all( '/\b' . $testWord .'\b/i', $content, $matches, PREG_OFFSET_CAPTURE );
I have tried adding the u modifier to the end of the regex, and that seems to work, but I am asking here to see whether that is best practice or whether there is a better way to write the regex I am showing:
preg_match_all( '/\b' . $testWord .'\b/iu', $content, $matches, PREG_OFFSET_CAPTURE );
Thank you
Upvotes: 0
Views: 323
Reputation: 26385
Unfortunately, even with the u modifier, the word boundary shorthand \b can act up (i.e. not match where you'd expect it to). You'll want to replace each \b with a negative lookaround that checks for \pL (any letter) or \pM (any combining accent mark).
Like so:
preg_match_all(
    '/(?<![\pL\pM])' . $testWord . '(?![\pL\pM])/iu',
    $content,
    $matches,
    PREG_OFFSET_CAPTURE
);
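For illustration, here is a minimal sketch of the same lookaround pattern wrapped in a function and run against a short Cyrillic sentence. The function name findWholeWord and the sample text are made up for this example and are not from the question. It also assumes $testWord is a literal word rather than a regex fragment, so it is passed through preg_quote() first; drop that call if you really do want to inject a sub-pattern.

```php
<?php
// Hypothetical helper (name and sample data are assumptions for this sketch):
// returns whole-word, case-insensitive matches of $testWord in $content.
function findWholeWord(string $testWord, string $content): array
{
    // preg_quote() escapes regex metacharacters in the search word;
    // the second argument also escapes the '/' delimiter.
    $pattern = '/(?<![\pL\pM])' . preg_quote($testWord, '/') . '(?![\pL\pM])/iu';

    preg_match_all($pattern, $content, $matches, PREG_OFFSET_CAPTURE);

    // Each entry is [matched text, byte offset into $content].
    return $matches[0];
}

// "миру" and "Мирный" contain the word but continue with a letter,
// so only the standalone "Мир" and "мир" should be reported.
$hits = findWholeWord('мир', 'Мир, миру мир! Мирный.');

foreach ($hits as [$text, $offset]) {
    echo $text, ' @ byte ', $offset, PHP_EOL;
}
```

Note that the offsets returned with PREG_OFFSET_CAPTURE are byte offsets into the UTF-8 string, not character positions, so they line up with substr() rather than mb_substr().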
Upvotes: 2