twharmon

Reputation: 4272

preg_replace isn't working for some words/characters

$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$str = preg_replace('/(?<=\b)موسیٰ(?=\b)/u', 'Musa', $str);
$str = preg_replace('/(?<=\b)سنا(?=\b)/u', 'suna', $str);
echo $str;

This fails to replace موسیٰ. It should give کس نے Musa کے بارے میں suna ہے؟ but instead gives کس نے موسیٰ کے بارے میں suna ہے؟.

This happens for every word that ends with ٰ, such as تعالیٰ. It works for words where ٰ is in the middle of the word (no words begin with ٰ). Does this mean that \b just doesn't work with ٰ? Is it a bug?

Upvotes: 5

Views: 415

Answers (3)

Wiktor Stribiżew

Reputation: 626870

The reason is that a word boundary matches in the following positions:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

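The "offending" symbol is U+0670 ARABIC LETTER SUPERSCRIPT ALEF, which belongs to \p{Mn} (the nonspacing mark Unicode category) and is therefore a non-word character. Next to a non-word character, \b matches only if the character on the other side belongs to \w (letter, digit, _). In your string ٰ is followed by a space, so there is no word boundary at the end of موسیٰ.

Use unambiguous boundaries instead, which match only when the search phrase is not preceded or followed by a word character: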
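The classification can be checked directly. This is a minimal sketch: the first check tests the Unicode category, the second asks the PCRE build in use whether it counts the character as a word character. The \w result is printed rather than assumed, because (as far as I know) newer PCRE2 releases may treat nonspacing marks as word characters in this mode, so it can differ between PHP installations.

```php
<?php
// U+0670 ARABIC LETTER SUPERSCRIPT ALEF, written as an escape so it is visible.
$mark = "\u{0670}";

// 1 if the character is in the nonspacing mark category (\p{Mn}).
$isMark = preg_match('/^\p{Mn}$/u', $mark);

// 1 if this PCRE build treats it as a word character under the u modifier.
$isWord = preg_match('/^\w$/u', $mark);

echo "Mn: $isMark, \\w: $isWord\n";
```

If the second value is 0, \b cannot match between ٰ and a following space, which is the behaviour described in the question.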

$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$str = preg_replace('/(?<!\w)موسیٰ(?!\w)/u', 'Musa', $str);
$str = preg_replace('/(?<!\w)سنا(?!\w)/u', 'suna', $str);
echo $str; // => کس نے Musa کے بارے میں suna ہے؟


The (?<!\w) is a negative lookbehind making sure there is no word char immediately before the subsequent consuming pattern, and (?!\w) is a negative lookahead that makes sure there is no word char immediately after the preceding consuming pattern.
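If many words need replacing, the two lookarounds can be wrapped in a small helper. This is only a sketch: replace_whole_word is a hypothetical name, not a built-in, and it assumes the replacement text contains no $ or \ sequences that preg_replace would read as backreferences.

```php
<?php
// Hypothetical helper: replace $word only where it does not touch a word character.
function replace_whole_word(string $word, string $replacement, string $subject): string
{
    // preg_quote() escapes regex metacharacters in the word; '/' is the delimiter.
    $pattern = '/(?<!\w)' . preg_quote($word, '/') . '(?!\w)/u';

    // preg_replace() returns null on error; fall back to the unchanged subject.
    return preg_replace($pattern, $replacement, $subject) ?? $subject;
}

$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$str = replace_whole_word('موسیٰ', 'Musa', $str);
$str = replace_whole_word('سنا', 'suna', $str);
echo $str, "\n";
```

Both words are surrounded by spaces in the sample string, so both lookarounds succeed and both words are replaced.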

Upvotes: 1

revo

Reputation: 48711

Be careful that:

\b and \B... are defined in terms of \w and \W.

By default, \w matches only ASCII word characters. With the (*UCP) option or the u (Unicode) modifier, the definition of \w widens to include letters and digits from other scripts, but not combining marks.

That said, \b never matches between a mark like ٰ and a non-word character, since the mark itself is treated as a non-word character.

What you are really trying to find out is whether anything other than whitespace precedes or follows the word موسیٰ, so negative lookarounds on \S (any non-whitespace character) do the job:

(?<!\S)موسیٰ(?!\S)
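Note that whitespace boundaries are stricter than word-character boundaries: punctuation directly next to the word blocks the match. The sketch below illustrates the difference on a made-up input, the word immediately followed by the Arabic question mark (U+061F); the variable names are only for illustration.

```php
<?php
$word = 'موسیٰ';
// The word directly followed by the Arabic question mark (U+061F), with no space.
$text = $word . "\u{061F}";

// Whitespace boundaries: the question mark is non-whitespace, so (?!\S) fails.
$strict = preg_replace('/(?<!\S)' . $word . '(?!\S)/u', 'Musa', $text);

// Word-character boundaries: the question mark is not a word character, so this matches.
$loose = preg_replace('/(?<!\w)' . $word . '(?!\w)/u', 'Musa', $text);

var_dump($strict === $text);                // bool(true): nothing was replaced
var_dump($loose === 'Musa' . "\u{061F}");   // bool(true): the word was replaced
```

Which behaviour is right depends on whether a word followed by punctuation should still count as a whole word.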

Another way to accomplish this is to transliterate the whole input string with the ICU library (PHP's intl extension) to remove all nonspacing marks, and then match the word موسی, which no longer includes the combining mark ٰ:

<?php

$strings = [
    'is' => 'کس نے موسیٰ کے بارے میں سنا ہے؟', // input string
    'wts' => 'موسیٰ' // word to search
];

array_walk($strings, function(&$value) {
    $value = transliterator_transliterate('[:Nonspacing Mark:] Remove;', $value);
});

// word boundaries can now be used; preg_quote() guards against regex metacharacters in the word
echo preg_replace('/\b' . preg_quote($strings['wts'], '/') . '\b/u', 'musa', $strings['is']);

Outputs:

کس نے musa کے بارے میں سنا ہے؟
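If the intl extension is not available, the same idea can be approximated with PCRE alone by deleting every \p{Mn} character. This is a swapped-in technique rather than the transliterator shown above, and strip_marks is a hypothetical helper name. Like the transliterator version, it assumes that losing all nonspacing marks in the output is acceptable.

```php
<?php
// Hypothetical helper: remove all nonspacing marks (\p{Mn}) from a string.
function strip_marks(string $s): string
{
    // preg_replace() returns null on error; fall back to the original string.
    return preg_replace('/\p{Mn}+/u', '', $s) ?? $s;
}

$input = strip_marks('کس نے موسیٰ کے بارے میں سنا ہے؟');
$word  = strip_marks('موسیٰ');

// With the mark gone the word ends in a letter, so \b works on both sides.
$out = preg_replace('/\b' . preg_quote($word, '/') . '\b/u', 'musa', $input);
echo $out, "\n";
```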

Upvotes: 0

mickmackusa

Reputation: 47894

Code:

$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$patterns = ['/موسیٰ/u', '/سنا/u'];
$replacements = ['Musa', 'suna'];
echo preg_replace($patterns, $replacements, $str);

Perhaps we can spoof the word boundary by checking for space or start/end of line for the first pattern?

$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$patterns = [
    '/(?<= |^)موسیٰ(?= |$)/u', // \s could be used in place of the literal space
    '/\bسنا\b/u',
];
$replacements = ['Musa', 'suna'];
echo preg_replace($patterns, $replacements, $str);

Output:

کس نے Musa کے بارے میں suna ہے؟

Upvotes: 1
