Reputation: 4272
$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$str = preg_replace('/(?<=\b)موسیٰ(?=\b)/u', 'Musa', $str);
$str = preg_replace('/(?<=\b)سنا(?=\b)/u', 'suna', $str);
echo $str;
This fails to replace موسیٰ. It should give کس نے Musa کے بارے میں suna ہے؟ but instead gives کس نے موسیٰ کے بارے میں suna ہے؟.
This is happening for all words that end with a ٰ, like تعالیٰ. It works for words where ٰ is in the middle of the word (no words begin with a ٰ). Does this mean that \b just doesn't work with ٰ? Is it a bug?
Upvotes: 5
Views: 415
Reputation: 626870
The reason is that a word boundary matches in the following positions:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
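The three rules above can be checked on a small ASCII string. This is only an illustrative sketch: the sample text and variable names are made up for the example, and it simply lists the byte offsets at which \b matches.

```php
<?php
// Hypothetical ASCII sample: two words separated by a comma and a space.
$sample = 'ab, cd';

// \b is zero-width, so every match is an empty string; only the offsets matter.
preg_match_all('/\b/', $sample, $matches, PREG_OFFSET_CAPTURE);
$offsets = array_column($matches[0], 1);

// Start of string, after "b", before "c", end of string.
echo implode(', ', $offsets), PHP_EOL; // 0, 2, 4, 6
```

Note that there is no boundary at offset 3, between the comma and the space, because both characters are non-word characters.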
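The "offending" symbol is U+0670 ARABIC LETTER SUPERSCRIPT ALEF, which belongs to \p{Mn} (the nonspacing mark Unicode category) and is thus a non-word symbol. After it, \b matches only if the next character belongs to \w (letter, digit, _). In your string the next character is a space, so both sides of that position are non-word characters and there is no boundary.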
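One way to confirm the category of the mark is to test it against Unicode property classes directly. The snippet below is a small sketch with illustrative variable names; it uses \p{Mn} and \p{L} rather than \w, so it does not depend on how a particular PCRE build defines \w.

```php
<?php
// U+0670 ARABIC LETTER SUPERSCRIPT ALEF, written as an escape so it is visible in source.
$mark = "\u{0670}";

$isNonspacingMark = preg_match('/^\p{Mn}$/u', $mark) === 1;
$isLetter         = preg_match('/^\p{L}$/u', $mark) === 1;

// The word from the question ends with a nonspacing mark.
$endsWithMark = preg_match('/\p{Mn}$/u', 'موسیٰ') === 1;

var_dump($isNonspacingMark); // bool(true)
var_dump($isLetter);         // bool(false)
var_dump($endsWithMark);     // bool(true)
```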
Use unambiguous boundaries, which match only if the search phrase is not preceded or followed by a word character:
$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$str = preg_replace('/(?<!\w)موسیٰ(?!\w)/u', 'Musa', $str);
$str = preg_replace('/(?<!\w)سنا(?!\w)/u', 'suna', $str);
echo $str; // => کس نے Musa کے بارے میں suna ہے؟
The (?<!\w) is a negative lookbehind that makes sure there is no word char immediately before the subsequent consuming pattern, and (?!\w) is a negative lookahead that makes sure there is no word char immediately after the preceding consuming pattern.
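If the same replacement has to be done for many words, the lookarounds can be wrapped in a helper. The following is a sketch under assumptions, not a definitive implementation: replaceWholeWord() is a hypothetical name, and the character class it uses spells out letters, digits, underscore and combining marks explicitly instead of relying on \w, since how \w treats marks may differ between PCRE builds.

```php
<?php
/**
 * Hypothetical helper: replace $word only where it is not adjacent to a
 * letter, digit, underscore, or combining mark.
 */
function replaceWholeWord(string $text, string $word, string $replacement): string
{
    // Characters treated as "part of a word" in this sketch (an assumption).
    $wordChar = '[\p{L}\p{N}\p{M}_]';
    $pattern  = '/(?<!' . $wordChar . ')' . preg_quote($word, '/') . '(?!' . $wordChar . ')/u';

    // preg_replace() returns null on error; fall back to the unchanged text.
    // Note: "$" and "\" in $replacement keep their usual backreference meaning.
    return preg_replace($pattern, $replacement, $text) ?? $text;
}

$input  = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$words  = ['موسیٰ' => 'Musa', 'سنا' => 'suna'];
$result = $input;
foreach ($words as $word => $replacement) {
    $result = replaceWholeWord($result, $word, $replacement);
}
echo $result, PHP_EOL;
```

Including \p{M} in the class means a base word is not replaced when it is immediately followed by a combining mark, which is a design choice of this sketch.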
Upvotes: 1
Reputation: 48711
\b and \B ... are defined in terms of \w and \W.
By default, \w matches only ASCII word characters. With the (*UCP) option or the u (Unicode) modifier, the definition of \w is extended to include letters and digits from other scripts, but not combining marks.
That said, \b never matches between a mark like ٰ and a following non-word character, since the mark itself is considered a non-word character.
What you are really trying to do is check that the word موسیٰ is not preceded or followed by a non-whitespace character, so negative lookarounds on the \S meta-character do the job:
(?<!\S)موسیٰ(?!\S)
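A short sketch of how this pattern behaves may help. The second input is a hypothetical one added for illustration: the word is directly followed by the Arabic question mark, which counts as a non-whitespace character, so the lookahead rejects the match.

```php
<?php
$word    = 'موسیٰ';
$pattern = '/(?<!\S)' . preg_quote($word, '/') . '(?!\S)/u';

// Surrounded by spaces: both lookarounds succeed and the word is replaced.
$spaced = preg_replace($pattern, 'Musa', 'کس نے موسیٰ کے بارے میں سنا ہے؟');

// Hypothetical input: the word is directly followed by "؟" (U+061F),
// which is non-whitespace, so (?!\S) fails and nothing is replaced.
$withPunctuation = $word . '؟';
$punctuated      = preg_replace($pattern, 'Musa', $withPunctuation);

echo $spaced, PHP_EOL;
echo $punctuated, PHP_EOL;
```

In other words, this pattern treats only whitespace and the string edges as delimiters, which is stricter than a word boundary.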
Another way to accomplish this would be to transliterate the whole input string with the ICU library to remove all nonspacing marks, then match the word موسی, which doesn't include the combining mark ٰ:
<?php
// Requires the intl extension for transliterator_transliterate().
$strings = [
    'is'  => 'کس نے موسیٰ کے بارے میں سنا ہے؟', // input string
    'wts' => 'موسیٰ', // word to search
];
// Strip nonspacing marks from both the input and the search word.
array_walk($strings, function (&$value) {
    $value = transliterator_transliterate('[:Nonspacing Mark:] Remove;', $value);
});
// Word boundaries can now be used.
echo preg_replace('/\b' . preg_quote($strings['wts'], '/') . '\b/u', 'musa', $strings['is']);
Outputs:
کس نے musa کے بارے میں سنا ہے؟
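Where the intl extension is not available, a similar effect can probably be had with PCRE alone. This is a swapped-in technique rather than the ICU transliterator used above: stripNonspacingMarks() is a hypothetical helper that deletes every \p{Mn} character before matching.

```php
<?php
// Hypothetical helper: remove nonspacing marks with PCRE instead of ICU.
function stripNonspacingMarks(string $text): string
{
    // preg_replace() returns null on error; fall back to the unchanged text.
    return preg_replace('/\p{Mn}+/u', '', $text) ?? $text;
}

$input = stripNonspacingMarks('کس نے موسیٰ کے بارے میں سنا ہے؟');
$word  = stripNonspacingMarks('موسیٰ');

// After stripping, the word ends in a letter, so \b can be used on both sides.
$result = preg_replace('/\b' . preg_quote($word, '/') . '\b/u', 'musa', $input);
echo $result, PHP_EOL;
```

As with the transliterator approach, the marks are removed from the whole output, not just from the matched word.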
Upvotes: 0
Reputation: 47894
Code:
$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$patterns=['/موسیٰ/u','/سنا/u'];
$replacements=['Musa','suna'];
echo preg_replace($patterns,$replacements,$str);
Perhaps we can spoof the word boundary by checking for space or start/end of line for the first pattern?
$str = 'کس نے موسیٰ کے بارے میں سنا ہے؟';
$patterns = [
    '/(?<= |^)موسیٰ(?= |$)/u', // or perhaps \s instead of the blank space
    '/\bسنا\b/u',
];
$replacements = ['Musa', 'suna'];
echo preg_replace($patterns, $replacements, $str);
Output:
کس نے Musa کے بارے میں suna ہے؟
Upvotes: 1