Reputation: 18010
In PHP, a diacritic before or after a letter creates a word boundary (\b), which is not the behavior I want. Is this normal in other programming languages? (I know most languages have issues with \b and \w.) How can I solve this effectively?
From a Unicode perspective, which Unicode categories create word boundaries?
Here is an example:
<?php
preg_match_all('#\bج\b#u','مَجْل',$t); // the font of this site does not display diacritics
var_dump($t);
Upvotes: 2
Views: 170
Reputation: 18010
In PCRE, when Unicode properties are used for the character types (the PCRE_UCP option):
\d any character that \p{Nd} matches (decimal digit)
\s any character that \p{Z} matches, plus HT, LF, FF, CR
\w any character that \p{L} or \p{N} matches, plus underscore
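From the definition of \w you can infer how \b behaves in Unicode mode: a boundary is reported wherever a \w character sits next to a non-\w character. Combining marks (\p{M}) are not matched by \w, so even for the string Åström written with decomposed characters, which logically has only two word boundaries, multiple word boundaries are detected: *A*̊*stro*̈*m*.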
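The effect can be checked with a small PHP sketch. It assumes PHP 7 or later (for the "\u{...}" string escapes), and the variable names are only illustrative. It lists the byte offsets at which \b matches in the decomposed string:

```php
<?php
// "Åström" in decomposed form: A + U+030A (combining ring above),
// o + U+0308 (combining diaeresis).
$word = "A\u{030A}stro\u{0308}m";

// \b is zero-width, so every match is an empty string; PREG_OFFSET_CAPTURE
// records the byte offset at which each boundary was found.
preg_match_all('/\b/u', $word, $m, PREG_OFFSET_CAPTURE);
$offsets = array_column($m[0], 1);

echo count($offsets), " boundaries at byte offsets: ", implode(', ', $offsets), "\n";
```

This should report six boundaries, one for each asterisk in *A*̊*stro*̈*m*, instead of the two a reader would expect.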
Upvotes: 1
Reputation: 14921
This is just a workaround:
preg_match_all('#(\p{M}*\p{Arabic}*\p{M}*)*ج(\p{M}*\p{Arabic}*\p{M}*)*#u','مَجْل جميل testجواد',$t); // the font of this site does not display diacritics
print_r(array_filter(array_map('array_filter', $t)));
Output:
Array
(
    [0] => Array
        (
            [0] => مَجْل
            [1] => جميل
            [2] => جواد
        )

)
I found out that \p{M} matches tashkil (the diacritical marks), and \p{Arabic} matches Arabic letters.
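If the goal is to match a letter only when it stands alone as a word, rather than to capture the whole surrounding word, a mark-aware boundary can be approximated with lookarounds. This is only a sketch and not part of the workaround above: the helper name matchWholeWord is hypothetical, and treating \p{L}, \p{N}, \p{M} and underscore as word characters is an assumption that may need adjusting. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: find $needle only where it stands alone as a word,
// counting combining marks (\p{M}) as part of the word they attach to.
function matchWholeWord(string $needle, string $subject): array
{
    $wordChar = '[\p{L}\p{N}\p{M}_]';
    $pattern = '#(?<!' . $wordChar . ')'   // not preceded by a word character or mark
             . preg_quote($needle, '#')
             . '\p{M}*'                     // marks attached to the last letter
             . '(?!' . $wordChar . ')#u';   // not followed by a word character or mark
    preg_match_all($pattern, $subject, $matches);
    return $matches[0];
}

$jeem   = "\u{062C}";                                  // jeem
$inside = "\u{0645}\u{064E}\u{062C}\u{0652}\u{0644}";  // the word from the question, with diacritics
$alone  = "\u{062D}\u{0631}\u{0641} \u{062C} \u{0647}\u{0646}\u{0627}"; // jeem as a separate word

echo count(matchWholeWord($jeem, $inside)), "\n"; // jeem is inside a word
echo count(matchWholeWord($jeem, $alone)), "\n";  // jeem stands alone
```

Including \p{M} in the lookaround class is what keeps a diacritic from being read as the end of the word; the plain \b has no such rule.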
Upvotes: 0