Reputation: 18010

Word boundary in PHP

In PHP, diacritics before and after letters create word boundaries (\b), which is not the desired behavior. Is this normal in other programming languages? (I know most languages have issues with \b and \w.) How can I solve this effectively?

From a Unicode perspective, which Unicode categories create word boundaries?

Here is an example:

<?php
preg_match_all('#\bج\b#u', 'مَجْل', $t); // the font of this site does not display diacritics
var_dump($t);

Upvotes: 2

Answers (2)

Real Dreams

Reputation: 18010

In PCRE:

\d any character that \p{Nd} matches (decimal digit)

\s any character that \p{Z} matches, plus HT, LF, FF, CR

\w any character that \p{L} or \p{N} matches, plus underscore

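From the definition of \w you can infer how \b behaves in Unicode mode: combining marks (\p{M}) are not word characters, so a boundary is reported between a letter and any mark next to it. Even the string Åström (written with decomposed characters), which logically has only two word boundaries, ends up with several: *A*̊*stro*̈*m*.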
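To make that inference concrete, here is a minimal sketch that applies the \w definition quoted above by hand instead of calling \b, so the result does not depend on how a particular PCRE build treats marks. The function name is_word_char is made up for this illustration.

```php
<?php
// Decomposed "Åström": A + U+030A (ring above), o + U+0308 (diaeresis).
$s = "A\u{030A}stro\u{0308}m";

// Split into code points without relying on the mbstring extension.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

// Hypothetical helper: "word character" as defined above,
// i.e. \p{L} or \p{N} or underscore (marks are NOT included).
function is_word_char(string $c): bool
{
    return preg_match('/^[\p{L}\p{N}_]$/u', $c) === 1;
}

// A boundary sits wherever word-ness changes between neighbours;
// the start and end of the string count as non-word.
$marked = '';
$boundaries = 0;
$prev = false;
foreach ($chars as $c) {
    $cur = is_word_char($c);
    if ($cur !== $prev) {
        $marked .= '|';
        $boundaries++;
    }
    $marked .= $c;
    $prev = $cur;
}
if ($prev) {
    $marked .= '|';
    $boundaries++;
}

echo $boundaries, PHP_EOL; // number of boundaries under this definition
echo $marked, PHP_EOL;     // the string with '|' at each boundary
```

Under this definition the two combining marks each split the word, which is why six boundaries are counted instead of the two a reader would expect.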

Upvotes: 1

HamZa

Reputation: 14921

This is just a workaround:

preg_match_all('#(\p{M}*\p{Arabic}*\p{M}*)*ج(\p{M}*\p{Arabic}*\p{M}*)*#u','مَجْل جميل testجواد',$t); // the font of this site does not display diacritics
print_r(array_filter(array_map('array_filter', $t)));

Output:

Array
(
    [0] => Array
        (
            [0] => مَجْل
            [1] => جميل
            [2] => جواد
        )

)

I found out that \p{M} matches tashkil (the Arabic diacritics), and \p{Arabic} matches an Arabic letter.
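If the goal is to keep something close to \b rather than capture the whole surrounding word, another possible sketch is to replace \b with lookarounds that treat combining marks as word characters. The helper name count_whole_word is hypothetical, and this version deliberately does not match a letter that carries its own diacritic; that would need an extra \p{M}* after the word.

```php
<?php
// Hypothetical helper: counts occurrences of $word that are not attached
// to a letter, combining mark, number, or underscore on either side.
function count_whole_word(string $word, string $text): int
{
    // "Word character" here includes \p{M}, unlike the \w definition above.
    $wc = '[\p{L}\p{M}\p{N}_]';
    $pattern = '#(?<!' . $wc . ')' . preg_quote($word, '#') . '(?!' . $wc . ')#u';
    return (int) preg_match_all($pattern, $text);
}

$jeem = "\u{062C}";                                 // ج
$majl = "\u{0645}\u{064E}\u{062C}\u{0652}\u{0644}"; // مَجْل (with fatha and sukun)

var_dump(count_whole_word($jeem, $majl));       // ج sits inside a word
var_dump(count_whole_word($jeem, "a $jeem b")); // ج stands alone
```

With the question's example, ج is preceded by a fatha and followed by a sukun, so both lookarounds reject it and nothing is matched; the standalone ج is matched once.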

Upvotes: 0
