Real Dreams

Reputation: 18010

Word boundary in PHP

In PHP, diacritics before and after letters create a word boundary (\b), which is not the desired behavior. Is this normal in other programming languages? (I know most languages have issues with \b and \w.) How can I solve this issue effectively?

From a Unicode perspective, which Unicode categories create word boundaries?

Here is an example:

<?php
preg_match_all('#\bج\b#u','مَجْل',$t); // the font of this site does not display diacritics
var_dump($t);

Upvotes: 2

Views: 170

Answers (2)

Real Dreams

Reputation: 18010

In PCRE:

\d any character that \p{Nd} matches (decimal digit)

\s any character that \p{Z} matches, plus HT, LF, FF, CR

\w any character that \p{L} or \p{N} matches, plus underscore

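From the definition of \w you can infer how \b behaves in Unicode mode: combining marks are neither letters nor numbers, so they do not count as word characters. So even for the string Åström (with decomposed characters), which logically has only two word boundaries, multiple word boundaries are detected: *A*̊*stro*̈*m*.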
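To make that inference concrete, here is a minimal sketch that counts boundary positions in the decomposed string. It spells out \b with lookarounds over an explicit character class rather than relying on \w itself, because what \w covers may differ between PCRE versions. The helper name countBoundaries and the second, mark-aware class are my own assumptions for illustration, not something PCRE provides.

```php
<?php
// Decomposed "Åström": A + U+030A (ring above), o + U+0308 (diaeresis).
// Escapes are used so the combining marks survive copy and paste.
$word = "A\u{030A}stro\u{0308}m";

// Hypothetical helper: counts the positions where exactly one neighbour
// belongs to $class, i.e. \b spelled out with lookarounds.
function countBoundaries(string $class, string $subject): int
{
    $pattern = "/(?<=$class)(?!$class)|(?<!$class)(?=$class)/u";
    return (int) preg_match_all($pattern, $subject);
}

// \w as defined above: letters, numbers, underscore.
$plain = countBoundaries('[\p{L}\p{N}_]', $word);

// Assumption: additionally treat marks (\p{M}) as word characters.
$markAware = countBoundaries('[\p{L}\p{N}\p{M}_]', $word);

echo "letters and numbers only: $plain\n"; // 6
echo "marks included: $markAware\n";       // 2
```

With the narrower class, every combining mark splits the word, which gives the six positions marked above. Once marks are counted as word characters, only the start and the end remain.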

Upvotes: 1

HamZa

Reputation: 14921

This is just a workaround:

preg_match_all('#(\p{M}*\p{Arabic}*\p{M}*)*ج(\p{M}*\p{Arabic}*\p{M}*)*#u','مَجْل جميل testجواد',$t); // the font of this site does not display diacritics
print_r(array_filter(array_map('array_filter', $t)));

Output:

Array
(
    [0] => Array
        (
            [0] => مَجْل
            [1] => جميل
            [2] => جواد
        )

)

I found out that \p{M} matches tashkil (Arabic diacritics), and \p{Arabic} matches an Arabic letter.
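Building on that observation, one possible alternative is to write the word boundary out with lookarounds that count marks as part of the word. This is only a sketch: the pattern and the variable names are my own, and treating every \p{M} character as a word character is an assumption that may not suit every script. The test strings use code point escapes so the diacritics are not lost in display.

```php
<?php
// Word characters for this sketch: letters, numbers, marks, underscore.
// Counting \p{M} as a word character is the assumption that keeps
// tashkil from splitting a word.
$w = '[\p{L}\p{N}\p{M}_]';

// U+062C is the letter jeem. It may carry its own marks, but it must not
// touch any other word character on either side.
$pattern = "#(?<!$w)\u{062C}\p{M}*(?!$w)#u";

// meem + fatha + jeem + sukun + lam: jeem sits inside a word.
$inside = "\u{0645}\u{064E}\u{062C}\u{0652}\u{0644}";

// jeem + sukun standing alone between spaces.
$alone = "x \u{062C}\u{0652} y";

var_dump(preg_match($pattern, $inside)); // int(0)
var_dump(preg_match($pattern, $alone));  // int(1)
```

Unlike the workaround, this only tests the characters next to the letter, so it returns the letter itself rather than the whole surrounding word.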

Upvotes: 0
