stack

Reputation: 10228

Why does the pattern match only one word when there are two identical words?

Please take a look at this:

[screenshot: regex101 result]

As you can see, regex101 finds just one match, but the browser matches two words that look identical. So why can't regex101 match the second word? In any case, I need to match both words (or more, if they exist).

Note that this isn't related to the g flag, because I've used it in the fiddle.

Here is the fiddle

Upvotes: 4

Views: 93

Answers (2)

revo

Reputation: 48751

Text like this is hard to work with later. With @Wiktor's solution, you have to look up the different representations of each letter every time you change the search word from مجلس to something else, such as احمدی نژاد.

That's where normalization comes in handy:

Normalization is a process that involves transforming characters and sequences of characters into a formally-defined underlying representation. This process is most important when text needs to be compared for sorting and searching, but it is also used when storing text to ensure that the text is stored in a consistent representation.

First, normalize the input string with Normalizer::normalize() (or its procedural alias normalizer_normalize()). After that, we can safely run preg_match_all over it without changing the regular expression:

<?php

$text = <<< 'STR'
یک نماینده مجلس عنوان کرد: ﺩﺭ ﺩﻭﺭﻩ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﻣﺮﺩﻡ ﺩﺭ
ﺭﻓﺎﻩ ﺑﻮﺩﻧﺪ !/دولت سابق تنها دولتی که پس از انقلاب به مردم خدمت کرد! ﻳﻚ
ﻧﻤﺎﯾﻨﺪﻩ ﮔﺮﻭﻩ ﭘﺎﻳﺪﺍﺭی دﺭ ﻣﺠﻠﺲ ﺷﻮﺭﺍﯼ ﺍﺳﻼﻣﯽ ﺩﺭ ﭘﺎﺳﺦ ﺑﻪ ﺳﺆﺍﻟﯽ ﺩﺭ ﻣﻮﺭﺩ
ﺑﺎﺯﮔﺸﺖ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﺑﻪ ﻋﺮﺻﻪ ﺍﻧﺘﺨﺎﺑﺎﺕ ﺍﻇﻬﺎﺭ ﺩﺍﺷﺖ : ﻣﺎ ﺍﻣﯿﺪﻭﺍﺭﯾﻢ ﺍﯾﻦ ﺍﺗﻔﺎﻕ
ﺑﯿﻔﺘﺪ ﻭ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﺑﺮﺍﯼ ﺷﺮﮐﺖ ﺩﺭ ﺍﻧﺘﺨﺎﺑﺎﺕ ﺣﺎﺿﺮ ﺷﻮﺩ چرا که دولت وی تنها
دولتی است که پس از انقلاب به مردم خدمت کرده است.
STR;


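// NFKC folds the Arabic presentation forms into their base letters.
$normalizedText = normalizer_normalize($text, Normalizer::NFKC);
// The u flag makes PCRE treat the pattern and the subject as UTF-8.
preg_match_all('~مجلس~u', $normalizedText, $matches);

print_r($matches);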

Outputs:

Array
(
    [0] => Array
        (
            [0] => مجلس
            [1] => مجلس
        )

)

Note: this requires the intl extension to be enabled (php_intl.dll on Windows).

Live demo
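If the search word itself might be pasted in with presentation forms, the same idea can be applied to both sides. The following is only a minimal sketch, assuming the intl extension is loaded; findWord is a hypothetical helper name, not part of any library:

```php
<?php
// Hypothetical helper: find a word no matter whether the text or the word
// itself was written with Arabic presentation forms.
// Assumes the intl extension (Normalizer class) is available.
function findWord(string $word, string $text): array
{
    // Normalize both sides to NFKC so they use the same base letters.
    $normWord = Normalizer::normalize($word, Normalizer::NFKC);
    $normText = Normalizer::normalize($text, Normalizer::NFKC);
    if ($normWord === false || $normText === false) {
        throw new InvalidArgumentException('Input could not be normalized.');
    }

    // preg_quote keeps the word literal; the u flag enables UTF-8 mode.
    preg_match_all('~' . preg_quote($normWord, '~') . '~u', $normText, $m);
    return $m[0];
}

$base         = "\u{0645}\u{062C}\u{0644}\u{0633}"; // base letters
$presentation = "\u{FEE3}\u{FEA0}\u{FEE0}\u{FEB2}"; // presentation forms
$text         = $base . ' ... ' . $presentation;

echo count(findWord($base, $text)), PHP_EOL;         // 2
echo count(findWord($presentation, $text)), PHP_EOL; // 2
```

Keep in mind that the matches (and any offsets) refer to the normalized copy, not the original string, and that NFKC also rewrites other compatibility characters in the text.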

Upvotes: 3

Wiktor Stribiżew

Reputation: 627190

The two words are written with different characters that look the same but have different Unicode code points.

  • The first word: \u0645\u062C\u0644\u0633
  • The one not matched (with ... FORM in the name): \uFEE3\uFEA0\uFEE0\uFEB2

Here are the codes:

‎FEE3  ARABIC LETTER MEEM INITIAL FORM
‎0645  ARABIC LETTER MEEM

‎FEA0  ARABIC LETTER JEEM MEDIAL FORM
‎062C  ARABIC LETTER JEEM

‎FEE0  ARABIC LETTER LAM MEDIAL FORM
‎0644  ARABIC LETTER LAM

‎FEB2  ARABIC LETTER SEEN FINAL FORM
‎0633  ARABIC LETTER SEEN
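To check this kind of thing yourself, you can dump the code points of each word. This is just a sketch, assuming the mbstring extension and PHP 7.4+; codePoints is a hypothetical helper name:

```php
<?php
// Hypothetical helper: list the code points of a UTF-8 string as U+XXXX.
// Assumes the mbstring extension (mb_str_split needs PHP 7.4+).
function codePoints(string $s): array
{
    return array_map(
        fn(string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
        mb_str_split($s, 1, 'UTF-8')
    );
}

$base         = "\u{0645}\u{062C}\u{0644}\u{0633}"; // written with base letters
$presentation = "\u{FEE3}\u{FEA0}\u{FEE0}\u{FEB2}"; // written with presentation forms

echo implode(' ', codePoints($base)), PHP_EOL;         // U+0645 U+062C U+0644 U+0633
echo implode(' ', codePoints($presentation)), PHP_EOL; // U+FEE3 U+FEA0 U+FEE0 U+FEB2
var_dump($base === $presentation);                     // bool(false)
```

The two strings render the same way, but they are different byte sequences, so a literal pattern built from one cannot match the other.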

You cannot match both with a literal representation of either word. You need to either use an alternation with the two (or all) variants, or use character classes for those characters:

[\x{FEE3}\x{0645}][\x{FEA0}\x{062C}][\x{FEE0}\x{0644}][\x{FEB2}\x{0633}]

See the regex demo.

A PHP demo:

$re = '/[\x{FEE3}\x{0645}][\x{FEA0}\x{062C}][\x{FEE0}\x{0644}][\x{FEB2}\x{0633}]/u';
$str = 'یک نماینده مجلس عنوان کرد: ﺩﺭ ﺩﻭﺭﻩ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﻣﺮﺩﻡ ﺩﺭ ﺭﻓﺎﻩ ﺑﻮﺩﻧﺪ !/دولت سابق تنها دولتی که پس از انقلاب به مردم خدمت کرد! ﻳﻚ ﻧﻤﺎﯾﻨﺪﻩ ﮔﺮﻭﻩ ﭘﺎﻳﺪﺍﺭی دﺭ ﻣﺠﻠﺲ ﺷﻮﺭﺍﯼ ﺍﺳﻼﻣﯽ ﺩﺭ ﭘﺎﺳﺦ ﺑﻪ ﺳﺆﺍﻟﯽ ﺩﺭ ﻣﻮﺭﺩ ﺑﺎﺯﮔﺸﺖ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﺑﻪ ﻋﺮﺻﻪ ﺍﻧﺘﺨﺎﺑﺎﺕ ﺍﻇﻬﺎﺭ ﺩﺍﺷﺖ : ﻣﺎ ﺍﻣﯿﺪﻭﺍﺭﯾﻢ ﺍﯾﻦ ﺍﺗﻔﺎﻕ ﺑﯿﻔﺘﺪ ﻭ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﺑﺮﺍﯼ ﺷﺮﮐﺖ ﺩﺭ ﺍﻧﺘﺨﺎﺑﺎﺕ ﺣﺎﺿﺮ ﺷﻮﺩ چرا که دولت وی تنها دولتی است که پس از انقلاب به مردم خدمت کرده است.';
preg_match_all($re, $str, $matches);
print_r($matches[0]);

Output:

Array
(
    [0] => مجلس
    [1] => ﻣﺠﻠﺲ
)
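Writing the character classes by hand gets tedious for longer words. As a rough sketch, they could be derived from the word instead. This assumes the intl and mbstring extensions, only scans the Arabic Presentation Forms-B block (U+FE70..U+FEFF), and buildPattern is a hypothetical helper name:

```php
<?php
// Hypothetical helper: build a regex where every letter of $word becomes a
// character class holding the letter plus its presentation forms.
// Assumes intl (Normalizer) and mbstring; covers only U+FE70..U+FEFF.
function buildPattern(string $word): string
{
    // Map each base letter to the presentation forms that NFKC folds into it.
    $variants = [];
    for ($cp = 0xFE70; $cp <= 0xFEFF; $cp++) {
        $form = mb_chr($cp, 'UTF-8');
        if ($form === false) {
            continue;
        }
        $base = Normalizer::normalize($form, Normalizer::NFKC);
        // Keep only forms that fold into exactly one different character.
        if ($base !== false && $base !== $form && mb_strlen($base, 'UTF-8') === 1) {
            $variants[$base][] = $form;
        }
    }

    $pattern = '';
    foreach (mb_str_split($word, 1, 'UTF-8') as $ch) {
        $class = array_merge([$ch], $variants[$ch] ?? []);
        $quoted = array_map(fn(string $c): string => preg_quote($c, '/'), $class);
        $pattern .= '[' . implode('', $quoted) . ']';
    }
    return '/' . $pattern . '/u';
}

$base         = "\u{0645}\u{062C}\u{0644}\u{0633}"; // base letters
$presentation = "\u{FEE3}\u{FEA0}\u{FEE0}\u{FEB2}"; // presentation forms

preg_match_all(buildPattern($base), $base . ' ... ' . $presentation, $m);
echo count($m[0]), PHP_EOL; // 2
```

Unlike normalizing the text first, this keeps the matches exactly as they appear in the original string. The word passed in is expected to be written with base letters.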

Upvotes: 1
