Reputation: 10228
Please take a look at this:
As you can see, regex101 finds just one match, but the browser matches two words that look identical. Why can't regex101 match the second word? I need to match both words (or more, if they exist).
Note that this isn't related to the g flag, because I've used it in the fiddle.
Here is the fiddle
Upvotes: 4
Views: 93
Reputation: 48751
Text like this is hard to work with later. With @Wiktor's solution, changing the search word from مجلس to something else, such as احمدی نژاد, means looking up the different representations of every letter again.
That is where normalization comes in handy:
Normalization is a process that involves transforming characters and sequences of characters into a formally-defined underlying representation. This process is most important when text needs to be compared for sorting and searching, but it is also used when storing text to ensure that the text is stored in a consistent representation.
First normalize the input string with Normalizer::normalize(). After that, preg_match_all can safely run over it without any change to the regular expression:
<?php
$text = <<< 'STR'
یک نماینده مجلس عنوان کرد: ﺩﺭ ﺩﻭﺭﻩ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﻣﺮﺩﻡ ﺩﺭ
ﺭﻓﺎﻩ ﺑﻮﺩﻧﺪ !/دولت سابق تنها دولتی که پس از انقلاب به مردم خدمت کرد! ﻳﻚ
ﻧﻤﺎﯾﻨﺪﻩ ﮔﺮﻭﻩ ﭘﺎﻳﺪﺍﺭی دﺭ ﻣﺠﻠﺲ ﺷﻮﺭﺍﯼ ﺍﺳﻼﻣﯽ ﺩﺭ ﭘﺎﺳﺦ ﺑﻪ ﺳﺆﺍﻟﯽ ﺩﺭ ﻣﻮﺭﺩ
ﺑﺎﺯﮔﺸﺖ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﺑﻪ ﻋﺮﺻﻪ ﺍﻧﺘﺨﺎﺑﺎﺕ ﺍﻇﻬﺎﺭ ﺩﺍﺷﺖ : ﻣﺎ ﺍﻣﯿﺪﻭﺍﺭﯾﻢ ﺍﯾﻦ ﺍﺗﻔﺎﻕ
ﺑﯿﻔﺘﺪ ﻭ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﺑﺮﺍﯼ ﺷﺮﮐﺖ ﺩﺭ ﺍﻧﺘﺨﺎﺑﺎﺕ ﺣﺎﺿﺮ ﺷﻮﺩ چرا که دولت وی تنها
دولتی است که پس از انقلاب به مردم خدمت کرده است.
STR;
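// Fold presentation forms into their base letters (requires the intl extension).
$normalizedText = normalizer_normalize($text, Normalizer::NFKC);
// The unchanged pattern now finds both occurrences.
preg_match_all('~مجلس~', $normalizedText, $matches);
print_r($matches);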
Outputs:
Array
(
    [0] => Array
        (
            [0] => مجلس
            [1] => مجلس
        )

)
Note: this requires the intl extension (php_intl.dll on Windows) to be enabled.
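To see why the pattern needs no change after normalization, here is a minimal sketch that normalizes only the presentation-form spelling of the word and compares it with the base-letter spelling. It assumes the intl extension is loaded; the variable names are illustrative and not part of the original answer.

```php
<?php
// Assumes the intl extension is enabled (it provides normalizer_normalize).
$forms = "\u{FEE3}\u{FEA0}\u{FEE0}\u{FEB2}"; // presentation-form spelling
$base  = "\u{0645}\u{062C}\u{0644}\u{0633}"; // base-letter spelling

// NFKC replaces each presentation form with its compatibility equivalent.
$normalized = normalizer_normalize($forms, Normalizer::NFKC);

var_dump($normalized === $base); // bool(true)

// A pattern built from the base letters now matches the normalized text.
$found = preg_match('~' . $base . '~u', $normalized);
var_dump($found); // int(1)
```

The same normalization should be applied to the search word itself if it may come from user input, so that both sides of the comparison are in the same form.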
Upvotes: 3
Reputation: 627190
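The two words are written with different characters that look the same but have different Unicode code points. One spelling uses presentation forms:
\uFEE3\uFEA0\uFEE0\uFEB2
The other uses the base letters (those without ... FORM in the name):
\u0645\u062C\u0644\u0633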
Here are the codes:
FEE3 ARABIC LETTER MEEM INITIAL FORM
0645 ARABIC LETTER MEEM
FEA0 ARABIC LETTER JEEM MEDIAL FORM
062C ARABIC LETTER JEEM
FEE0 ARABIC LETTER LAM MEDIAL FORM
0644 ARABIC LETTER LAM
FEB2 ARABIC LETTER SEEN FINAL FORM
0633 ARABIC LETTER SEEN
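The difference can be checked directly in PHP. The sketch below prints the code points of both spellings; codePoints() is a hypothetical helper name, and the code assumes the mbstring extension (for mb_ord) and PHP 7.4 or later.

```php
<?php
// Return the code points of a UTF-8 string as "U+XXXX" labels.
// codePoints() is a hypothetical helper, not part of the original answer.
function codePoints(string $s): array
{
    // Split into single characters; the u modifier makes PCRE UTF-8 aware.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return array_map(
        fn(string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
        $chars
    );
}

$base  = "\u{0645}\u{062C}\u{0644}\u{0633}"; // base letters
$forms = "\u{FEE3}\u{FEA0}\u{FEE0}\u{FEB2}"; // presentation forms

echo implode(' ', codePoints($base)), PHP_EOL;  // U+0645 U+062C U+0644 U+0633
echo implode(' ', codePoints($forms)), PHP_EOL; // U+FEE3 U+FEA0 U+FEE0 U+FEB2
var_dump($base === $forms);                     // bool(false)
```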
You cannot match both with a literal representation of either word. You need either an alternation of the two (or all) variants, or a character class for each character:
[\x{FEE3}\x{0645}][\x{FEA0}\x{062C}][\x{FEE0}\x{0644}][\x{FEB2}\x{0633}]
See the regex demo.
A PHP demo:
$re = '/[\x{FEE3}\x{0645}][\x{FEA0}\x{062C}][\x{FEE0}\x{0644}][\x{FEB2}\x{0633}]/u';
$str = 'یک نماینده مجلس عنوان کرد: ﺩﺭ ﺩﻭﺭﻩ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﻣﺮﺩﻡ ﺩﺭ ﺭﻓﺎﻩ ﺑﻮﺩﻧﺪ !/دولت سابق تنها دولتی که پس از انقلاب به مردم خدمت کرد! ﻳﻚ ﻧﻤﺎﯾﻨﺪﻩ ﮔﺮﻭﻩ ﭘﺎﻳﺪﺍﺭی دﺭ ﻣﺠﻠﺲ ﺷﻮﺭﺍﯼ ﺍﺳﻼﻣﯽ ﺩﺭ ﭘﺎﺳﺦ ﺑﻪ ﺳﺆﺍﻟﯽ ﺩﺭ ﻣﻮﺭﺩ ﺑﺎﺯﮔﺸﺖ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﺑﻪ ﻋﺮﺻﻪ ﺍﻧﺘﺨﺎﺑﺎﺕ ﺍﻇﻬﺎﺭ ﺩﺍﺷﺖ : ﻣﺎ ﺍﻣﯿﺪﻭﺍﺭﯾﻢ ﺍﯾﻦ ﺍﺗﻔﺎﻕ ﺑﯿﻔﺘﺪ ﻭ ﺍﺣﻤﺪﯼﻧﮋﺍﺩ ﺑﺮﺍﯼ ﺷﺮﮐﺖ ﺩﺭ ﺍﻧﺘﺨﺎﺑﺎﺕ ﺣﺎﺿﺮ ﺷﻮﺩ چرا که دولت وی تنها دولتی است که پس از انقلاب به مردم خدمت کرده است.';
preg_match_all($re, $str, $matches);
print_r($matches[0]);
Output:
Array
(
    [0] => مجلس
    [1] => ﻣﺠﻠﺲ
)
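The alternation variant mentioned above could look like the following sketch. The sample string is shortened to just the two spellings, and the variable names are illustrative.

```php
<?php
// Alternation of the two complete spellings instead of per-letter classes.
$re = '/\x{0645}\x{062C}\x{0644}\x{0633}|\x{FEE3}\x{FEA0}\x{FEE0}\x{FEB2}/u';

// Shortened sample: the base-letter spelling, then the presentation-form one.
$str = "\u{0645}\u{062C}\u{0644}\u{0633} - \u{FEE3}\u{FEA0}\u{FEE0}\u{FEB2}";

$count = preg_match_all($re, $str, $matches);
echo $count, PHP_EOL; // 2
```

The alternation matches only these two exact spellings, whereas the character-class pattern also accepts words that mix base letters and presentation forms.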
Upvotes: 1