PHP: Hebrew letters comparison

Question

I am trying to split a Hebrew word into letters and get the ID of each corresponding symbol. I have set the UTF-8 header and checked that the files are actually encoded as UTF-8. But for some reason PHP does not compare the symbols correctly and does not return the required symbol IDs, even though printing the $text array shows the letters correctly. I have an array of Hebrew letters:

$id_symbols = array(
    280=>'א‬',
    281=>'בּ‬',
    282=>'ב‬',
    283=>'ג‬',
    284=>'ד‬',
    285=>'ה‬',
    286=>'ו‬',
    287=>'ז‬',
    288=>'ח‬',
    289=>'ט‬',
    290=>'י‬',
    291=>'כּ‬',
    292=>'כ‬',
    293=>'ךּ‬',
    294=>'ך‬',
    295=>'ל‬',
    296=>'מ‬',
    297=>'ם‬',
    298=>'נ‬',
    299=>'ן‬',
    300=>'ס‬',
    301=>'ע‬',
    302=>'פּ‬',
    303=>'פ‬',
    304=>'ף‬',
    305=>'צ‬',
    306=>'ץ‬',
    307=>'ק‬',
    308=>'ר‬',
    309=>'שׁ‬',
    310=>'שׂ‬',
    311=>'תּ‬',
    312=>'ת‬',
);

I send a post request to a page like this:

header('Content-type: text/html; charset=utf-8');

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL,"http://pr.animizer.net/word-api.php");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS,
            "api_key=some_key&text=מילה&font=arial&font_size=30&fore_color=000000&back_color=FFFFFF&template=1,2,3&speed=4");

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$server_output = curl_exec($ch);

curl_close ($ch);

exit($server_output);

Having received the POST request, I try to get the key of each corresponding Hebrew letter:

function mb_str_split($string) {
    $strlen = mb_strlen($string);
    while ($strlen) {
        $array[] = mb_substr($string,0,1,"UTF-8");
        $string = mb_substr($string,1,$strlen,"UTF-8");
        $strlen = mb_strlen($string);
    }
    return $array;
}

$text = mb_str_split($_POST['text']); //splitting text into symbols

foreach($text as $t){

    foreach($id_symbols as $key=>$value){
        if($value == $t){
            $word[] = $key;
        }
    }

}



print_r($word);

and the output is

Array
(
)

P.S. I tried Russian letters in the same way, in the same files, and they work fine, so the problem does not look like an encoding issue.

Justin T. · Accepted Answer

As @Rei pointed out in his answer, there is an issue with your current symbols array. After trimming the symbols, I noticed that the seven (7) values that still had more than one character consisted of a standard letter plus one of three point characters:

HEBREW POINT DAGESH OR MAPIQ (ּ)
HEBREW POINT SHIN DOT (ׁ)
HEBREW POINT SIN DOT (ׂ)
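
To see this kind of hidden content for yourself, a hex dump of a single array value is usually enough. The snippet below is only a sketch: it assumes PHP 7+ (for the "\u{...}" escapes) and uses a made-up sample value, an alef followed by U+202C (POP DIRECTIONAL FORMATTING), since I can't say for certain which invisible character ended up in your file.

```php
<?php
// Hypothetical sample: an alef followed by an invisible U+202C
// (POP DIRECTIONAL FORMATTING). The stray character in the real
// array may be a different one.
$value = "\u{05D0}\u{202C}";

// The hex dump exposes bytes that do not show up on screen.
echo bin2hex($value), "\n";            // d790e280ac
echo mb_strlen($value, 'UTF-8'), "\n"; // 2

// Remove Unicode format characters (category Cf), which includes U+202C.
$clean = preg_replace('/\p{Cf}+/u', '', $value);

echo bin2hex($clean), "\n";            // d790
var_dump($clean === "\u{05D0}");       // bool(true)
```

If the cleaned value still has more than one character, what is left should be a letter plus one of the three points listed above.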

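I wrote some code that converts the Hebrew characters into their decimal code point values (the numbers used in HTML numeric entities). The text is split from the last character to the first, so a point is read before the letter it belongs to; when one of the point values is encountered, it is combined with the next character in the array to match one of your symbols. The following code is working well for me: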

<?php
// Returns the decimal Unicode code point of the first UTF-8 character in $c.
function _uniord($c) {
    if ($c === null || $c === '')
        return 0;
    if (ord($c[0]) >= 0 && ord($c[0]) <= 127)
        return ord($c[0]);
    if (ord($c[0]) >= 192 && ord($c[0]) <= 223)
        return (ord($c[0])-192)*64 + (ord($c[1])-128);
    if (ord($c[0]) >= 224 && ord($c[0]) <= 239)
        return (ord($c[0])-224)*4096 + (ord($c[1])-128)*64 + (ord($c[2])-128);
    if (ord($c[0]) >= 240 && ord($c[0]) <= 247)
        return (ord($c[0])-240)*262144 + (ord($c[1])-128)*4096 + (ord($c[2])-128)*64 + (ord($c[3])-128);
    if (ord($c[0]) >= 248 && ord($c[0]) <= 251)
        return (ord($c[0])-248)*16777216 + (ord($c[1])-128)*262144 + (ord($c[2])-128)*4096 + (ord($c[3])-128)*64 + (ord($c[4])-128);
    if (ord($c[0]) >= 252 && ord($c[0]) <= 253)
        return (ord($c[0])-252)*1073741824 + (ord($c[1])-128)*16777216 + (ord($c[2])-128)*262144 + (ord($c[3])-128)*4096 + (ord($c[4])-128)*64 + (ord($c[5])-128);
    if (ord($c[0]) >= 254 && ord($c[0]) <= 255)    //  error
        return FALSE;
    return 0;
}

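// Splits the string into characters, last character first.
// Named differently from the built-in mb_str_split() (PHP 7.4+) to avoid a redeclaration error.
function mb_str_split_reverse($string) {
    $array = array();
    $strlen = mb_strlen($string, "UTF-8");
    while ($strlen) {
        $array[] = mb_substr($string,-1,1,"UTF-8");
        $string = mb_substr($string,0,$strlen-1,"UTF-8");
        $strlen = mb_strlen($string, "UTF-8");
    }
    return $array;
}

$hebrewText = $_POST['text']; // "מילה" used in example

$text = mb_str_split_reverse($hebrewText); //splitting text into symbols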

$word = [];

$lookupChrs = array(
    '1488'=>280,
    '14681489'=>281,
    '1489'=>282,
    '1490'=>283,
    '1491'=>284,
    '1492'=>285,
    '1493'=>286,
    '1494'=>287,
    '1495'=>288,
    '1496'=>289,
    '1497'=>290,
    '14681499'=>291,
    '1499'=>292,
    '14681498'=>293,
    '1498'=>294,
    '1500'=>295,
    '1502'=>296,
    '1501'=>297,
    '1504'=>298,
    '1503'=>299,
    '1505'=>300,
    '1506'=>301,
    '14681508'=>302,
    '1508'=>303,
    '1507'=>304,
    '1510'=>305,
    '1509'=>306,
    '1511'=>307,
    '1512'=>308,
    '14731513'=>309,
    '14741513'=>310,
    '14681514'=>311,
    '1514'=>312
    );

while ($text) {
    $lookupChr = (string)_uniord(array_shift($text));
    //handle points (two characters instead of one)
    if ($text && ($lookupChr == "1468" || $lookupChr == "1473" || $lookupChr == "1474")) {
        //point detected, combine with next character (its base letter)
        $lookupChr .= _uniord(array_shift($text));
    }
    //skip anything that is not in the lookup table
    if (isset($lookupChrs[$lookupChr])) {
        $word[] = $lookupChrs[$lookupChr];
    }
}

print_r($word);

//OUTPUT:
//    Array
//    (
//        [0] => 285
//        [1] => 295
//        [2] => 290
//        [3] => 296
//    )
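
If you would rather keep the IDs in reading order and avoid the numeric conversion, another option is to split the text into a letter plus any points attached to it and look that string up directly. This is only a sketch, not a drop-in replacement: split_letters() and $symbolIds are names I made up, the table is shortened to a few entries, and it assumes PHP 7+ and a symbols array that has already been cleaned of stray invisible characters.

```php
<?php
// Hypothetical helper: split a UTF-8 string into base characters, each
// followed by any combining marks (\p{M}) attached to it.
function split_letters($string) {
    preg_match_all('/\P{M}\p{M}*/u', $string, $matches);
    return $matches[0];
}

// Shortened lookup table keyed by symbol; the IDs are the ones from the question.
$symbolIds = array(
    "\u{05D1}\u{05BC}" => 281, // bet with dagesh
    "\u{05D1}"         => 282, // bet
    "\u{05D4}"         => 285, // he
    "\u{05D9}"         => 290, // yod
    "\u{05DC}"         => 295, // lamed
    "\u{05DE}"         => 296, // mem
);

$word = array();
// "מילה" written with escapes: mem, yod, lamed, he
foreach (split_letters("\u{05DE}\u{05D9}\u{05DC}\u{05D4}") as $letter) {
    if (isset($symbolIds[$letter])) {
        $word[] = $symbolIds[$letter];
    }
}

print_r($word); // IDs in reading order: 296, 290, 295, 285
```

Because a letter and its point stay together in one array element, no special handling for the three point characters is needed in the loop.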
