Reputation: 761
I am trying to split a Hebrew word into letters and get the index of each corresponding symbol. I have set the UTF-8 header and checked that the files are actually encoded as UTF-8. But for some reason PHP does not compare the symbols correctly and does not return the required symbol ID, although printing the $text array shows the letters correctly. I have an array of Hebrew letters:
$id_symbols = array(
280=>'א',
281=>'בּ',
282=>'ב',
283=>'ג',
284=>'ד',
285=>'ה',
286=>'ו',
287=>'ז',
288=>'ח',
289=>'ט',
290=>'י',
291=>'כּ',
292=>'כ',
293=>'ךּ',
294=>'ך',
295=>'ל',
296=>'מ',
297=>'ם',
298=>'נ',
299=>'ן',
300=>'ס',
301=>'ע',
302=>'פּ',
303=>'פ',
304=>'ף',
305=>'צ',
306=>'ץ',
307=>'ק',
308=>'ר',
309=>'שׁ',
310=>'שׂ',
311=>'תּ',
312=>'ת',
);
I send a post request to a page like this:
header('Content-type: text/html; charset=utf-8');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,"http://pr.animizer.net/word-api.php");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS,
"api_key=some_key&text=מילה&font=arial&font_size=30&fore_color=000000&back_color=FFFFFF&template=1,2,3&speed=4");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($ch);
curl_close ($ch);
exit($server_output);
Having received the POST request, I try to get the key of each corresponding Hebrew letter:
function mb_str_split($string) {
$strlen = mb_strlen($string);
while ($strlen) {
$array[] = mb_substr($string,0,1,"UTF-8");
$string = mb_substr($string,1,$strlen,"UTF-8");
$strlen = mb_strlen($string);
}
return $array;
}
$text = mb_str_split($_POST['text']); //splitting text into symbols
foreach($text as $t){
foreach($id_symbols as $key=>$value){
if($value == $t){
$word[] = $key;
}
}
}
print_r($word);
and the output is
Array
(
)
P.S. I tried processing Russian letters the same way in the same files and they work fine, so the encoding does not look like the problem.
Upvotes: 0
Views: 777
Reputation: 846
As @Rei has pointed out in his answer, there is an issue with your current symbols array. After trimming the symbols, I noticed that the seven (7) values that had more than one character consisted of a standard character and one of three point characters:
U+05BC (decimal 1468, dagesh)
U+05C1 (decimal 1473, shin dot)
U+05C2 (decimal 1474, sin dot)
I wrote some code that converts the Hebrew characters into their decimal code point values (the numbers used in numeric HTML entities). If one of the point values is encountered, it is combined with the next character in the array to match one of your symbols. The following code is working well for me:
<?php
// Return the decimal Unicode code point of the first UTF-8 character in $c.
// Uses $c[0] rather than $c{0}: curly-brace offsets were removed in PHP 8.
function _uniord($c) {
    if (ord($c[0]) >= 0 && ord($c[0]) <= 127)
        return ord($c[0]);
    if (ord($c[0]) >= 192 && ord($c[0]) <= 223)
        return (ord($c[0])-192)*64 + (ord($c[1])-128);
    if (ord($c[0]) >= 224 && ord($c[0]) <= 239)
        return (ord($c[0])-224)*4096 + (ord($c[1])-128)*64 + (ord($c[2])-128);
    if (ord($c[0]) >= 240 && ord($c[0]) <= 247)
        return (ord($c[0])-240)*262144 + (ord($c[1])-128)*4096 + (ord($c[2])-128)*64 + (ord($c[3])-128);
    if (ord($c[0]) >= 248 && ord($c[0]) <= 251)
        return (ord($c[0])-248)*16777216 + (ord($c[1])-128)*262144 + (ord($c[2])-128)*4096 + (ord($c[3])-128)*64 + (ord($c[4])-128);
    if (ord($c[0]) >= 252 && ord($c[0]) <= 253)
        return (ord($c[0])-252)*1073741824 + (ord($c[1])-128)*16777216 + (ord($c[2])-128)*262144 + (ord($c[3])-128)*4096 + (ord($c[4])-128)*64 + (ord($c[5])-128);
    if (ord($c[0]) >= 254 && ord($c[0]) <= 255) // error
        return FALSE;
    return 0;
}
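// Split a UTF-8 string into characters, last character first, so that a point
// (which follows its letter in the string) is seen before the letter it modifies.
// Named differently from the built-in mb_str_split() of PHP 7.4+ to avoid a redeclaration error.
function mb_str_split_reversed($string) {
    $array = [];
    $strlen = mb_strlen($string, "UTF-8");
    while ($strlen) {
        $array[] = mb_substr($string, -1, 1, "UTF-8");
        $string = mb_substr($string, 0, $strlen - 1, "UTF-8");
        $strlen = mb_strlen($string, "UTF-8");
    }
    return $array;
}
$hebrewText = $_POST['text']; // "מילה" used in example
$text = mb_str_split_reversed($hebrewText); // splitting text into symbols (in reverse order)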
$word = [];
$lookupChrs = array(
'1488'=>280,
'14681489'=>281,
'1489'=>282,
'1490'=>283,
'1491'=>284,
'1492'=>285,
'1493'=>286,
'1494'=>287,
'1495'=>288,
'1496'=>289,
'1497'=>290,
'14681499'=>291,
'1499'=>292,
'14681498'=>293,
'1498'=>294,
'1500'=>295,
'1502'=>296,
'1501'=>297,
'1504'=>298,
'1503'=>299,
'1505'=>300,
'1506'=>301,
'14681508'=>302,
'1508'=>303,
'1507'=>304,
'1510'=>305,
'1509'=>306,
'1511'=>307,
'1512'=>308,
'14731513'=>309,
'14741513'=>310,
'14681514'=>311,
'1514'=>312
);
// Consume $text from the front until it is empty.
while ($text) {
    $lookupChr = (string)_uniord(array_shift($text));
    // handle points (two characters instead of one)
    if ($text && ($lookupChr == "1468" || $lookupChr == "1473" || $lookupChr == "1474")) {
        // point detected, combine with the next character (the letter it belongs to)
        $lookupChr .= _uniord(array_shift($text));
    }
    // skip anything that is not in the lookup table
    if (isset($lookupChrs[$lookupChr])) {
        $word[] = $lookupChrs[$lookupChr];
    }
}
print_r($word);
//OUTPUT:
// Array
// (
// [0] => 285
// [1] => 295
// [2] => 290
// [3] => 296
// )
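A different way to keep a letter and its point together, swapped in here for comparison and not the approach used in the code above, is to split the text into grapheme clusters with the PCRE `\X` escape. This is only a minimal sketch: `split_graphemes()` is a hypothetical helper name, the three-entry `$id_symbols` excerpt is illustrative, and it assumes the table values have already been cleaned of any trailing invisible characters.

```php
<?php
// Hypothetical helper: split a UTF-8 string into grapheme clusters.
// In PCRE, \X matches a base character plus any combining marks after it.
function split_graphemes($string) {
    preg_match_all('/\X/u', $string, $matches);
    return $matches[0];
}

// Illustrative excerpt of the symbol table (assumed to be already cleaned).
$id_symbols = array(
    281 => "\u{05D1}\u{05BC}", // bet + dagesh
    282 => "\u{05D1}",         // bet
    290 => "\u{05D9}",         // yod
);

// Flip the table once so each cluster can be looked up directly.
$ids = array_flip($id_symbols);

$word = array();
// Sample input: bet + dagesh, yod, bet (in reading order).
foreach (split_graphemes("\u{05D1}\u{05BC}\u{05D9}\u{05D1}") as $cluster) {
    if (isset($ids[$cluster])) {
        $word[] = $ids[$cluster];
    }
}
print_r($word);
```

Unlike the reversed split above, this keeps the IDs in reading order. A cluster that carries additional marks, such as vowel points, would not match any table entry and is skipped.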
Upvotes: 1
Reputation: 6363
The problem with your code is the symbols array.
The final part of your code tries to match 1 symbol (character) to the elements in $id_symbols.
The problem is none of those elements are 1 symbol.
They are either 2 or 3 symbols each and therefore they will never match.
This code will show you.
foreach($id_symbols as $key => $value) {
echo $key.' '.$value.' '.json_encode($value)."\n";
}
Output:
280 א "\u05d0\u202c"
281 בּ "\u05d1\u05bc\u202c"
282 ב "\u05d1\u202c"
283 ג "\u05d2\u202c"
284 ד "\u05d3\u202c"
285 ה "\u05d4\u202c"
286 ו "\u05d5\u202c"
287 ז "\u05d6\u202c"
288 ח "\u05d7\u202c"
289 ט "\u05d8\u202c"
290 י "\u05d9\u202c"
291 כּ "\u05db\u05bc\u202c"
292 כ "\u05db\u202c"
293 ךּ "\u05da\u05bc\u202c"
294 ך "\u05da\u202c"
295 ל "\u05dc\u202c"
296 מ "\u05de\u202c"
297 ם "\u05dd\u202c"
298 נ "\u05e0\u202c"
299 ן "\u05df\u202c"
300 ס "\u05e1\u202c"
301 ע "\u05e2\u202c"
302 פּ "\u05e4\u05bc\u202c"
303 פ "\u05e4\u202c"
304 ף "\u05e3\u202c"
305 צ "\u05e6\u202c"
306 ץ "\u05e5\u202c"
307 ק "\u05e7\u202c"
308 ר "\u05e8\u202c"
309 שׁ "\u05e9\u05c1\u202c"
310 שׂ "\u05e9\u05c2\u202c"
311 תּ "\u05ea\u05bc\u202c"
312 ת "\u05ea\u202c"
There should be only one \u escape (one code point) in each value, but all of them have 2 or 3.
First problem: they are all terminated by \u202c.
The solution for this problem is easy: just remove them.
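Removing them could look roughly like this. It is a minimal sketch, assuming PHP 7 or later for the `"\u{...}"` string escape; the two-entry array is a hypothetical excerpt of `$id_symbols`, and `$clean_symbols` is a name introduced here for illustration.

```php
<?php
// Hypothetical excerpt of $id_symbols; each value still ends with the
// invisible U+202C character that json_encode() revealed.
$id_symbols = array(
    280 => "\u{05D0}\u{202C}",         // alef + U+202C
    281 => "\u{05D1}\u{05BC}\u{202C}", // bet + dagesh + U+202C
);

// Strip every U+202C from every value. With a single input array,
// array_map() preserves the keys.
$clean_symbols = array_map(function ($symbol) {
    return str_replace("\u{202C}", '', $symbol);
}, $id_symbols);

// Print the cleaned values in the same format as the check above.
foreach ($clean_symbols as $key => $value) {
    echo $key . ' ' . json_encode($value) . "\n";
}
```

Retyping the letters in the source file without the stray character fixes the problem at its origin; the runtime cleanup is only a stopgap.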
Second problem: even after removing all the \u202c, there are still 7 elements that are 2 symbols wide.
They are 281, 291, 293, 302, 309, 310, 311.
The solution for this problem: they must be replaced with their single symbol versions.
For instance, the element at index 293 is \u05da\u05bc and it can be replaced with \ufb3a.
See https://codepoints.net/U+FB3A
I trust you can deal with the remaining 6 symbols.
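For reference, a sketch of how all seven replacements might be written as a lookup table. The code points are the precomposed forms from Unicode's Alphabetic Presentation Forms block; `$precomposed` and `to_precomposed()` are hypothetical names, and the sketch assumes PHP 7 or later for the `"\u{...}"` escape.

```php
<?php
// Two-symbol sequence (letter + point) => single precomposed symbol.
// Comments give the $id_symbols index each entry corresponds to.
$precomposed = array(
    "\u{05D1}\u{05BC}" => "\u{FB31}", // 281: bet with dagesh
    "\u{05DB}\u{05BC}" => "\u{FB3B}", // 291: kaf with dagesh
    "\u{05DA}\u{05BC}" => "\u{FB3A}", // 293: final kaf with dagesh
    "\u{05E4}\u{05BC}" => "\u{FB44}", // 302: pe with dagesh
    "\u{05E9}\u{05C1}" => "\u{FB2A}", // 309: shin with shin dot
    "\u{05E9}\u{05C2}" => "\u{FB2B}", // 310: shin with sin dot
    "\u{05EA}\u{05BC}" => "\u{FB4A}", // 311: tav with dagesh
);

// Hypothetical helper: replace every known two-symbol sequence in $text.
function to_precomposed($text, array $map) {
    return strtr($text, $map);
}

// Example: final kaf + dagesh becomes the single symbol U+FB3A.
echo json_encode(to_precomposed("\u{05DA}\u{05BC}", $precomposed)) . "\n";
```

The incoming text probably needs the same conversion before it is split, because a sender may transmit the two-symbol form, which would not match the single-symbol table entries.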
Upvotes: 2