Reputation: 133
I have to read a file and identify its decoding type, I used mb_detect_encoding()
to detect utf-16
but am getting wrong result.. how can i detectutf-16
encoding type in php.
Php file is utf-16 and my header was windows-1256 ( because of Arabic)
header('Content-Type: text/html; charset=windows-1256');
$delimiter = '\t';
$f= file("$fileName");
foreach($f as $dailystatmet)
{
$transactionData = str_replace("'", '', $dailystatmet);
preg_match_all("/('?\d+,\d+\.\d+)?([a-zA-Z]|[0-9]|)[^".$delimiter."]+/",$transactionData,$matches);
array_push($matchesz, $matches[0]);
}
$searchKeywords = array ("apple", "orange", 'mango');
$rowCount = count($matchesz);
for ($row = 1; $row <= $rowCount; $row++) {
$myRow = $row;
$cell = $matchesz[$row];
foreach ($searchKeywords as $val) {
if (partialArraySearch($cell[$c_description], $val)) {
}
}}
function partialArraySearch($cell, $searchword)
{
if (strpos(strtoupper($cell), strtoupper($searchword)) !== false) {
return true;
}
return false;
}
Above code is for search with in the uploaded file.. if the file was in utf-8 then match was getting but when same file with utf-16 or utf-32 am not getting the result..
so how can i get the encoding type of uploaded file ..
Upvotes: 1
Views: 1697
Reputation: 429
My solution is to detect UTF-16 and convert the code in Latin 15 is
preg_match_all('/\x00/',$content,$count);
if(count($count[0])/strlen($content)>0.4) {
$content = iconv('UTF-16', 'ISO-8859-15', $content);
}
In other words i check the frequency of the hexadecimal character 00. If it is higher than 0.4 probably the text contains characters in the base set encoded in UTF-16. This means two bytes for character but usually the second byte is 00.
Upvotes: 1
Reputation: 728
If someone is still searching for a solution, I have hacked something like this in the "voku/portable-utf8" repo on github. => "UTF8::file_get_contents()"
The "file_get_contents"-wrapper will detect the current encoding via "UTF8::str_detect_encoding()" and will convert the content of the file automatically into UTF-8.
e.g.: from the PHPUnit tests ...
$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16pe.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);
$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16le.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);
Upvotes: 1