Arshad KM

Reputation: 133

How can I detect UTF-16 encoding in PHP?

I have to read a file and identify its encoding. I used mb_detect_encoding() to detect UTF-16, but I am getting the wrong result. How can I detect UTF-16 encoding in PHP?

The file is UTF-16, and my header is windows-1256 (because of the Arabic content):

header('Content-Type: text/html; charset=windows-1256');

$delimiter = '\t'; // literal backslash-t; the regex engine reads it as a tab
$f = file($fileName);
$matchesz = array(); // must exist before array_push() is called

foreach ($f as $dailystatmet) {
    $transactionData = str_replace("'", '', $dailystatmet);
    preg_match_all("/('?\d+,\d+\.\d+)?([a-zA-Z]|[0-9]|)[^" . $delimiter . "]+/", $transactionData, $matches);

    array_push($matchesz, $matches[0]);
}

$searchKeywords = array("apple", "orange", 'mango');

$rowCount = count($matchesz);

// $row must stay below $rowCount to remain a valid index
for ($row = 1; $row < $rowCount; $row++) {
    $myRow = $row;
    $cell = $matchesz[$row];

    foreach ($searchKeywords as $val) {
        if (partialArraySearch($cell[$c_description], $val)) {

        }
    }
}

function partialArraySearch($cell, $searchword)
{
    // Case-insensitive substring check on the raw bytes
    if (strpos(strtoupper($cell), strtoupper($searchword)) !== false) {
        return true;
    }

    return false;
}

The code above searches within the uploaded file. If the file is UTF-8, the matches are found, but with the same file in UTF-16 or UTF-32 I get no results.

So how can I get the encoding type of the uploaded file?

Upvotes: 1

Views: 1697

Answers (2)

caiofior

Reputation: 429

My solution for detecting UTF-16 and converting the content to ISO-8859-15 is:

  $length = strlen($content);
  // Share of 0x00 bytes in the raw content; the length check avoids dividing by zero.
  if ($length > 0 && substr_count($content, "\x00") / $length > 0.4) {
     $content = iconv('UTF-16', 'ISO-8859-15', $content);
  }

In other words, I check the frequency of the byte 0x00. If it makes up more than 40% of the content, the text probably consists mostly of basic (ASCII or Latin-1) characters encoded in UTF-16: each such character takes two bytes, and one of the two is 0x00.
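The same heuristic can also hint at the byte order, because the position of the 0x00 bytes differs between UTF-16BE and UTF-16LE. The following is only a sketch: guessUtf16ByNulBytes() is a hypothetical helper name (not a PHP built-in), the 0.4 threshold is taken from the check above, and the guess is unreliable for text that is mostly non-Latin (Arabic, for example, rarely produces 0x00 bytes in UTF-16).

```php
<?php
// Hypothetical helper, not part of PHP: guesses UTF-16 byte order from
// where the 0x00 bytes fall. Returns 'UTF-16BE', 'UTF-16LE', or null.
function guessUtf16ByNulBytes(string $content, float $threshold = 0.4): ?string
{
    $length = strlen($content);
    if ($length < 2) {
        return null; // too short to say anything
    }

    $evenNuls = 0; // 0x00 at offsets 0, 2, 4, ... (high byte first: big-endian)
    $oddNuls  = 0; // 0x00 at offsets 1, 3, 5, ... (low byte first: little-endian)
    for ($i = 0; $i < $length; $i++) {
        if ($content[$i] === "\x00") {
            if ($i % 2 === 0) {
                $evenNuls++;
            } else {
                $oddNuls++;
            }
        }
    }

    if (($evenNuls + $oddNuls) / $length <= $threshold) {
        return null; // too few 0x00 bytes to look like UTF-16
    }

    return $evenNuls > $oddNuls ? 'UTF-16BE' : 'UTF-16LE';
}

// Usage: build a UTF-16LE sample, guess its encoding, convert it to UTF-8.
$sample   = iconv('UTF-8', 'UTF-16LE', 'apple orange mango');
$encoding = guessUtf16ByNulBytes($sample);
if ($encoding !== null) {
    $sample = iconv($encoding, 'UTF-8', $sample);
}
```

Passing an explicit 'UTF-16LE' or 'UTF-16BE' to iconv() avoids relying on how a bare 'UTF-16' is interpreted when the content has no byte-order mark.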

Upvotes: 1

Lars Moelleken

Reputation: 728

If someone is still searching for a solution, I have hacked together something like this in the "voku/portable-utf8" repo on GitHub: "UTF8::file_get_contents()".

This "file_get_contents" wrapper detects the current encoding via "UTF8::str_detect_encoding()" and automatically converts the content of the file to UTF-8.

For example, from the PHPUnit tests:

$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16pe.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);

$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16le.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);
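For readers who cannot add a dependency, the detect-then-convert idea can be roughly approximated in plain PHP by reading the byte-order mark (BOM). This is a sketch under assumptions, not the library's implementation: convertToUtf8ByBom() is a hypothetical name, the file is assumed to start with a BOM, and content without one is returned unchanged.

```php
<?php
// Hypothetical helper: reads a byte-order mark (BOM), if present,
// strips it, and returns the remaining text converted to UTF-8.
function convertToUtf8ByBom(string $content): string
{
    // UTF-32 comes first: its little-endian BOM begins with the UTF-16LE BOM.
    $boms = [
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFE\xFF"         => 'UTF-16BE',
        "\xFF\xFE"         => 'UTF-16LE',
    ];

    foreach ($boms as $bom => $encoding) {
        if (strncmp($content, $bom, strlen($bom)) === 0) {
            $body = substr($content, strlen($bom));
            return $encoding === 'UTF-8' ? $body : iconv($encoding, 'UTF-8', $body);
        }
    }

    return $content; // no BOM found: leave the bytes untouched
}

// Usage: a UTF-16LE string with its BOM becomes searchable UTF-8.
$raw  = "\xFF\xFE" . iconv('UTF-8', 'UTF-16LE', 'apple orange mango');
$text = convertToUtf8ByBom($raw);
$found = strpos(strtoupper($text), 'MANGO') !== false;
```

One known ambiguity: UTF-16LE content whose first character after the BOM is U+0000 would be taken for UTF-32LE, which is rare in text files but worth keeping in mind.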

Upvotes: 1
