Reputation: 183

Almost the same code, but different output, why?

I'm working with UTF-8 encoded text files, and I can't find a proper solution.

After I couldn't solve a problem with strings, I'm now trying fgetc(), but it doesn't work either. This code:

$file = fopen("t1.txt","r+");
while (! feof ($file))
{
  $c= fgetc($file);
  echo $c;
  //echo "\t";
}
fclose($file);

works fine and outputs: abcd абвқ efg

But if I uncomment the //echo "\t" line, it no longer works and outputs: � � � a b c d � � � � � � � � e f g

Why? How can I fix it?

Upvotes: 2

Answers (2)

Esailija

Reputation: 140220

You are reading the file one byte at a time.

For example the character б encodes as bytes 0xD0 0xB1 in UTF-8. The tab character is 0x09.

So without the tab character, you first write 0xD0, then 0xB1, resulting in 0xD0 0xB1 which is valid UTF-8.

With the tab character, you write 0x09 between every byte - making it: 0xD0 0x09 0xB1. 0xD0 followed by 0x09 is not valid UTF-8, so the browser renders the replacement character to deal with it.
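To see this for yourself, here is a minimal sketch that builds both byte sequences and checks them. The variable names are only for illustration, and it uses preg_match() with the u modifier as a quick validity check, since that fails on invalid UTF-8:

```php
<?php
// "б" (U+0431) written out as its two UTF-8 bytes.
$char = "\xD0\xB1";
$hex = bin2hex($char);
echo $hex, PHP_EOL;                 // d0b1

// Put a tab after every byte, which is what the loop in the question does.
$broken = implode("\t", str_split($char)) . "\t";
echo bin2hex($broken), PHP_EOL;     // d009b109

// With the u modifier, preg_match() returns false for invalid UTF-8 input.
$validBefore = preg_match('//u', $char) === 1;    // true
$validAfter  = preg_match('//u', $broken) === 1;  // false
var_dump($validBefore, $validAfter);
```

The second string still contains every original byte, but the lead byte 0xD0 is no longer followed by a continuation byte, so it stops being valid UTF-8.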

You need to keep all the bytes of a character together and only write the tab after the last one; this should work:

$file = fopen("t1.txt", "r"); // read-only access is enough here

// fgetc() returns false at end of file, so test its result instead of feof()
while (($c = fgetc($file)) !== false)
{
  $val = ord($c);

  // The lead byte tells how many continuation bytes follow (0 for ASCII)
  $continuationByteCount = 0;
  if (($val & 0xF8) == 0xF0) $continuationByteCount = 3;
  else if (($val & 0xF0) == 0xE0) $continuationByteCount = 2;
  else if (($val & 0xE0) == 0xC0) $continuationByteCount = 1;

  // Write the first byte of the character
  echo $c;

  // Write the remaining bytes of the same character before the separator
  while ($continuationByteCount-- > 0) {
    echo fgetc($file);
  }

  echo "\t";
}

fclose($file);

Alternatively, read the whole file at once and split it into an array where each item is one character (1-4 bytes):

$chars = preg_split( '//u', file_get_contents("t1.txt"), -1, PREG_SPLIT_NO_EMPTY );

foreach( $chars as $char ) {
    echo $char;
    echo "\t";
}
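
If the mbstring extension is available (an assumption, and mb_str_split() needs PHP 7.4 or later), the same split can probably be written without a regular expression. This is only a sketch; the sample string stands in for the contents of t1.txt:

```php
<?php
// Sample text standing in for file_get_contents("t1.txt") (an assumption).
$text = "abcd абвқ efg";

// Split into characters instead of bytes; requires mbstring, PHP 7.4+.
$chars = mb_str_split($text, 1, 'UTF-8');

// A tab between characters keeps the output valid UTF-8,
// because no multi-byte character is split apart.
$out = implode("\t", $chars);
echo $out, PHP_EOL;
```

Like the preg_split() version, this loads the whole file into memory, so the fgetc() loop is the better fit for very large files.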

Upvotes: 3

Philipp

Reputation: 15629

I think this might be a problem with how the browser detects the encoding. You can try sending the charset in a header:

<?php
header('Content-type: text/html; charset=utf-8');
?>

Or set the meta tag in the page:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Upvotes: 0
