Reputation: 183
I'm working with UTF-8 encoded text files and can't find a proper solution.
After I couldn't solve the problem with strings, I'm trying fgetc() now, but it doesn't work either. This code:
$file = fopen("t1.txt","r+");
while (! feof ($file))
{
$c= fgetc($file);
echo $c;
//echo "\t";
}
fclose($file);
works fine and outputs: abcd абвқ efg. But if I uncomment the //echo "\t" line, it no longer works and outputs: � � � a b c d � � � � � � � � e f g
Why does this happen, and how can I fix it?
Upvotes: 2
Views: 119
Reputation: 140220
You are reading the file one byte at a time, but UTF-8 encodes every non-ASCII character as a sequence of several bytes.
For example, the character б encodes as the bytes 0xD0 0xB1 in UTF-8. The tab character is 0x09.
So without the tab character, you first write 0xD0, then 0xB1, resulting in 0xD0 0xB1, which is valid UTF-8.
With the tab character, you write 0x09 between every byte, making it 0xD0 0x09 0xB1. 0xD0 followed by 0x09 is not valid UTF-8, so the browser renders the replacement character (�) to deal with it.
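To check the byte values above yourself, here is a minimal sketch. It is not part of the original answer; the helper name isValidUtf8 is made up for illustration, and it relies on the fact that preg_match() with the /u modifier returns false when the subject is not valid UTF-8.

```php
<?php
// Hypothetical helper (not a built-in): true if $s is valid UTF-8.
// With the /u modifier, preg_match() returns false on malformed UTF-8.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

$char = "\u{0431}";                     // the letter "б" (code point escape, PHP 7.0+)
$hex  = bin2hex($char);                 // hex dump of its UTF-8 bytes

$intact = $char[0] . $char[1];          // 0xD0 0xB1, the two bytes back to back
$broken = $char[0] . "\t" . $char[1];   // 0xD0 0x09 0xB1, a tab between the bytes

echo $hex, "\n";
var_dump(isValidUtf8($intact));
var_dump(isValidUtf8($broken));
```

The second var_dump() corresponds to what the question's loop produces once the tab is echoed after every byte.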
You need to be more sophisticated about it; this should work:
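$file = fopen("t1.txt", "r");
// fgetc() returns false at end of file, so test its result instead of feof()
while (($c = fgetc($file)) !== false)
{
    $val = ord($c);
    // High bit set: this byte belongs to a multi-byte UTF-8 sequence
    if ($val & 0x80) {
        // The top bits of the lead byte give the number of continuation bytes
        $continuationByteCount = 0;
        if (($val & 0xF8) == 0xF0) $continuationByteCount = 3;
        else if (($val & 0xF0) == 0xE0) $continuationByteCount = 2;
        else if (($val & 0xE0) == 0xC0) $continuationByteCount = 1;
        echo $c;
        // Echo the rest of the character before any separator is written
        while ($continuationByteCount--) {
            echo fgetc($file);
        }
    }
    else { // Single-byte UTF-8 unit, i.e. ASCII
        echo $c;
    }
    echo "\t";
}
fclose($file);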
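If you want the characters as strings rather than echoing them right away, the same lead-byte logic can be wrapped in a function. This is only a sketch under my own assumptions: readUtf8Char is a hypothetical name, the input is assumed to be well-formed UTF-8, and a php://memory stream stands in for t1.txt so the snippet runs without a file.

```php
<?php
// Hypothetical helper: read one UTF-8 character from an open stream.
// Returns the character as a string, or false at end of file.
function readUtf8Char($handle)
{
    $lead = fgetc($handle);
    if ($lead === false) {
        return false;
    }

    // Same bit tests as in the loop above: the lead byte encodes the length.
    $val = ord($lead);
    if (($val & 0xF8) === 0xF0) {
        $extra = 3;
    } elseif (($val & 0xF0) === 0xE0) {
        $extra = 2;
    } elseif (($val & 0xE0) === 0xC0) {
        $extra = 1;
    } else {
        $extra = 0; // ASCII byte (or a stray continuation byte, passed through as-is)
    }

    $char = $lead;
    while ($extra-- > 0) {
        $next = fgetc($handle);
        if ($next === false) {
            break; // input ended in the middle of a character
        }
        $char .= $next;
    }
    return $char;
}

// In-memory stream used here instead of t1.txt.
$stream = fopen('php://memory', 'r+');
fwrite($stream, "ab\u{0431}\u{0432}"); // "ab" followed by "бв"
rewind($stream);

$chars = [];
while (($char = readUtf8Char($stream)) !== false) {
    $chars[] = $char;
}
fclose($stream);

echo implode("\t", $chars), "\n";
```

Collecting the characters first and joining them with implode() also avoids the trailing tab that the echo loop leaves after the last character.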
Alternatively, read the whole file at once and split it into an array in which each item is one character (1-4 bytes):
$chars = preg_split( '//u', file_get_contents("t1.txt"), -1, PREG_SPLIT_NO_EMPTY );
foreach( $chars as $char ) {
echo $char;
echo "\t";
}
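To make the "1-4 bytes" part concrete, here is a small sketch that runs the same preg_split() call on a literal string instead of the file. The sample string is my own choice; it contains one character of each UTF-8 length.

```php
<?php
// Sample text (an assumption, not from the question):
// "a" (U+0061), "б" (U+0431), "€" (U+20AC), "😀" (U+1F600)
$text = "a\u{0431}\u{20AC}\u{1F600}";

// Same split as above: the /u modifier makes the empty pattern match
// between characters rather than between bytes.
$chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

// strlen() counts bytes, so this gives the encoded size of each character.
$lengths = array_map('strlen', $chars);

foreach ($chars as $i => $char) {
    echo $char, "\t", $lengths[$i], " byte(s)\n";
}
```

Note that preg_split() returns false if the input is not valid UTF-8, so real code should check the result before looping over it.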
Upvotes: 3
Reputation: 15629
I think this might be a problem with how the browser detects the encoding. You can try sending the charset explicitly:
<?php
header('Content-type: text/html; charset=utf-8');
?>
Or set the meta tag in the HTML:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Upvotes: 0