Nail

Reputation: 143

Correctly convert string to UTF-8 via PHP

I have a file, test.HI0, with the following content:

 11/08/2015 00:05:50»ЦО Ворота выход»Дверь не открыта»24001695»Бахром Суннатуллоевич Тургунов»99»»»
 11/08/2015 00:05:54»ЦО Ворота выход»Верный доступ»24001215»Шохрух Джохонгирович Исламов»99»»»

If I run the Linux command file -i test.HI0, I get this:

test.HI0: text/plain; charset=iso-8859-1

If I convert this file with the PHP functions iconv or mb_convert_encoding:

$file_content = file( "test.HI0" );

// For example, take one line from the file
$str = iconv( "ISO-8859-1", "UTF-8", $file_content[2] );
var_dump( $str );

$str2 = mb_convert_encoding( $file_content[2], "UTF-8", "ISO-8859-1" );
var_dump( $str2 );

I get the same result:

 string(159) " 11/08/2015 00:05:45»ÖÎ Âîðîòà âûõîä»Âåðíûé äîñòóï»24001695»Áàõðîì Ñóííàòóëëîåâè÷ Òóðãóíîâ»99»»» "

If I just output the file content in the browser:

echo '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />';
$file_content = file( "test.HI0" );

echo $file_content[2];

I see this:

11/08/2015 00:07:17��� 2 ����������� �������24001066��������� ���������� �������99���

How do I correctly display or save this text in UTF-8?
Thanks in advance.

UPD.

Thanks to all. I found another solution; it looks ugly, but it works.

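$file_content = file( "test.HI0" );

// First step: treat the bytes as ISO-8859-1 (this is what produces the garbled text)
$str = iconv( "ISO-8859-1", "UTF-8", $file_content[2] );

// OR

$str = mb_convert_encoding( $file_content[2], "UTF-8", "ISO-8859-1" );

// Convert back to single bytes, then decode those bytes as windows-1251
$str = iconv( 'utf-8', 'windows-1252', $str );
$str = iconv( 'windows-1251', 'utf-8', $str );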

var_dump( $str );


UPD 2.

I was wrong to rely on file -i to detect the file encoding.
As it turned out, my file is encoded in windows-1251:

chardet /home/file.HI0
/home/file.HI0: windows-1251 (confidence: 0.75)

or, as @yangsunny advised, enca:

enca -L ru /home/file.HI0
MS-Windows code page 1251

In the end, this code can be used:

$file_content = file( "test.HI0" );

$str2 = mb_convert_encoding( $file_content[2], "UTF-8", "windows-1251" );
var_dump( $str2 );
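
To save the text rather than dump a single line, a minimal sketch for converting a whole file might look like the following. The helper name convert_file_to_utf8 and the temporary sample file are assumptions made for illustration, not part of the original code, and the sketch assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: read a file in a known single-byte encoding
// and write a UTF-8 copy. Returns false if reading or writing fails.
function convert_file_to_utf8(string $src, string $dst, string $from = 'Windows-1251'): bool
{
    $raw = file_get_contents($src);
    if ($raw === false) {
        return false;
    }
    $utf8 = mb_convert_encoding($raw, 'UTF-8', $from);
    return file_put_contents($dst, $utf8) !== false;
}

// Throwaway sample: the bytes of "Верный доступ" as stored in windows-1251.
$src = tempnam(sys_get_temp_dir(), 'hi0');
$dst = $src . '.utf8';
file_put_contents($src, "\xC2\xE5\xF0\xED\xFB\xE9 \xE4\xEE\xF1\xF2\xF3\xEF\n");

convert_file_to_utf8($src, $dst);
echo file_get_contents($dst); // Верный доступ
```

The source encoding is passed explicitly because, as found above, it cannot be reliably guessed from the bytes alone.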

Thanks to all for the help.

Upvotes: 0

Views: 1687

Answers (1)

Álvaro González

Reputation: 146630

You are doing conversions the right way. The problem is that you don't know the source encoding. For example, think of currency conversion: you can convert £100 or ¥100 to US dollars. But you can't convert just "100".

From Wikipedia (emphasis mine):

ISO/IEC 8859-1:1998 [...] is generally intended for Western European languages (see below for a list).

It's clear that Cyrillic text (Russian, Ukrainian or whatever) cannot be ISO-8859-1, an encoding that only has characters from the Latin alphabet.

Correct text encoding detection is a manual task. If you know for sure the text is Cyrillic, you need to do some research to find out what encodings support Cyrillic and then figure out which one better matches your data. You might need to get actual hexadecimal values. Even then, there's still room for error. For instance, there might be encodings that are identical for 99% of characters but differ for the remaining 1%.
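
As a rough illustration of that manual process, the sketch below prints the hexadecimal values of a short sample and decodes it with a few single-byte Cyrillic encodings so the results can be compared by eye. The sample bytes, the candidate list and the helper name decode_candidates are assumptions chosen for the example; it also assumes the mbstring extension is available.

```php
<?php
// Hypothetical sample: six raw bytes taken from a file of unknown encoding.
$raw = "\xC2\xEE\xF0\xEE\xF2\xE0";

// Inspect the actual byte values first.
echo bin2hex($raw), PHP_EOL; // c2eef0eef2e0

// Candidate encodings that can represent Cyrillic (an illustrative, incomplete list).
$candidates = ['Windows-1251', 'KOI8-R', 'ISO-8859-5', 'CP866'];

// Decode the same bytes under each candidate encoding.
function decode_candidates(string $raw, array $candidates): array
{
    $out = [];
    foreach ($candidates as $enc) {
        $out[$enc] = mb_convert_encoding($raw, 'UTF-8', $enc);
    }
    return $out;
}

// Print every interpretation; a human picks the one that reads as real text.
foreach (decode_candidates($raw, $candidates) as $enc => $text) {
    printf("%-12s %s\n", $enc, $text);
}
```

With these particular bytes, only the Windows-1251 row reads as a real word ("Ворота"). Every candidate still produces some output without raising an error, which is why a conversion that runs cleanly says nothing about whether the source encoding was right.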

Upvotes: 2
