How to fix UTF-8 encoding error with Russian words

Question

My Perl script reads from a text file that contains mainly English ANSI words. Occasionally it also contains Russian words, which I cannot convert back to UTF-8.

See some examples (the words in brackets are the English translations):

Êîìïîíåíò (Component)
Àâòîð (Author)
Ãýíäàëüô (Gandalf)
Äàòà ñîçäàíèÿ (Create date): 20-ìàé(may)-2003
Äàòà êîððåêöèè (Last correction Date): 25-ìàð(mar)-2003
Âåðñèÿ (Version): 0.92
Áëàãîäàðíîñòè (Thanks):
Íîâîå â (New in):
Ïîääåðæêà (Support)
Î÷åíü ìíîãî (Very much)

I've read the UTF-8 Encoding Debugging Chart and also tried the following:

$s='Àâòîð';
from_to($s, "iso-8859-5","utf-8");
print "$s\n";

my $s = Encode::decode( 'iso-8859-5', 'Àâòîð' );
from_to($s, "iso-8859-5","utf-8");
print "$s\n";

I've also tried cp1252 instead of iso-8859-5, but nothing helps. I've also tried Encode::Guess, but the result is not helpful: iso-8859-5 or cp1251 or koi8-r or iso-8859-1.

Any idea how to convert 'Àâòîð' back to the Cyrillic text 'автор'?

Corion · Accepted Answer

After some tries, I get the expected output Автор when switching the (Windows) console code page to 65001 (UTF-8) and decoding the input data from Windows-1251:

perl -MEncode -wle "print encode('UTF-8',decode('Windows-1251',shift))" "Àâòîð"
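
The same conversion can be checked inside a script, independent of the console. This is a minimal sketch, assuming the garbled text arose because Windows-1251 bytes were displayed as Windows-1252; the helper name `repair_mojibake` is made up for illustration and is not part of Encode:

```perl
use strict;
use warnings;
use utf8;                      # assumption: this source file is saved as UTF-8
use Encode qw(encode decode);

# Hypothetical helper: undo "cp1251 bytes shown as cp1252" mojibake.
sub repair_mojibake {
    my ($garbled) = @_;
    # Turn the wrongly displayed characters back into the original bytes ...
    my $octets = encode('cp1252', $garbled);
    # ... then decode those bytes with the encoding they were written in.
    return decode('cp1251', $octets);
}

# Encode to UTF-8 on output so the Cyrillic characters are written correctly.
binmode STDOUT, ':encoding(UTF-8)';
print repair_mojibake('Àâòîð'), "\n";   # prints: Автор
```

Here À is byte 0xC0 in Windows-1252, and byte 0xC0 is А in Windows-1251; the remaining letters map the same way.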

This suggests that the input data is encoded as Windows-1251, and decoding from that should give you the Cyrillic letters in Unicode. To output the data to a file, make sure you either set the encoding when opening the file (easiest) or encode each string to the target encoding on output (hard to keep track of):

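use Encode qw(decode);

my $octets = <$input_file>;

# Decode once: Windows-1251 bytes become Perl characters
my $data = decode('Windows-1251', $octets);
open my $fh, '>:encoding(UTF-8)', $filename
    or die "Couldn't write to $filename: $!";
# The :encoding(UTF-8) layer encodes on output, so print the decoded string as-is
print $fh $data;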
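
The "set the encoding when opening the file" route can be applied to the input handle as well, so that no explicit decode or encode calls are needed. A minimal sketch, assuming the input really is Windows-1251; the sub name and the file names are placeholders, not anything from the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: copy a Windows-1251 text file to a UTF-8 text file.
sub convert_cp1251_to_utf8 {
    my ($in_name, $out_name) = @_;

    # The :encoding layer decodes bytes into characters while reading ...
    open my $in, '<:encoding(cp1251)', $in_name
        or die "Couldn't read $in_name: $!";
    # ... and encodes characters to UTF-8 bytes while writing.
    open my $out, '>:encoding(UTF-8)', $out_name
        or die "Couldn't write to $out_name: $!";

    while (my $line = <$in>) {
        print {$out} $line;
    }

    close $in;
    close $out or die "Couldn't close $out_name: $!";
    return;
}

# Example call; both file names are placeholders.
# convert_cp1251_to_utf8('input.txt', 'output.txt');
```

With both layers in place, every string inside the loop is already decoded text, which avoids the "decoded twice" or "never encoded" mistakes that are easy to make with manual calls.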
