Reputation: 408
My Perl script reads from a text file which contains mainly English ANSI words. But sometimes there are Russian words, which I cannot convert back to UTF-8.
See some examples (the words in brackets are the English translations):
Êîìïîíåíò (Component)
Àâòîð (Author)
Ãýíäàëüô (Gandalf)
Äàòà ñîçäàíèÿ (Create date): 20-ìàé(may)-2003
Äàòà êîððåêöèè (Last correction Date): 25-ìàð(mar)-2003
Âåðñèÿ (Version): 0.92
Áëàãîäàðíîñòè (Thanks):
Íîâîå â (New in):
Ïîääåðæêà (Support)
Î÷åíü ìíîãî (Very much)
I've read the UTF-8 Encoding Debugging Chart and also tried the following:
$s='Àâòîð';
from_to($s, "iso-8859-5","utf-8");
print "$s\n";
my $s = Encode::decode( 'iso-8859-5', 'Àâòîð' );
from_to($s, "iso-8859-5","utf-8");
print "$s\n";
I've also tried cp1252 instead of iso-8859-5, but nothing helps.
I've also tried Encode::Guess, but the result is not helpful: iso-8859-5 or cp1251 or koi8-r or iso-8859-1.
Any idea how to convert 'Àâòîð' back to the Cyrillic text 'автор'?
Upvotes: 0
Views: 6175
Reputation: 1102
Your byte sequence is 0xc0 0xe2 0xf2 0xee 0xf0. This is the Russian word for 'author' in cp1251. The representation you show is what you get when an application assumes the bytes are cp1252. So the question is: which code page do you want, or which code page does your application need?
To read a cp1251 file correctly, use a construction like this:
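The byte-level claim can be checked with a small sketch. It assumes the script itself is saved as UTF-8 (hence "use utf8") and that the garbled text was produced by reading cp1251 bytes as cp1252; the variable names are only illustrative.

```perl
use strict;
use warnings;
use utf8;                         # the literal below is UTF-8 source text
use Encode qw(encode decode);

my $garbled = 'Àâòîð';

# Undo the wrong cp1252 interpretation to recover the original bytes
my $bytes = encode('cp1252', $garbled);
my $hex   = join ' ', map { sprintf '0x%02x', $_ } unpack 'C*', $bytes;

# Reinterpret the same bytes as cp1251
my $text = decode('cp1251', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "$hex\n";
print "$text\n";
```

If the assumption holds, the first line printed is the byte sequence quoted above and the second is the Cyrillic word.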
open (my $tmp_h,"<:encoding(cp-1251)", $ARGV[0]) or die $!;
That tells Perl which code page your file uses. When you then read the file into a string, Perl converts the values from cp1251 to its internal form (UTF-8), and you can use these strings as you want without any problems.
For strings in the internal form Perl sets the UTF8 flag, which you can check with the Devel::Peek module.
I think using the internal form also lets every string operation work correctly and helps avoid mistakes.
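A minimal sketch of that read path, assuming a throwaway sample file created with File::Temp. It uses the built-in utf8::is_utf8 as a lighter-weight check than Devel::Peek; the file contents are just the five bytes from the question.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a sample file holding the cp1251 bytes for the word in the question
my ($out, $path) = tempfile();
binmode $out;
print {$out} "\xc0\xe2\xf2\xee\xf0\n";
close $out or die "Cannot close $path: $!";

# The :encoding layer decodes cp1251 bytes into characters while reading
open my $in, '<:encoding(cp1251)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;
unlink $path;

binmode STDOUT, ':encoding(UTF-8)';
printf "characters: %d, UTF8 flag: %s\n",
    length($line), (utf8::is_utf8($line) ? 'on' : 'off');
print "$line\n";
```

Because the string holds characters rather than bytes, length counts five letters instead of five or ten octets.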
I would recommend the "use utf8" pragma in your source code. With it, all literals in the source code are treated as UTF-8 and automatically converted into the internal form. That means the source file itself must be saved as UTF-8 (ideally with a BOM, because detecting a BOM is usually the first thing IDEs and editors do). You can then open other files in any encoding with the "<:encoding(....)" construction, or get data from the web or from databases, and again make sure the data was converted into the internal form correctly by checking the UTF8 flag. With all this in place, you can handle all the data in one uniform way: compare strings correctly, use regular expressions, and so on.
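As a rough illustration of what the pragma changes, the sketch below assumes the script is saved as UTF-8. The byte string is produced with Encode to show what the same literal would look like without character semantics.

```perl
use strict;
use warnings;
use utf8;    # literals from here on are decoded from UTF-8 into characters
use Encode qw(encode);

my $chars = 'автор';                    # five characters
my $bytes = encode('UTF-8', $chars);    # the same word as raw UTF-8 octets

printf "as characters: length %d\n", length($chars);
printf "as bytes:      length %d\n", length($bytes);

# Regular expressions see whole letters only on the character string
print "character string matches \\w+\n" if $chars =~ /\A\w+\z/;
```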
Upvotes: 1
Reputation: 3925
After some tries, I get the expected output Автор when switching the (Windows) console code page to 65001 (UTF-8) and decoding the input data from Windows-1251:
perl -MEncode -wle "print encode('UTF-8',decode('Windows-1251',shift))" "Àâòîð"
This suggests that the input data is encoded as Windows-1251, and decoding from that should give you the Cyrillic letters in Unicode. To output the data to a file, make sure you either set the encoding when opening the file (easiest) or encode each string to the target encoding on output (hard to keep track of):
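use Encode qw(decode);
# Read raw bytes and decode them from Windows-1251 into Perl characters
my $octets = <$input_file>;
my $data = decode('Windows-1251', $octets);
# The :encoding layer encodes the characters as UTF-8 on output
open my $fh, '>:encoding(UTF-8)', $filename
    or die "Couldn't write to $filename: $!";
print $fh $data;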
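Putting both ends together, a whole file can be converted by letting the I/O layers do the decoding and encoding. This is only a sketch: the helper name convert_cp1251_to_utf8 and the command-line usage are hypothetical, and it assumes the entire input really is Windows-1251.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $src (Windows-1251) to $dst (UTF-8)
sub convert_cp1251_to_utf8 {
    my ($src, $dst) = @_;
    open my $in, '<:encoding(Windows-1251)', $src
        or die "Couldn't read $src: $!";
    open my $out, '>:encoding(UTF-8)', $dst
        or die "Couldn't write to $dst: $!";
    while (my $line = <$in>) {
        print {$out} $line;    # decoded on input, encoded on output
    }
    close $in;
    close $out or die "Couldn't close $dst: $!";
    return;
}

# Usage: perl convert.pl input.txt output.txt
convert_cp1251_to_utf8(@ARGV) if @ARGV == 2;
```

With layers on both handles there is no explicit decode or encode call left to forget or to apply twice.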
Upvotes: 2