Polar Bear

Reputation: 6798

Perl: How to work with Cyrillic encoding in Windows 10?

I am trying to figure out how, in Windows 10, a Perl script can read a command-line argument containing Cyrillic text (console code page cp437) and store it in a text file encoded as UTF-8.

In the console, the chcp command reports code page 437.

A search on Stack Overflow returned several questions of a similar nature. I have tried to apply what I learned from those posts, but without success.

An example demonstrating how to do this would be greatly appreciated.

NOTE: Converting console input (cp437) to output (cp1251) is purely a demonstration of what is involved and how it is done properly.

UPDATE: cp437 does not include Cyrillic characters. Perl uses the ANSI system calls [CreateFileA], so Cyrillic characters cannot be passed into Perl without an additional workaround. The default code page on my system is cp1252, which does not cover Cyrillic characters either.

Upvotes: 0

Views: 522

Answers (1)

ikegami

Reputation: 385917

The command line can be obtained from the OS using the "ANSI" interface or using the "Wide" interface.

The ANSI interface uses text encoded using the active code page.

The Wide interface uses text encoded using UTF-16LE.
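To make the difference between the two interfaces concrete, here is a minimal sketch that encodes the same string both ways and dumps the bytes. It assumes cp1251 as the active code page purely for illustration (the real one varies by system), and hex_dump is a hypothetical helper, not part of any module.

```perl
use strict;
use warnings;
use Encode qw( encode );

# The Cyrillic word "Привет", written with escapes so the source stays ASCII.
my $text = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";

# What an "ANSI" call would hand over if the active code page were cp1251
# (an assumption for this sketch).
my $ansi_bytes = encode('cp1251', $text);

# What a "Wide" call would hand over.
my $wide_bytes = encode('UTF-16LE', $text);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_dump { join ' ', map { sprintf '%02X', $_ } unpack 'C*', $_[0] }

print 'cp1251:   ', hex_dump($ansi_bytes), "\n";
print 'UTF-16LE: ', hex_dump($wide_bytes), "\n";
```

The code page form uses one byte per character here, while the UTF-16LE form uses two, and only the latter can hold characters outside the chosen code page.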

Perl uses the ANSI interface (though you could access the Wide interface through Win32::API, for example), so the elements of @ARGV are encoded using the active code page and need to be decoded:

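use strict;
use warnings;

use Encode qw( decode );
use Win32  qw( );

# Hypothetical output path; the original snippet left $qfn undefined.
my $qfn = 'args.txt';

# Name of the active (ANSI) code page, e.g. "cp1252".
my $acp = "cp".Win32::GetACP();

# Decode each argument from the active code page into Perl characters.
@ARGV = map { decode($acp, $_) } @ARGV;

open(my $fh, '>:encoding(UTF-8)', $qfn)
   or die("Can't create \"$qfn\": $!\n");

print($fh "$_\n") for @ARGV;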
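The decode-then-write pipeline can also be tried without a Windows console. The following is a rough sketch under stated assumptions: the active code page is pretended to be cp1251, the raw argument bytes are hard-coded rather than taken from the command line, and the output goes to a temporary file instead of a named one.

```perl
use strict;
use warnings;
use Encode     qw( decode encode );
use File::Temp qw( tempfile );

# Assumption: the active code page is cp1251 and the ANSI interface
# delivered these raw bytes for the argument "Привет".
my $acp      = 'cp1251';
my @raw_args = ( "\xCF\xF0\xE8\xE2\xE5\xF2" );

# Decode the bytes into Perl character strings.
my @args = map { decode($acp, $_) } @raw_args;

# Write the decoded arguments to a temporary file as UTF-8.
my ($fh, $qfn) = tempfile(UNLINK => 1);
binmode($fh, ':encoding(UTF-8)');
print($fh "$_\n") for @args;
close($fh) or die("Can't close \"$qfn\": $!\n");

# Read the file back as raw bytes to inspect what was stored.
open(my $in, '<:raw', $qfn) or die("Can't open \"$qfn\": $!\n");
my $stored = do { local $/; <$in> };
close($in);

print length($stored), " bytes stored in $qfn\n";
```

Each Cyrillic letter occupies one byte in cp1251 but two bytes in UTF-8, so the stored file is larger than the original argument.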

It's important to note that the encoding used by the console (as shown by chcp) is not the same as the active code page. What this means is that @ARGV can only contain characters that are in both the OEM code page (the encoding used by the console) and the active code page (the encoding used by the ANSI interface).
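This limitation can be modelled with Encode alone. The sketch below is an approximation of the rule stated above, not a reproduction of what Windows does internally; representable and survives_ansi_interface are hypothetical helper names, and the code page pairs are example values.

```perl
use strict;
use warnings;
use Encode qw( encode FB_CROAK );

# Hypothetical helper: true if every character of $text exists in $cp.
sub representable {
    my ($text, $cp) = @_;
    my $copy = $text;    # encode() with a CHECK flag may modify its input
    return eval { encode($cp, $copy, FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: models the rule that an argument must fit in both
# the OEM code page and the active code page.
sub survives_ansi_interface {
    my ($text, $oem, $acp) = @_;
    return representable($text, $oem) && representable($text, $acp);
}

my $cyrillic = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";    # "Привет"

for my $pair ( [ 'cp866', 'cp1251' ], [ 'cp437', 'cp1252' ] ) {
    my ($oem, $acp) = @$pair;
    printf "OEM %s / ACP %s: %s\n", $oem, $acp,
        survives_ansi_interface($cyrillic, $oem, $acp) ? 'ok' : 'lost';
}
```

With the asker's combination (cp437 and cp1252), neither code page contains Cyrillic letters, which matches the behaviour described in the question's update.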

To remove this limitation, one would use the Wide interface of the system call to get the arguments from the command line (GetCommandLineW) and the Wide interface of the system call to parse the command line (CommandLineToArgvW). This would provide the arguments no matter what encoding the console uses. With code page 65001 being used in the console, this allows any Unicode character to be used in arguments.

This page contains Perl code to make those system calls.



Upvotes: 1
