ssr1012

Reputation: 2589

How to open a file containing CJK characters in Perl

How can I open a file that contains CJK characters using a Perl script?

use utf8;
use open ':encoding(utf8)';
binmode STDOUT, ':utf8';

I am using the code above to open the file, but I can't find the CJK characters; the output is empty. The input file contains this text:

\para[para]Details of the electronic structure of the metal the loss to
electron-hole-pair excitations \characters{刘安雯} only 
depends weakly on the metal. The metals exhibit large variation in the 
work function, yet the translational in elasticity is similar in all 
cases. This suggests electron transfer forming a transient 
H\textsuperscript{{\textminus}} is not important. The simulation allows 
us to construct \characters{胡水明} a universal sticking 
function for H and D on metals, which depends only on the H atom 
incidence translational energy and incidence angle as well as the mass of 
the solid's atoms.\endp

I am searching for the characters this way:

while($str=m/(\p{InCJK_Unified_Ideographs})/xg)
{
     print "Char: --> $&\n";
}

Could someone point out what I am doing wrong in my code? Thanks.

Updated:

I don't know why, but this program works fine and prints the CJK characters:

use utf8;

my $str = "\characters{刘安雯胡水明}";

while($str=~m/(\p{InCJK_Unified_Ideographs}){1,}/xg) {  print ":: $&\n";  }

Upvotes: 0

Views: 67

Answers (1)

choroba

Reputation: 241808

use utf8;

This line tells Perl that the source code contains UTF-8, so it's not related to reading from a file.
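To see that use utf8 only affects how literals in the source are parsed, here is a minimal sketch, assuming the script itself is saved as UTF-8. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# Both literals consist of the same bytes in this UTF-8 encoded source file.
my $decoded = do { use utf8; '刘安雯' };   # parsed as 3 characters
my $raw     = do { no utf8;  '刘安雯' };   # left as 9 undecoded UTF-8 bytes

printf "with use utf8: %d, without: %d\n", length($decoded), length($raw);
```

Neither form has any effect on data read from a file; that is controlled by the I/O layers on the filehandle.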

use open ':encoding(utf8)';

This is equivalent to

use open IO => ':encoding(utf8)';

which sets the default encoding for input and output handles opened within its scope, but it doesn't change the encoding of the already open standard streams (STDIN, STDOUT, STDERR). To do so, you need to add :std:

use open IO => ':utf8', ':std';

The last line shown,

binmode STDOUT, ':utf8';

sets the encoding for STDOUT, which would already be covered by the use open line if it included :std.
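One way to check what :std did is to list the layers on the standard handles. This is only a sketch; the exact layer names printed depend on the Perl build, and the :encoding(UTF-8) spelling is an assumption chosen here rather than the :utf8 used above.

```perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));   # defaults for open() plus STDIN/STDOUT/STDERR

# PerlIO::get_layers is built in; it returns the names of the layers on a handle.
my @in_layers  = PerlIO::get_layers(STDIN);
my @out_layers = PerlIO::get_layers(STDOUT, output => 1);

print "STDIN:  @in_layers\n";
print "STDOUT: @out_layers\n";
```

If an encoding layer shows up in both lists, a plain while (<STDIN>) loop will receive decoded characters and print will encode them on the way out.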

You didn't show how you opened the file. If you used <> or readline without specifying a filehandle, you need to set the encoding for the standard input as shown above. If you used a filehandle, I'm out of ideas - it works for me.
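For the filehandle case, a minimal sketch of an explicit open with an encoding layer might look like the following. The sample text, the temporary file and the helper name cjk_runs_in_file are assumptions made for illustration, since the question does not show how the file was opened. The match is bound with =~, as in the updated snippet from the question; the first loop in the question uses =, which matches against $_ rather than $str, and that may be why it printed nothing.

```perl
use strict;
use warnings;
use utf8;                              # the literals below are UTF-8 source text
use File::Temp qw(tempfile);

binmode STDOUT, ':encoding(UTF-8)';    # encode what we print

# Return every run of CJK ideographs found in the file at $path (hypothetical helper).
sub cjk_runs_in_file {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my @runs;
    while (my $line = <$in>) {
        # =~ binds the match to $line; a plain = would match against $_ instead.
        while ($line =~ /(\p{InCJK_Unified_Ideographs}+)/g) {
            push @runs, $1;
        }
    }
    close $in;
    return @runs;
}

# Hypothetical sample input, written to a temporary file as UTF-8.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':encoding(UTF-8)';
print {$tmp} "excitations \\characters{刘安雯} only\n";
print {$tmp} "construct \\characters{胡水明} a universal\n";
close $tmp;

print "Char: --> $_\n" for cjk_runs_in_file($path);
```

Passing the layer directly to open keeps the decoding independent of any use open pragma in effect elsewhere in the program.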

Upvotes: 1
