Match Chinese character in Perl

Question

I know this question has been asked before. I did check all the previous answers but still could not solve my problem. Please pardon me for the apparently duplicate question.

I'm writing a Perl program to process a Chinese text file. I want to recognize lines containing Chinese text and exclude everything else, such as lines in English or other languages and URLs. With "use utf8", neither "$line =~ /(\p{Han}+)/" nor "$line =~ /信息/" matches anything. Without "use utf8", "$line =~ /信息/" works but "$line =~ /(\p{Han}+)/" still does not. I checked the file's encoding with "file -bi input.txt", which reports "text/plain; charset=utf-8". Here is the code:

$|=1;
use strict;
use utf8;

my $in = $ARGV[0];

sub main {

    open(IN, "$in") or die "can't open $in\n";

    while (my $line = <IN>) {
        chomp($line);

        if ($line =~ /(\p{Han}+)/) {
            print "chinese: $line\n";
        }

        if ($line =~ /信息/) {
            print "$line\n";
        }

    } # end while

    close(IN);
}

main();

Thank you in advance for any help and advice!

user149341 · Accepted Answer

You need to open the file as UTF-8:

open(IN, "<:encoding(UTF-8)", $in) or die "can't open $in\n";

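Otherwise each line is read as a string of raw bytes rather than decoded characters, which isn't what you want. \p{Han} only matches decoded characters, and under "use utf8" the literal 信息 in your pattern is a character string, so it never matches the undecoded bytes. (The parentheses around the arguments to open matter too: without them, "or die" binds to $in rather than to the result of open.)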
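To see the difference in isolation, here is a minimal sketch. It assumes a few made-up sample lines held in memory in place of input.txt, and it adds a binmode call on STDOUT, which is not in the original code but avoids "Wide character in print" warnings once the lines are decoded. The variable names ($bytes, @chinese, $raw_matches) are only for illustration.

```perl
use strict;
use warnings;
use utf8;                               # string literals in this source are UTF-8
use Encode qw(encode);

# Assumption: output should be UTF-8 too, so encode decoded strings on print.
binmode(STDOUT, ":encoding(UTF-8)");

# Hypothetical sample data standing in for input.txt, stored as raw UTF-8 bytes.
my $bytes = encode("UTF-8", "hello world\n这是一条信息\nhttp://example.com\n");

# Matching against the raw bytes fails: each byte is treated as its own
# character, and none of those is a Han character.
my $raw_matches = ($bytes =~ /\p{Han}/) ? 1 : 0;

# Reading the same bytes through a decoding layer yields real characters.
open(my $fh, "<:encoding(UTF-8)", \$bytes)
    or die "can't open in-memory file: $!\n";

my @chinese;
while (my $line = <$fh>) {
    chomp($line);
    push @chinese, $line if $line =~ /\p{Han}/;   # keep lines with any Han character
}
close($fh);

print "raw bytes matched: $raw_matches\n";
print "chinese: $_\n" for @chinese;
```

With a real file, replace the in-memory handle with open(my $fh, "<:encoding(UTF-8)", $in). Note that /\p{Han}/ keeps any line containing at least one Han character; if lines mixing Chinese with other text should be excluded, a stricter pattern would be needed.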
