querystack

Reputation: 35

Match Chinese character in Perl

I know this question has been asked before. I did check all the previous answers but still could not solve my problem. Please pardon me for the apparently duplicate question.

I'm writing a Perl program to process a Chinese text file. I want to recognize lines containing Chinese text and exclude everything else, such as English or other languages and URLs. With "use utf8", neither "$line =~ /(\p{Han}+)/" nor "$line =~ /信息/" matches anything. Without "use utf8", "$line =~ /信息/" works but "$line =~ /(\p{Han}+)/" still does not. I checked the file encoding with "file -bi input.txt", which reports "text/plain; charset=utf-8". Here is the code:

$|=1;
use strict;
use utf8;

my $in = $ARGV[0];

sub main {

    open(IN, "$in") or die "can't open $in\n";

    while (my $line=<IN>) {
        chomp($line);

        if ($line =~ /(\p{Han}+)/) {
            print "chinese: $line\n";
        }

        if ($line =~ /信息/) {
            print "$line\n";
        }

    } # end while

    close(IN);
}

main();

Thank you in advance for any help and advice!

Upvotes: 1

Views: 2545

Answers (2)

user149341

Reputation:

You need to open the file as UTF-8:

open IN, "<:encoding(UTF-8)", $in or die "can't open $in\n";

Otherwise it's read as a byte string, which isn't what you want.
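
Put together, the loop from the question might look like the sketch below once the decoding layer is in place. The helper name han_lines is hypothetical and only used for illustration. The binmode call on STDOUT is an addition that assumes the output goes to a UTF-8 terminal or file; without an output layer, printing decoded text typically triggers "Wide character in print" warnings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;   # only matters if the script itself contains non-ASCII literals such as /信息/

# Hypothetical helper: return the lines of a UTF-8 file that contain
# at least one Han character.
sub han_lines {
    my ($path) = @_;

    # The :encoding(UTF-8) layer decodes bytes into characters while reading,
    # so \p{Han} is tested against real code points rather than raw bytes.
    open(my $fh, '<:encoding(UTF-8)', $path) or die "can't open $path: $!\n";

    my @matches;
    while (my $line = <$fh>) {
        chomp $line;
        push @matches, $line if $line =~ /\p{Han}/;
    }
    close($fh);
    return @matches;
}

# Run as a script only when a file name is given on the command line.
if (@ARGV) {
    # Assumption: the output should be UTF-8 as well.
    binmode(STDOUT, ':encoding(UTF-8)');
    my $in = shift @ARGV;
    print "chinese: $_\n" for han_lines($in);
}
```

Saved under any name, for example han.pl, it can be run as "perl han.pl input.txt". Lines without Han characters (English text, URLs) are simply skipped.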

Upvotes: 8

Casimir et Hippolyte

Reputation: 89557

You must use the u modifier if you want the regex engine to treat your string as a Unicode string:

/(\p{Han}+)/u

Upvotes: -2
