Reputation: 35
I know this question has been asked before. I did check all the previous answers but still could not solve my problem. Please pardon me for the apparently duplicate question.
I'm writing a Perl program to process a Chinese text file. I want to keep the lines that contain Chinese text and exclude everything else, such as lines in English or other languages and URLs. With "use utf8" and "$line =~ /(\p{Han}+)/", nothing matches. With "use utf8" and "$line =~ /信息/", nothing matches either. Without "use utf8", "$line =~ /信息/" works, but "$line =~ /(\p{Han}+)/" still does not. I checked the file's encoding with "file -bi input.txt", which reports "text/plain; charset=utf-8". Here is the code:
$|=1;
use strict;
use utf8;
my $in = $ARGV[0];
sub main {
    open(IN, "$in") or die "can't open $in\n";
    while (my $line=<IN>) {
        chomp($line);
        if ($line =~ /(\p{Han}+)/ ) {
            print "chinese: $line\n";
        }
        if ($line =~ /信息/) {
            print "$line\n";
        }
    } # end while
    close(IN);
}
Thank you in advance for any help and advice!
Upvotes: 1
Views: 2545
Reputation:
You need to open the file as UTF-8:
open IN, "<:encoding(UTF-8)", $in or die "can't open $in\n";
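Otherwise each line is read as a string of raw bytes rather than decoded characters, which isn't what you want. \p{Han} is matched against characters, and under "use utf8" the literal 信息 in your source is decoded characters too, so neither pattern can match the undecoded bytes.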
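To see the difference without touching a file, here is a minimal sketch that builds the same line both ways. It assumes the core Encode module is available. The names $bytes and $chars are only illustrative: $bytes stands in for a line read with no encoding layer, and $chars for the same line read through ":encoding(UTF-8)".

```perl
use strict;
use warnings;
use utf8;                     # string literals in this source are decoded characters
use Encode qw(encode decode);

# Stand-in for a line read without an encoding layer: raw UTF-8 bytes.
my $bytes = encode('UTF-8', "信息 info");

# Stand-in for the same line read through :encoding(UTF-8): decoded characters.
my $chars = decode('UTF-8', $bytes);

# Report the length of each string and whether \p{Han} matches it.
printf "bytes: length=%d han=%s\n",
    length($bytes), ($bytes =~ /\p{Han}/ ? 'yes' : 'no');
printf "chars: length=%d han=%s\n",
    length($chars), ($chars =~ /\p{Han}/ ? 'yes' : 'no');
```

The byte string is 11 units long and contains no Han characters as far as the regex engine can tell, while the decoded string is 7 characters long and matches. If you then print decoded lines, you will probably also want binmode(STDOUT, ":encoding(UTF-8)") so the output is encoded again and Perl does not warn about wide characters.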
Upvotes: 8
Reputation: 89557
You must use the u modifier if you want the regex engine to treat your string as a Unicode string:
/(\p{Han}+)/u
Upvotes: -2