Reputation: 42792
I need to match some Chinese characters in a UTF-8 encoded HTML page, and I wrote the test code below:
#! /usr/bin/perl
use strict;
use LWP::UserAgent;
use Encode;
my $ua = new LWP::UserAgent;
my $request = HTTP::Request->new('GET');
my $url = 'http://www.boc.cn/sourcedb/whpj/';
$request->url($url);
my $res = $ua->request($request) ;
my $str_chinese = encode("utf8" ,"英磅" ) ;
# my $str_chinese = "英磅" ;
my $str_english = "English" ;
#my $html = decode("utf8" , $res->content) ;
my $html = $res->content ;
if ( $html =~ /$str_chinese/ ) {
print "chinese word matched\n" ;
}else {
print "chinese word unmatched\n" ;
}
if ( $html =~ /$str_english/i ) {
print "english word matched\n" ;
}else {
print "english word unmatched\n" ;
}
The output shows that the script fails to match the existing Chinese characters embedded in the HTML. Could you give me a hint on how to solve my problem?
Upvotes: 3
Views: 4246
Reputation: 39158
You should use the decoded_content method from the HTTP::Message class instead. Manual decoding is not necessary.
#!/usr/bin/env perl
use utf8;
use strict;
use LWP::UserAgent;
my $html = LWP::UserAgent->new
->get('http://www.boc.cn/sourcedb/whpj/')
->decoded_content;
my $str_chinese = '首页';
my $str_english = 'English';
if ($html =~ /$str_chinese/) {
print "chinese word matched\n";
} else {
print "chinese word unmatched\n";
}
if ($html =~ /$str_english/i) {
print "english word matched\n";
} else {
print "english word unmatched\n";
}
Output:
chinese word matched
english word matched
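To see what decoded_content changes without hitting the network, here is a minimal sketch that builds a response by hand. The response body and headers are made up for illustration (they are not what www.boc.cn actually returns), and it assumes the server declares its charset in the Content-Type header.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                       # literals in this file are character strings
use Encode qw(encode);
use HTTP::Response;

# Hypothetical stand-in for a server reply: a UTF-8 encoded body
# plus a Content-Type header that declares the charset.
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    encode('UTF-8', '<a href="/">首页</a>'),
);

my $raw  = $res->content;          # undecoded bytes, as sent on the wire
my $text = $res->decoded_content;  # Perl characters, decoded per the charset

printf "content length:         %d\n", length $raw;
printf "decoded_content length: %d\n", length $text;

print "chinese word matched\n" if $text =~ /首页/;
```

The raw content is longer than the decoded text because each Chinese character takes three bytes in UTF-8, and only the decoded text can be compared against a character-string pattern.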
Upvotes: 3
Reputation: 127
I ran your code and the Chinese characters were not matched.
Then I checked the HTML: it does not contain these characters, which may be why the match fails. I then tried another character (联) and also removed the encode call,
i.e. my $str_chinese = "联";
With this change, the character is matched.
Upvotes: 4
Reputation: 74232
Since you have added UTF-8 characters in the source code, you have to:
use utf8;
It tells Perl that your script is written in UTF-8.
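As a rough illustration of why this matters for matching, the sketch below compares a character string with its UTF-8 byte encoding. The HTML fragment is invented for the example, and the variable names are only illustrative.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                       # string literals below are decoded as characters
use Encode qw(encode decode);

my $chars = '首页';                      # character string
my $bytes = encode('UTF-8', $chars);     # the same text as UTF-8 bytes

printf "characters: %d\n", length $chars;   # prints "characters: 2"
printf "bytes:      %d\n", length $bytes;   # prints "bytes:      6"

# Made-up page fragment, first as raw bytes (what ->content returns),
# then decoded into characters.
my $raw_html = encode('UTF-8', '<td>首页</td>');
my $html     = decode('UTF-8', $raw_html);

print "decoded text matches character pattern\n"
    if $html =~ /\Q$chars\E/;
print "raw bytes match byte pattern\n"
    if $raw_html =~ /\Q$bytes\E/;
print "raw bytes do not match character pattern\n"
    unless $raw_html =~ /\Q$chars\E/;
```

The pattern and the text have to be in the same form: either both decoded characters or both raw bytes. Mixing the two is what makes an otherwise correct match fail.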
Upvotes: 7