Reputation: 3907
I was writing a web crawler using PERL, and I realized there was a weird behavior when I try to display string using HTML::Entities::decode_entities.
I was handling strings that contain contain Chinese characters and strings like Jìngyè. I used HTML::Entities::decode_entities to decode chinese characters, which works well. However, when the string contain no Chinese characters, the string displayed weirdly (J�ngy�).
I wrote a small code to test different behaviors on 2 strings.
String 1 is "No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466" and string 2 was "104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號".
Below is my code:
print "before: $1\n";
my $decoded = HTML::Entities::decode_entities($1."號");#I add the last character just for testing
print "decoded $decoded\n";
my $chopped = substr($decoded, 0, -1);
print "chopped: $chopped\n";
These are my results:
before: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466
decoded No. 22, Jìngyè 3rd Road, Jhongshan District, Taipei City, Taiwan 10466號 (correct)
chopped: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466 (incorrect)
before: 104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號
decoded 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號號 (correct)
chopped: 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號 (correct)
Can someone please explain me why was this happening? And how to solve this so that my String will display properly.
Thank you very much.
Sorry, I did not make my question clear, below is the code I wrote, where URL is http://maps.google.com/maps/place?cid=10931902633578573013:
sub getInfoURLs {
my ($url) = @_;
unless (defined $url){
print 'URL was not defined when extracting info\n';
return 0;
}
my $contain_request = LWP::UserAgent->new->get($url);
if($contain_request -> is_success){
my $contain_content = $contain_request -> decoded_content;
#store address
if ($contain_content =~ m/$address_pattern/i){
print "before: $1\n";
my $decoded = HTML::Entities::decode_entities($1."號");
print "decoded $decoded\n";
my $chopped = substr($decoded, 0, -1);
print "chopped: $chopped\n";
#unicode conversion
#store in database
}
}
}
Upvotes: 1
Views: 964
Reputation: 385764
First, always use use strict; use warnings;
!!!
The problem is that you're not encoding your output. File handles can only transmit bytes, but you're passing decoded text.
Perl will output UTF-8 (-ish) when you pass something that's obviously wrong. chr(0x865F)
is obviously not a byte, so:
$ perl -we'print "\xE8\x{865F}\n"'
Wide character in print at -e line 1.
è號
But it's not always obvious that something is wrong. chr(0xE8)
could be a byte, so:
$ perl -we'print "\xE8\n"'
�
The process of converting a value into to a series of bytes is called "serialization". The specific case of serializing text is known as character encoding.
Encode's encode
is used to provide character encoding. You can also have encode
called automatically using the open
module.
$ perl -we'use open ":std", ":locale"; print "\xE8\x{865F}\n"'
è號
$ perl -we'use open ":std", ":locale"; print "\xE8\n"'
è
Upvotes: 2