hook38

Reputation: 3907

How to properly display HTML entities in Perl

I was writing a web crawler in Perl, and I noticed odd behavior when displaying strings decoded with HTML::Entities::decode_entities.

I was handling strings that contain Chinese characters and strings like Jìngyè. I used HTML::Entities::decode_entities to decode the Chinese characters, which works well. However, when the string contains no Chinese characters, it is displayed incorrectly (J�ngy�).

I wrote a small script to test the behavior on two strings.

String 1 is "No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466" and string 2 is "104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號".

Below is my code:

print "before: $1\n";
my $decoded = HTML::Entities::decode_entities($1."&#34399");#I add the last character just for testing
print "decoded $decoded\n";
my $chopped = substr($decoded, 0, -1);
print "chopped: $chopped\n";

These are my results:

before: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466

decoded No. 22, Jìngyè 3rd Road, Jhongshan District, Taipei City, Taiwan 10466號 (correct)

chopped: No. 22, J�ngy� 3rd Road, Jhongshan District, Taipei City, Taiwan 10466 (incorrect)

before: 104 Taiwan Taipei City Jhongshan District J�ngy� 3rd Road 20號

decoded 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號號 (correct)

chopped: 104 Taiwan Taipei City Jhongshan District Jìngyè 3rd Road 20號 (correct)

Can someone please explain why this is happening, and how to fix it so that my strings display properly?

Thank you very much.

Sorry, I did not make my question clear. Below is the code I wrote, where the URL is http://maps.google.com/maps/place?cid=10931902633578573013:

sub getInfoURLs {
    my ($url) = @_;
    unless (defined $url) {
        # double quotes so that \n is printed as a newline
        print "URL was not defined when extracting info\n";
        return 0;
    }

    my $contain_request = LWP::UserAgent->new->get($url);
    if ($contain_request->is_success) {
        my $contain_content = $contain_request->decoded_content;

        #store address
        if ($contain_content =~ m/$address_pattern/i) {
            print "before: $1\n";
            my $decoded = HTML::Entities::decode_entities($1."&#34399");
            print "decoded $decoded\n";
            my $chopped = substr($decoded, 0, -1);
            print "chopped: $chopped\n";
            #unicode conversion
            #store in database
        }
    }
}

Upvotes: 1

Views: 964

Answers (1)

ikegami

Reputation: 385764

First, always use use strict; use warnings;!!!

The problem is that you're not encoding your output. File handles can only transmit bytes, but you're passing decoded text.

Perl will output UTF-8 (-ish) when you pass something that's obviously wrong. chr(0x865F) is obviously not a byte, so:

$ perl -we'print "\xE8\x{865F}\n"'
Wide character in print at -e line 1.
è號

But it's not always obvious that something is wrong. chr(0xE8) could be a byte, so:

$ perl -we'print "\xE8\n"'
�
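To see what Perl actually writes in that second case without depending on how a terminal renders it, one option is to print to an in-memory file handle and inspect the result. This is only a sketch; the names $fh and $buffer are made up for illustration.

```perl
use strict;
use warnings;

# In-memory file handle with no encoding layer, standing in for STDOUT.
my $buffer = '';
open(my $fh, '>', \$buffer) or die "Cannot open in-memory handle: $!";

# U+00E8 fits in a byte, so print writes it as the single byte 0xE8.
print {$fh} "\xE8";
close($fh);

printf "bytes written: %d\n", length $buffer;
printf "byte value:    0x%02X\n", ord $buffer;
```

A lone 0xE8 byte is not valid UTF-8, which would explain why a UTF-8 terminal shows a replacement character for it.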

The process of converting a value into a series of bytes is called "serialization". The specific case of serializing text is known as character encoding.
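As a small illustration of character encoding, the sketch below encodes the same two characters used above to UTF-8 and prints the resulting bytes as hex. The variable names are chosen for the example only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two decoded characters: U+00E8 and U+865F.
my $text = "\xE8\x{865F}";

# Character encoding: turn the characters into UTF-8 bytes.
my $bytes = encode('UTF-8', $text);

# Format each byte as two hex digits so the result is terminal-independent.
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);

printf "characters: %d\n", length $text;    # 2
printf "bytes:      %d\n", length $bytes;   # 5
print  "hex:        $hex\n";                # C3 A8 E8 99 9F
```

The two characters become five bytes: two for U+00E8 and three for U+865F.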

Encode's encode is used to provide character encoding. You can also have encode called automatically using the open module.

$ perl -we'use open ":std", ":locale"; print "\xE8\x{865F}\n"'
è號

$ perl -we'use open ":std", ":locale"; print "\xE8\n"'
è
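Applied to the strings from the question, a rough sketch could look like the following. It assumes the decoded address is a character string such as "Jìngyè" followed by the test character U+865F, and it uses an in-memory handle with an explicit :encoding(UTF-8) layer as a stand-in for STDOUT, so the bytes can be checked. The names $out and $buffer are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for the decoded address plus the appended test character
# (assumed to match what decode_entities returned in the question).
my $decoded = "J\xECngy\xE8\x{865F}";
my $chopped = substr($decoded, 0, -1);   # no character above 0xFF remains

# The layer encodes everything printed to the handle as UTF-8.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory handle: $!";
print {$out} "decoded: $decoded\n";
print {$out} "chopped: $chopped\n";
close($out);

# $buffer now holds UTF-8 bytes, so it can be printed as-is.
print $buffer;
```

With the layer in place, both lines are encoded the same way whether or not the string contains a character above 0xFF. For real output, binmode(STDOUT, ':encoding(UTF-8)') applies the same kind of layer to STDOUT.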

Upvotes: 2
