Robie Nayak

Reputation: 827

How to avoid wide characters in LWP::UserAgent?

I am trying to download the contents (formulas) of a web page in Perl. I used the LWP::UserAgent module to fetch the content and took care to check for UTF-8. The code is as follows:

use LWP::UserAgent;
my $ua = new LWP::UserAgent;
my $response = $ua->get('http://www.abc.org/patent/formulae');

my $content =$response->decoded_content();

if (utf8::is_utf8($content))
{
    binmode STDOUT,':utf8';
}
else
{
    binmode STDOUT,':raw';
}

print $content;

But I still get wide characters, and the output is as follows:

"Formula = 
 
 
 Ï
 
 
 Ì
 
 
 â¡
 
 (
 
 
 c
 
 
 +
 
 
 /
 
 
 c
 
 
 0
 
 
 )
 
 â
 1
 "

Whereas I want:

"Formula = Ï Ì â¡ ( c + / c 0 ) â 1 "

How can we avoid that?

Upvotes: 0

Views: 701

Answers (1)

amon

Reputation: 57590

The decoded_content method uses the encoding and charset information in the HTTP headers to decode your data. However, the HTML document itself may declare a different encoding.

If you want your output to be UTF-8, you should always apply the :utf8 layer. What you are trying to do with is_utf8 is wrong.
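As a minimal sketch of applying the layer unconditionally: the snippet below writes through an encoding layer to an in-memory buffer so the resulting bytes can be counted. The sample string and the variable names are made up for illustration; with real output you would put the same layer on STDOUT instead.

```perl
use strict;
use warnings;

# $content stands in for the decoded page text (hypothetical sample data).
my $content = "Formula = \x{3C3}(c)";

# Always write through an encoding layer. An in-memory handle is used here
# only so that the encoded bytes can be inspected afterwards.
open my $out, '>:encoding(UTF-8)', \my $buffer
    or die "Cannot open in-memory handle: $!";
print {$out} $content;
close $out;

# The sigma is one character but two bytes once encoded.
printf "%d characters became %d bytes\n", length($content), length($buffer);
# prints "14 characters became 15 bytes"
```

For the terminal, the equivalent would be binmode STDOUT, ':encoding(UTF-8)' with no condition around it.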

Perl stores strings internally in one of two encodings. This is irrelevant to you as the programmer. The is_utf8 function merely reads the internal flag that records which representation is in use. An unset flag does not mean that every code point in your string fits in a single byte once the string is encoded as UTF-8.
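A small sketch of why the flag is no guide, using only core modules. The two sample strings are invented for the demonstration: the first has the flag off and still needs more bytes than characters in UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $latin = "caf\x{E9}";   # code point 233 fits the one-byte internal form
my $wide  = "\x{3C3}";     # code point 963 forces the UTF-8 internal form

# The flag only reports the internal representation.
printf "flag(latin)=%d flag(wide)=%d\n",
    (utf8::is_utf8($latin) ? 1 : 0),
    (utf8::is_utf8($wide)  ? 1 : 0);
# prints "flag(latin)=0 flag(wide)=1"

# Flag off, yet the UTF-8 encoding is longer than the character count.
my $bytes = encode('UTF-8', $latin);
printf "characters=%d utf8_bytes=%d\n", length($latin), length($bytes);
# prints "characters=4 utf8_bytes=5"
```

So a string with the flag off still has to go through the :utf8 layer to come out as valid UTF-8.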

The data you fetch from the server has several levels of encoding:

  • content encodings such as compression
  • the charset given in the HTTP headers
  • the charset specified by the HTML
  • HTML entities like &quot;

The decoded_content method takes care of the first two levels; the rest is left to you. To decode entities, you can use the HTML::Entities module.

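use open qw/:std :utf8/;  # Apply the :utf8 layer to STD{IN,OUT,ERR}
use HTML::Entities;       # Exports decode_entities

...;  # Fetch $response as in the question

if ($response->is_success) {
  my $content = $response->decoded_content;  # Undoes compression and the header charset
  print decode_entities($content);           # Turns entities such as &quot; into characters
}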

Note that I cannot verify that this works; the URL you gave 404s (irritatingly, without sending the 404 status code).
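For the remaining level, the charset specified by the HTML, here is a hedged sketch that runs offline. The response is built by hand and is purely hypothetical: its header claims ISO-8859-1 while the body is UTF-8. It relies on the charset option of decoded_content, documented in HTTP::Message, to override the header. Deciding which charset to pass is up to you; the sketch simply assumes UTF-8. Incidentally, the "Ï" in the question's output is consistent with a UTF-8 sigma (bytes CF 83) being decoded as ISO-8859-1, though I cannot confirm that is what happened.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;
use HTML::Entities qw(decode_entities);

# Hypothetical response: header says ISO-8859-1, body is really UTF-8,
# as its own <meta> tag declares.
my $html = qq{<meta charset="utf-8"><p>\x{3C3} &lt; 1 &amp; c</p>};
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=ISO-8859-1' ],
    encode('UTF-8', $html),
);

# Default: the header is trusted, so the two UTF-8 bytes of the sigma
# come back as two separate Latin-1 characters.
my $by_header = $response->decoded_content;

# Override: name the charset to use instead of the header's.
my $by_html = $response->decoded_content(charset => 'UTF-8');

# Last level: turn entities into characters.
my $text = decode_entities($by_html);

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

With a live request you would call decoded_content on the object returned by $ua->get in the same way.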

Upvotes: 3
