Reputation: 827
I am trying to download the contents (formulas) of a web page in Perl. I used the LWP::UserAgent module to fetch the content and took care to check for UTF-8. The code is as follows:
use LWP::UserAgent;
my $ua = new LWP::UserAgent;
my $response = $ua->get('http://www.abc.org/patent/formulae');
my $content =$response->decoded_content();
if (utf8::is_utf8($content))
{
binmode STDOUT,':utf8';
}
else
{
binmode STDOUT,':raw';
}
print $content;
But I still get wide characters & the output is as follows:
"Formula = Ï Ì â¡ ( c + / c 0 ) â 1 "
Whereas I want:
"Fromula = Ï Ì â¡ ( c + / c 0 ) â 1 "
How can we avoid that?
Upvotes: 0
Views: 701
Reputation: 57590
The decoded_content method uses the encoding and charset information available in the HTTP headers to decode your data. However, HTML files may specify a different encoding.
If you want your output to be UTF-8, you should always apply the :utf8 layer. What you are trying to do with is_utf8 is wrong.
Perl strings can be stored internally in one of two encodings. This is irrelevant to you, the programmer. is_utf8 only reads the internal flag that records which representation is in use. Even when the flag is not set, a code point in your string may still take multiple bytes once the string is encoded as UTF-8.
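To illustrate the point about the internal flag, here is a minimal sketch using only core modules. The sample string and variable names are made up for the demonstration; it assumes nothing about the page in the question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative sample: "café", 4 characters, with é as code point U+00E9.
my $down = "caf\x{E9}";
utf8::downgrade($down);    # force the byte-oriented internal representation

my $up = $down;
utf8::upgrade($up);        # force the UTF-8 internal representation

# Same characters either way; only the internal flag differs.
my $bytes_down = encode('UTF-8', $down);
my $bytes_up   = encode('UTF-8', $up);

printf "equal strings: %s\n", ($down eq $up ? 'yes' : 'no');
printf "flag (downgraded): %d\n", (utf8::is_utf8($down) ? 1 : 0);
printf "flag (upgraded):   %d\n", (utf8::is_utf8($up)   ? 1 : 0);
printf "characters: %d, UTF-8 bytes: %d and %d\n",
    length($down), length($bytes_down), length($bytes_up);
```

Both copies compare equal and encode to the same UTF-8 bytes, so choosing an output layer based on is_utf8 tells you nothing about what the output needs.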
The data you fetch from the server has several levels of encoding: the content encoding and the charset announced in the HTTP headers, and HTML entities such as &quot;. The decoded_content method takes care of the first two levels; the rest is left to you. To decode entities, you can use the HTML::Entities module.
use open qw/:std :utf8/; # Apply the :utf8 layer to STD{IN,OUT,ERR}
use HTML::Entities;      # Exports decode_entities
...; # Fetch $response as in the question
if ($response->is_success) {
    my $content = $response->decoded_content;
    print decode_entities($content);
}
Note that I cannot verify that this works; the URL you gave 404s (irritatingly, without sending the 404 status code).
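Since the page itself is unavailable, the entity-decoding step can at least be checked offline. The sketch below is an assumption-laden stand-in: the HTML fragment is invented for illustration and is not what the real page contains. It requires the HTML::Entities module (part of the HTML-Parser distribution).

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Encode output as UTF-8 so the decoded characters print without warnings.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical fragment standing in for the fetched, already decoded content.
my $html = 'Formula = &rho; &equiv; (c / c0) &minus; 1 &amp; more';

# In non-void context decode_entities returns a decoded copy.
my $text = decode_entities($html);
print "$text\n";
```

If this prints the Greek letter and the math symbols correctly in your terminal, the remaining problem lies in how the real response is fetched or decoded, not in the output layer.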
Upvotes: 3