Reputation: 1875
I am using HTML::TreeBuilder to extract the contents of a URL: I call $tree->look_down and then take the text of the elements it returns. My problem is that when I write that text to a file, it shows up as junk. I have not been able to make any progress on this.
My sample code:
use HTML::TreeBuilder;
use HTML::Element;
use utf8;
$url = $ARGV[0];
$page = `wget -qO - "$url"| tee data.txt`;
#print "iam $page\n";
my $tree = HTML::TreeBuilder->new( );
$tree->parse_file('data.txt');
my @story = $tree->look_down(
    _tag  => 'div',
    class => 'storydescription'
);
my @title = $tree->look_down(
    _tag => 'title'
);
open(OUT, ">", "story.txt") or die "Cannot open story.txt: $!\n";
binmode(OUT, ":utf8");
foreach my $story (@story) {
    print OUT $story->as_text;
}
close(OUT);
I have tried binmode on the output filehandle, but it did not help. Only the non-ASCII (Unicode) characters come out garbled; plain ASCII text is written to the file correctly.
Upvotes: 0
Views: 232
Reputation: 241988
It's documented in HTML::TreeBuilder:
When you pass a filename to parse_file, HTML::Parser opens it in binary mode, which means it's interpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16, this will not do the right thing.
One solution is to open the file yourself using the proper :encoding layer, and pass the filehandle to parse_file. You can automate this process by using "html_file" in IO::HTML, which will use the HTML5 encoding sniffing algorithm to automatically determine the proper :encoding layer and apply it.
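Applied to the code in the question, that could look roughly like the sketch below. The helper name story_text is made up for illustration, and the sketch assumes the page has already been saved to a local file (data.txt in the question) and that IO::HTML is installed alongside HTML::TreeBuilder.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use IO::HTML;    # exports html_file by default

# Hypothetical helper: return the text of every <div class="storydescription">
# found in the HTML file $file.
sub story_text {
    my ($file) = @_;

    # html_file opens the file and applies the :encoding layer it detects.
    # Manual alternative, if the encoding is known to be UTF-8:
    #   open(my $fh, '<:encoding(UTF-8)', $file) or die "Cannot open $file: $!\n";
    my $fh = html_file($file);

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($fh);    # pass the filehandle, not the filename

    my @story = $tree->look_down(_tag => 'div', class => 'storydescription');
    my $text  = join "\n", map { $_->as_text } @story;

    $tree->delete;             # release the parse tree
    return $text;
}

# Usage: perl script.pl data.txt
if (my $file = shift @ARGV) {
    open(my $out, '>:encoding(UTF-8)', 'story.txt')
        or die "Cannot open story.txt: $!\n";
    print {$out} story_text($file);
    close($out);
}
```

With the input decoded on the way in, the :utf8 (or :encoding(UTF-8)) layer on the output handle then encodes real characters instead of re-encoding Latin-1 bytes, which is what produced the junk.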
Upvotes: 3