TerpFan

Reputation: 23

Perl Encoding - Saving File to UTF8

I have a script that downloads web pages, and I want to extract the text and store it in a uniform encoding (UTF-8 would be fine). The downloading (UserAgent), parsing (TreeBuilder) and text extraction seem fine, but I'm not sure I'm saving the results correctly.

The output files don't display correctly when opened in, for example, Notepad++; the original HTML displays fine in a text editor.

The HTML files typically have charset=windows-1256 or charset=UTF-8

So I figured that if I could get the UTF-8 case to work, the rest was just a recoding problem. Here is some of what I have tried, assuming I have an HTML file saved to disk.

my $tree = HTML::TreeBuilder->new;
$tree->parse_file($inhtml);
$tree->dump;

The output from dump, captured from STDOUT, displays correctly in a .txt file only after switching the encoding to UTF-8 in the text editor…

my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
if (utf8::is_utf8($formatter->format($tree))) {
    print "   Is UTF8\n";
}
else {
    print "   Not UTF8\n";
}

The result shows "Is UTF8" when the content declares UTF-8, and "Not UTF8" otherwise.

I have tried

opening the file with ">" and with ">:utf8"
binmode(MYFILE, ":utf8");
encode("utf8", $string); (where $string is the output of $formatter->format($tree))

But nothing seems to work correctly.

Do any experts out there know what I'm missing?

Thanks in advance!

Upvotes: 2

Views: 1052

Answers (2)

Ωmega

Reputation: 43673

This example should help you find what you need:

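use strict;
use warnings;
use feature qw(say);
use HTML::TreeBuilder qw( );
use Object::Destroyer qw( );

# Decode the input on read; replace cp1252 with the charset the page
# actually declares (for example cp1256 for windows-1256).
open(my $fh_in,  "<:encoding(cp1252)", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
# Encode everything printed to the output handle as UTF-8.
open(my $fh_out, ">:encoding(UTF-8)",  $ARGV[1]) or die "Cannot open $ARGV[1]: $!";

# Object::Destroyer calls $tree->delete when the wrapper goes out of scope.
my $tree = Object::Destroyer->new(HTML::TreeBuilder->new(), 'delete');
$tree->parse_file($fh_in);

# Extract the text of the first <h1> element and write it out.
my $h1Element = $tree->look_down("_tag", "h1") or die "No <h1> element found";
my $h1TrimmedText = $h1Element->as_trimmed_text();
say($fh_out $h1TrimmedText);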
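
The same decode-on-input, encode-on-output idea can be tried without any HTML parsing. The following is only a sketch using the core Encode and File::Temp modules: `recode_file` is a hypothetical helper name, the file names are made up, and the sample text is an assumption standing in for a windows-1256 page.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempdir);

# Hypothetical helper: copy $in to $out, decoding from $charset on read
# and encoding to UTF-8 on write via PerlIO layers.
sub recode_file {
    my ($in, $out, $charset) = @_;
    open(my $fh_in,  "<:encoding($charset)", $in)  or die "Cannot read $in: $!";
    open(my $fh_out, ">:encoding(UTF-8)",    $out) or die "Cannot write $out: $!";
    print {$fh_out} $_ while <$fh_in>;
    close($fh_out) or die "Cannot close $out: $!";
    close($fh_in);
    return;
}

my $dir  = tempdir(CLEANUP => 1);
my $text = "\x{0633}\x{0644}\x{0627}\x{0645}\n";   # four Arabic letters plus newline

# Write the sample as raw windows-1256 bytes, as a downloaded page might be.
open(my $raw, ">:raw", "$dir/in.txt") or die "Cannot write sample: $!";
print {$raw} encode("cp1256", $text);
close($raw) or die "Cannot close sample: $!";

recode_file("$dir/in.txt", "$dir/out.txt", "cp1256");

# Read the result back as raw bytes and decode it explicitly to check it.
open(my $check, "<:raw", "$dir/out.txt") or die "Cannot read result: $!";
my $bytes = do { local $/; <$check> };
close($check);
my $roundtrip = decode("UTF-8", $bytes);

print $roundtrip eq $text ? "round trip ok\n" : "round trip mismatch\n";
```

Note that `encode` returns the encoded bytes rather than changing its argument, so the return value has to be used (here it is printed to a `:raw` handle).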

Upvotes: 2

w.k

Reputation: 8376

I really like the module utf8::all (unfortunately not in core).

Just add "use utf8::all;" and you no longer have to worry about I/O, as long as you work only with UTF-8 files.
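
If installing a non-core module is not an option, the core `open` pragma can set UTF-8 as the default I/O layer. This is only a rough sketch of the I/O part, not a full replacement for utf8::all, which as far as I know covers more than file handles. The file name and sample string below are made up for illustration.

```perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));   # default layer for open() and STDIN/STDOUT/STDERR
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/sample.txt";          # hypothetical file name
my $text = "caf\x{E9} \x{263A}";       # sample string with two non-ASCII characters

# No layer is given in the mode, so the pragma's default applies.
open(my $out, ">", $path) or die "Cannot write $path: $!";
print {$out} $text;
close($out) or die "Cannot close $path: $!";

# Reading goes through the same default layer and yields characters again.
open(my $in, "<", $path) or die "Cannot read $path: $!";
my $back = <$in>;
close($in);

my $size = -s $path;                   # size on disk in bytes, not characters
print "characters: ", length($back), ", bytes on disk: $size\n";
```

The pragma is lexically scoped, so it affects only `open` calls in the file or block where it appears. Handles opened with an explicit layer, such as ">:raw", keep that layer.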

Upvotes: -3
