Reputation: 23749
Suppose we have the following file:
test.html
<!DOCTYPE html>
<html>
<head>
<title>Евгений Онегин</title>
<meta charset="utf-8">
</head>
<body>
<p><cite>Евгений Онегин</cite></p>
<pre>
Не мысля гордый свет забавить,
Вниманье дружбы возлюбя,
Хотел бы я тебе представить
Залог достойнее тебя,
</pre>
</body>
</html>
I want to get the contents of the body tag as HTML, using a parser:
<p><cite>Евгений Онегин</cite></p>
<pre>
Не мысля гордый свет забавить,
Вниманье дружбы возлюбя,
Хотел бы я тебе представить
Залог достойнее тебя,
</pre>
parser.pl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use utf8;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
$root->parse_file('test.html');
my $body = $root->find('body');
print $body->as_HTML;
When I save the output to an HTML file and view it in the browser as Unicode, the encoding is broken: instead of "Евгений Онегин" I get garbled characters.
When the HTML is stored inside the Perl file, it works correctly:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use utf8;
use Data::Dumper;
use HTML::TreeBuilder;
my $root = HTML::TreeBuilder->new;
$root->parse_file(\*DATA);
my $body = $root->find('body');
print $body->as_HTML;
__END__
<!DOCTYPE html>
<html>
<head>
<title>Евгений Онегин</title>
<meta charset="utf-8">
</head>
<body>
<p><cite>Евгений Онегин</cite></p>
<pre>
Не мысля гордый свет забавить,
Вниманье дружбы возлюбя,
Хотел бы я тебе представить
Залог достойнее тебя,
</pre>
</body>
</html>
So the error occurs when HTML::TreeBuilder reads from a file.
Questions:
The output contains hex entities such as &#x415;. Is it possible to save it as the character Е?
Upvotes: 2
Views: 760
Reputation: 10913
You may use charset autodetection, as documented in man HTML::TreeBuilder:
When you pass a filename to parse_file, HTML::Parser opens it in binary mode, which means it's interpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16, this will not do the right thing.
One solution is to open the file yourself using the proper :encoding layer, and pass the filehandle to parse_file. You can automate this process by using html_file in IO::HTML, which will use the HTML5 encoding sniffing algorithm to automatically determine the proper :encoding layer and apply it.
In the next major release of HTML-Tree, I plan to have it use IO::HTML automatically. If you really want your file opened in binary mode, you should open it yourself and pass the filehandle to parse_file.
Therefore, use IO::HTML for automatic charset detection of opened files.
use HTML::TreeBuilder;
use IO::HTML; # exports html_file by default
my $root = HTML::TreeBuilder->new;
$root->parse_file(html_file('test.html'));
https://stackoverflow.com/a/24577042/2139766
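To see why binary mode garbles the text, here is a minimal sketch that uses only the core Encode module. It is an illustration of the Latin-1 misreading described above, not what HTML::Parser does internally; the variable names are made up for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# The Cyrillic letter Е (U+0415) as it is stored in a UTF-8 file: two bytes.
my $bytes = encode('UTF-8', 'Е');

# Binary mode: each byte is taken as one Latin-1 character.
my $as_latin1 = decode('ISO-8859-1', $bytes);

# Proper decoding: the two bytes form a single character again.
my $as_utf8 = decode('UTF-8', $bytes);

printf "bytes:      %s\n",
    join(' ', map { sprintf('%02X', ord($_)) } split(//, $bytes));
printf "as Latin-1: %d characters (%s)\n",
    length($as_latin1),
    join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $as_latin1));
printf "as UTF-8:   %d character (U+%04X)\n",
    length($as_utf8), ord($as_utf8);
```

Each Cyrillic letter in the file turns into two unrelated characters when read this way, which is the garbling seen in the browser.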
Upvotes: 3
Reputation: 126742
The parse_file method takes either a file name or a file handle, so the simplest solution is to open the file yourself with an open call using :utf8 as the layer, and then pass the file handle to be parsed.
It looks like this. I have used the new_from_file constructor only because it saves a statement; it has exactly the same effect as your own code.
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
use utf8;
use HTML::TreeBuilder;
my $file = 'test.html';
open my $fh, '<:utf8', $file or die qq{Unable to open "$file" for parsing: $!};
my $root = HTML::TreeBuilder->new_from_file($fh);
my $body = $root->find('body');
print $body->as_HTML;
As for changing the entities to letters, I'm not clear what you mean. Do you just want to remove all hex entities and replace them with the equivalent characters? You may get some mileage out of the HTML::Entities module.
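If the goal is literal characters in the output, one possible approach is sketched below. It assumes that the first argument of as_HTML (the set of characters to encode as entities) behaves as described in the HTML::Element documentation, and it parses a shortened inline copy of the document so the sketch is self-contained. The variable names are illustrative.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use HTML::TreeBuilder;
use HTML::Entities qw(decode_entities);

# Literal characters are printed, so STDOUT needs an encoding layer.
binmode STDOUT, ':encoding(UTF-8)';

# Shortened inline copy of test.html (an assumption made for this sketch).
my $root = HTML::TreeBuilder->new_from_content(
    '<html><body><p><cite>Евгений Онегин</cite></p></body></html>'
);
my $body = $root->find('body');

# Default: non-ASCII characters, including Cyrillic, become entities.
my $with_entities = $body->as_HTML;

# Escape only <, > and &, leaving all other characters as they are.
my $literal = $body->as_HTML('<>&');

# Alternative: turn an entity in an existing string back into a character.
my $decoded = decode_entities('&#x415;');

print "$literal\n";
$root->delete;
```

Running decode_entities over a whole page would also decode &lt; and &amp;, which can break the markup, so restricting the entity set passed to as_HTML is probably the safer of the two.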
Upvotes: 5