user4035

Reputation: 23749

Encoding broken after using HTML::TreeBuilder as_HTML

Suppose we have the following file:

test.html

<!DOCTYPE html>
<html>
  <head>
    <title>Евгений Онегин</title>
    <meta charset="utf-8">
  </head>
  <body>
    <p><cite>Евгений Онегин</cite></p>
    <pre>
      Не мысля гордый свет забавить,
      Вниманье дружбы возлюбя,
      Хотел бы я тебе представить
      Залог достойнее тебя,
    </pre>
</body>
</html>

I want to get the contents of the body tag as HTML, using a parser:

<p><cite>Евгений Онегин</cite></p>
<pre>
  Не мысля гордый свет забавить,
  Вниманье дружбы возлюбя,
  Хотел бы я тебе представить
  Залог достойнее тебя,
</pre>

parser.pl

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;
use utf8;

use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
$root->parse_file('test.html');

my $body = $root->find('body');
print $body->as_HTML;

When I save the output to an HTML file and view it in a browser as Unicode, the encoding is broken: instead of "Евгений Онегин" I get "Евгений Онегин".

Working case

When the HTML is stored inside the Perl file itself, it works correctly:

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;
use utf8;

use Data::Dumper;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
$root->parse_file(\*DATA);

my $body = $root->find('body');
print $body->as_HTML;

__END__
<!DOCTYPE html>
<html>
  <head>
    <title>Евгений Онегин</title>
    <meta charset="utf-8">
  </head>
  <body>
    <p><cite>Евгений Онегин</cite></p>
    <pre>
      Не мысля гордый свет забавить,
      Вниманье дружбы возлюбя,
      Хотел бы я тебе представить
      Залог достойнее тебя,
    </pre>
</body>
</html>

So the error occurs when HTML::TreeBuilder reads from a file.

Questions:

  1. How can I fix the encoding?
  2. The module encodes every Russian character as an entity, such as &#x415;. Is it possible to output the character Е itself instead?

Upvotes: 2

Views: 760

Answers (2)

AnFi

Reputation: 10913

You can use charset autodetection, as documented in the HTML::TreeBuilder manual (man HTML::TreeBuilder), which states:

When you pass a filename to parse_file, HTML::Parser opens it in binary mode, which means it's interpreted as Latin-1 (ISO-8859-1). If the file is in another encoding, like UTF-8 or UTF-16, this will not do the right thing.

One solution is to open the file yourself using the proper :encoding layer, and pass the filehandle to parse_file. You can automate this process by using html_file in IO::HTML, which will use the HTML5 encoding sniffing algorithm to automatically determine the proper :encoding layer and apply it.

In the next major release of HTML-Tree, I plan to have it use IO::HTML automatically. If you really want your file opened in binary mode, you should open it yourself and pass the filehandle to parse_file.

So, use IO::HTML to detect the charset automatically when opening the file:

use HTML::TreeBuilder;
use IO::HTML;  # exports html_file by default

my $root = HTML::TreeBuilder->new;
$root->parse_file(html_file('test.html'));

https://stackoverflow.com/a/24577042/2139766
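To see why a binary-mode read breaks Cyrillic text, here is a minimal sketch using only the core Encode module. It does not involve HTML::TreeBuilder at all; the variable names are illustrative. It encodes a short string to UTF-8 octets and then decodes those same octets as ISO-8859-1, which is roughly what happens when the file is read without an :encoding layer: every byte becomes its own character.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use utf8;                       # the string literal below is written in UTF-8

use Encode qw(encode decode);

my $text  = 'Евгений';                      # 7 Cyrillic characters
my $bytes = encode('UTF-8', $text);         # 14 octets, 2 per character

# Misreading: treat each octet as one Latin-1 character.
my $wrong = decode('ISO-8859-1', $bytes);

# Correct reading: decode the octets as UTF-8.
my $right = decode('UTF-8', $bytes);

printf "octets: %d, misread chars: %d, correct chars: %d\n",
    length $bytes, length $wrong, length $right;
print "round trip ok\n" if $right eq $text;
```

The misread string has twice as many characters as the original, which matches the doubled, garbled output seen in the browser.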

Upvotes: 3

Borodin

Reputation: 126742

The parse_file method will take either a file name or a file handle, so the simplest solution is to open the file with an open call using :utf8 as the mode, and then pass the file handle to be parsed.

It looks like this. I have used the new_from_file constructor only because it saves a statement. It has exactly the same effect as your own code.

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;
use utf8;

use HTML::TreeBuilder;

my $file = 'test.html';

# Decode the file as UTF-8 while reading, rather than letting the parser open it in binary mode
open my $fh, '<:utf8', $file or die qq{Unable to open "$file" for parsing: $!};
my $root = HTML::TreeBuilder->new_from_file($fh);

my $body = $root->find('body');
print $body->as_HTML;

As for changing the entities to letters, I'm not clear what you mean. Do you just want to remove all hex entities and replace them with the equivalent character? You may get some mileage out of the HTML::Entities module.
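If the goal is to get literal characters instead of entities in the output, one possible approach is sketched below. It relies on the optional first argument of as_HTML, which lists the characters to encode as entities; passing '<>&' should leave everything else, including Cyrillic letters, untouched. The inline HTML string and the variable names are assumptions made for illustration, and STDOUT is given a UTF-8 layer so that the wide characters print cleanly. The last lines show decode_entities from HTML::Entities on a single entity.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use utf8;

use HTML::TreeBuilder;
use HTML::Entities qw(decode_entities);

# Encode output as UTF-8 so literal Cyrillic characters can be printed.
binmode STDOUT, ':encoding(UTF-8)';

# Illustrative input; in the real script this would come from test.html.
my $root = HTML::TreeBuilder->new_from_content(
    '<html><body><p><cite>Евгений Онегин</cite></p></body></html>'
);

my $body = $root->find('body');

# Only <, > and & are turned into entities; other characters stay as they are.
my $html = $body->as_HTML('<>&');
print $html, "\n";

# decode_entities turns a numeric entity back into its character.
my $letter = decode_entities('&#x415;');
print $letter, "\n";

$root->delete;   # free the tree explicitly
```

Running decode_entities over a whole document would also decode &lt; and &amp;, which can corrupt the markup, so restricting the entity set in as_HTML is probably the safer of the two.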

Upvotes: 5
