Encoding broken after using HTML::TreeBuilder as_HTML

Question

Suppose we have the following files:

test.html



  
    Евгений Онегин
    
  
  
    Евгений Онегин
          Не мысля гордый свет забавить,
      Вниманье дружбы возлюбя,
      Хотел бы я тебе представить
      Залог достойнее тебя,

I wanted to get the contents of the body tag, in HTML format, using a parser:

Евгений Онегин
  Не мысля гордый свет забавить,
  Вниманье дружбы возлюбя,
  Хотел бы я тебе представить
  Залог достойнее тебя,

parser.pl

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;
use utf8;

use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
$root->parse_file('test.html');

my $body = $root->find('body');
print $body->as_HTML;

When I saved the output to an HTML file and viewed it in the browser as Unicode, the encoding was broken: instead of "Евгений Онегин" I got "Ð•Ð²Ð³ÐµÐ½Ð¸Ð¹ ÐžÐ½ÐµÐ³Ð¸Ð½".

Working case

When the HTML is stored inside the Perl file itself, it works correctly:

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;
use utf8;

use Data::Dumper;
use HTML::TreeBuilder;

my $root = HTML::TreeBuilder->new;
$root->parse_file(\*DATA);

my $body = $root->find('body');
print $body->as_HTML;

__END__


  
    Евгений Онегин
    
  
  
    Евгений Онегин
          Не мысля гордый свет забавить,
      Вниманье дружбы возлюбя,
      Хотел бы я тебе представить
      Залог достойнее тебя,

So the error occurs only when HTML::TreeBuilder reads from a file.

Questions:

How can I fix the encoding?
The module encodes every Russian character as an entity, e.g. &#x415;. Is it possible to output the character Е itself instead?

Borodin · Accepted Answer

The parse_file method accepts either a file name or a file handle. When it is given a name it reads the file as raw bytes, so the simplest solution is to open the file yourself, with :utf8 as the layer in the open mode, and then pass the file handle to be parsed.

It looks like this. I have used the new_from_file constructor only because it saves a statement. It has exactly the same effect as your own code.

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;
use utf8;

use HTML::TreeBuilder;

my $file = 'test.html';

open my $fh, '<:utf8', $file or die qq{Unable to open "$file" for parsing: $!};
my $root = HTML::TreeBuilder->new_from_file($fh);

my $body = $root->find('body');
print $body->as_HTML;
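
To see why decoding on input matters, here is a minimal sketch that uses only the core Encode module rather than the parser. It assumes the file is UTF-8 encoded, as in the question; the variable names are made up for illustration.

use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

# The UTF-8 bytes of "Е" (U+0415), as they sit in the file on disk.
my $bytes = encode('UTF-8', 'Е');           # two bytes: 0xD0 0x95

# Read without an input layer, these are two separate one-byte "characters".
my $undecoded_length = length $bytes;

# Decoding, which is what the '<:utf8' layer does on read, gives one character.
my $chars = decode('UTF-8', $bytes);
my $decoded_length = length $chars;

binmode STDOUT, ':encoding(UTF-8)';
printf "undecoded: %d characters, decoded: %d character (U+%04X)\n",
    $undecoded_length, $decoded_length, ord $chars;

The two undecoded bytes are what end up rendered as "Ð•" in the broken output.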

As for changing the entities to letters, I'm not clear what you mean. Do you just want to remove all hex entities and replace them with the equivalent character? You may get some mileage out of the HTML::Entities module.
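
If the goal is simply to get literal letters in the output, one possible approach is to pass as_HTML a string naming the only characters that should be escaped. The sketch below assumes a small inline document standing in for test.html (its markup is hypothetical), and also shows decode_entities from HTML::Entities on a single entity.

use strict;
use warnings;
use utf8;

use HTML::TreeBuilder;
use HTML::Entities qw(decode_entities);

# Hypothetical inline document standing in for test.html
my $html = '<html><head><title>Евгений Онегин</title></head>'
         . '<body><h1>Евгений Онегин</h1></body></html>';

my $root = HTML::TreeBuilder->new_from_content($html);
my $body = $root->find('body');

# Default: every non-ASCII character is written as a numeric entity.
my $escaped = $body->as_HTML;

# Escape only <, > and & so that the Cyrillic letters stay as letters.
my $literal = $body->as_HTML('<>&');

# Alternative: turn an entity in existing text back into its character.
my $decoded = decode_entities('&#x415;');   # the single character U+0415

# The output now contains wide characters, so encode it on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print "$literal\n";

$root->delete;

Note that running decode_entities over a whole generated page would also decode &lt; and &amp;, which can change the markup, so restricting the set in as_HTML is probably the safer of the two.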
