tomsk

Reputation: 997

Reading file breaks encoding in Perl

I have a script that reads HTML files in Perl. It works, but it breaks the encoding.

This is my script:

use utf8;
use Data::Dumper;

open my $fr, '<', 'file.html' or die "Can't open file $!";
my $content_from_file = do { local $/; <$fr> };

print Dumper($content_from_file);

Content of file.html:

<span class="previews-counter">Počet hodnotení: [%product.rating_votes%]</span>
<a href="#" title="[%L10n.msg('Zobraziť recenzie')%]" class="previews-btn js-previews-btn">[%L10n.msg('Zobraziť recenzie')%]</a>

Output from reading:

<span class=\"previews-counter\">Po\x{10d}et hodnoten\x{ed}: [%product.rating_votes%]</span>
<a href=\"#\" title=\"[%L10n.msg('Zobrazi\x{165} recenzie')%]\" class=\"previews-btn js-previews-btn\">[%L10n.msg('Zobrazi\x{165} recenzie')%]</a>

As you can see, a lot of characters are escaped. How can I read this file and show its content as it is?

Upvotes: 0

Views: 78

Answers (1)

brian d foy

Reputation: 132905

You open the file without naming an encoding, so Perl falls back to its default:

open my $fh, '<', ...;

If that encoding doesn't match the actual encoding, Perl might translate some characters incorrectly. If you know the encoding, specify it in the open mode:

open my $fh, '<:utf8', ...;
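To make the difference concrete, here is a minimal sketch that reads the same two bytes with and without a layer. The temporary file is a hypothetical stand-in for file.html, and the sketch uses the stricter `:encoding(UTF-8)` layer rather than `:utf8`; it assumes no `PERL_UNICODE` setting or `use open` pragma is changing the defaults.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file: the two UTF-8 bytes 0xC4 0x8D encode U+010D.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;                      # write raw bytes, no encoding layer
print {$tmp_fh} "\xC4\x8D";
close $tmp_fh or die "Can't close $path: $!";

# No layer given: Perl hands back the bytes untouched.
open my $raw_fh, '<', $path or die "Can't open $path: $!";
my $bytes = do { local $/; <$raw_fh> };
close $raw_fh;

# Explicit layer: the same bytes are decoded into a single character.
open my $utf8_fh, '<:encoding(UTF-8)', $path or die "Can't open $path: $!";
my $chars = do { local $/; <$utf8_fh> };
close $utf8_fh;

# With default settings this reports 2 bytes versus 1 character.
printf "without layer: %d, with layer: %d\n", length($bytes), length($chars);
```

The file on disk is identical in both cases; only the layer decides whether Perl treats its contents as bytes or as decoded text.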

You aren't done yet, though. Now that you (probably) have a properly decoded string, you want to output it, and you have the same problem again: the encoding of the standard output filehandle has to match whatever you are printing to. If you've set up your terminal (or whatever) to expect UTF-8, you need to actually output UTF-8. One way to fix that is to make the standard filehandles use UTF-8:

use open qw(:std :utf8);

You have use utf8, but that only tells Perl that the source code of your program file is UTF-8. It does nothing for the filehandles you read from or write to.
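Putting the pieces together, a minimal end-to-end sketch might look like the following. It creates its own hypothetical sample file in place of file.html, relies on the `open` pragma for both the input layer and the standard filehandles, and assumes the terminal expects UTF-8. It prints with plain `print` rather than Data::Dumper, since Dumper is a debugging aid that escapes non-ASCII characters in decoded strings, which is what the output in the question shows.

```perl
use strict;
use warnings;
use utf8;                                # this source file contains UTF-8 literals
use open qw(:std :encoding(UTF-8));      # default layer for open() plus STDIN/STDOUT/STDERR
use File::Temp qw(tempfile);

# Hypothetical stand-in for file.html so the sketch is self-contained.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':encoding(UTF-8)';     # File::Temp opens outside this pragma's scope
print {$tmp_fh} "Počet hodnotení\n";
close $tmp_fh or die "Can't close $path: $!";

# Decode on input: the pragma supplies the UTF-8 layer for this open.
open my $fh, '<', $path or die "Can't open $path: $!";
my $content = do { local $/; <$fh> };
close $fh;

# Encode on output: STDOUT already has the UTF-8 layer from :std.
print $content;
```

Each of the three declarations covers a different boundary: `use utf8` covers the source file, the input layer covers the data file, and `:std` covers what goes to the terminal.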

I've written a much longer primer on Perl and Unicode in the back of Learning Perl. The Stack Overflow question "Why does modern Perl avoid UTF-8 by default?" also has lots of good advice.

Upvotes: 4
