jonah_w

Reputation: 1032

Prevent character encoding when tidying HTML with XML::LibXML

I'm using the following code to tidy a snippet of untidy HTML.

    perl -Mutf8 -MXML::LibXML -E'
    my $filename = "1.html";
    open my $fh, "<", $filename or die "Cannot open $filename: $!";
    binmode $fh;  # read raw bytes and let the parser decide the encoding
    my $dom = XML::LibXML->load_html(
        IO              => $fh,
        recover         => 1,
        suppress_errors => 1,
        huge            => 10000000,
    );
    say $dom->toString();
    ' > tidy.html

The untidy HTML (missing the closing </p> tag):

1.html:

<p>aΩ<span>test</span>

As you can see, there is one special character, Ω, inside the <p> tag. After tidying, the Ω is encoded as &#xCE;&#xA9;, as shown in the tidied HTML:

tidy.html:

<html><body><p>a&#xCE;&#xA9;<span>test</span></p></body></html>

Can I keep Ω in its original form, instead of its encoded form, in the tidied output?

Or are there any alternative ways to tidy the HTML that won't encode special characters?

Upvotes: 1

Views: 68

Answers (1)

daxim

Reputation: 39158

The problem is not quite what you think.

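The HTML parser treats the input as Latin-1, as specified by the standard, but your input file is really in UTF-8. In UTF-8, Ω is stored as the two bytes CE A9; read as Latin-1, those bytes become the two characters Î and ©, which is why the output contains &#xCE;&#xA9;. To make it work, you need to declare the correct encoding in the input, e.g.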

<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
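To see why the mismatch produces exactly those two entities, here is a minimal sketch using only the core Encode module. It does not involve XML::LibXML at all; the variable names are made up for illustration, and it only mimics what a Latin-1 reading of UTF-8 bytes would yield.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Ω (U+03A9) as it is stored in a UTF-8 file: the two bytes 0xCE 0xA9.
my $bytes = encode('UTF-8', "\x{3A9}");

# Misreading those bytes as Latin-1 yields two characters, U+00CE and U+00A9.
my $misread = decode('ISO-8859-1', $bytes);

# Writing each misread character as a numeric entity reproduces the tidy output.
my $entities = join '', map { sprintf('&#x%X;', ord($_)) } split //, $misread;

# Decoding with the correct encoding gives back a single character.
my $correct = decode('UTF-8', $bytes);

print "misread as Latin-1: $entities\n";
print "decoded as UTF-8:   ", length($correct), " character(s)\n";
```

If editing the input file is not an option, the XML::LibXML::Parser documentation also lists an `encoding` parser option for HTML input, which may be worth trying with `load_html`; check it against your installed version before relying on it.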

Upvotes: 3
