Reputation: 1032
I'm using the following code to tidy a snippet of untidy HTML.
perl -Mutf8 -MXML::LibXML -E'
my $filename = "1.html";
open $fh, "<", $filename;
binmode $fh;
my $dom = XML::LibXML->load_html(
IO => $fh,
recover => 1,
suppress_errors => 1,
huge => 10000000,
);
say $dom->toString();
' > tidy.html
The untidy HTML (missing the closing </p> tag):
1.html:
<p>aΩ<span>test</span>
As you can see, there's one special character, Ω, in the <p> tag. After the tidy process, the Ω is no longer written as the literal character but in an encoded form, as follows (tidied HTML):
tidy.html:
<html><body><p>aΩ<span>test</span></p></body></html>
Can I keep Ω in its original form, instead of its encoded form, in the tidy output?
Or are there any alternative ways to tidy the HTML that won't encode special characters?
Upvotes: 1
Views: 68
Reputation: 39158
The problem is not quite what you think.
The HTML parser treats the input as Latin1 as specified by the standard, but your input file is really in UTF-8. To make it work, you need to declare the correct encoding, e.g.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
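To see why the character comes out looking "encoded", here is a minimal sketch of the mismatch itself, using only the core Encode module. It does not involve XML::LibXML at all. The variable names and the `codepoints` helper are my own, purely for illustration. It decodes the two UTF-8 bytes of Ω once as UTF-8 and once as Latin-1.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper for illustration: list a string's characters as U+XXXX.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

# The UTF-8 encoding of Ω (U+03A9) is the two bytes 0xCE 0xA9.
my $bytes = "\xCE\xA9";

# Decoded as UTF-8, the two bytes form the single intended character.
my $as_utf8 = decode('UTF-8', $bytes);

# Decoded as Latin-1, each byte becomes its own character.
my $as_latin1 = decode('ISO-8859-1', $bytes);

print 'UTF-8:   ', codepoints($as_utf8),   "\n";   # U+03A9
print 'Latin-1: ', codepoints($as_latin1), "\n";   # U+00CE U+00A9
```

Under the Latin-1 reading, the parser sees two unrelated characters instead of one Ω, which is why the serialized output no longer contains the original character.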
Upvotes: 3