simon_

Reputation: 43

parse html with XML::LibXML while not touching entities

I'm using XML::LibXML to parse a chunk of HTML in order to change the title attribute of all the anchor elements. The problem is that XML::LibXML tampers with unencoded entities, and changes e.g. '&' to '&amp;' in the URL params in the href attributes.

How do I tell XML::LibXML not to encode or decode any of these entities?

#!/usr/bin/perl -w

use strict;
use XML::LibXML;

my $parser = XML::LibXML->new(recover => 2);

my $html = '
<div>
    <span>this & that &amp; what?</span>
    <a title="link1" href="http://url.com/foo?a=1&b=2">Link1</a>
    <a title="link2" href="http://url.com/foo?a=1&b=2">Link2</a>
</div>';

my $doc = $parser->load_html(string => $html);

for my $node ($doc->findnodes('//*[@title]')) {
    $node->setAttribute('title', 'newtitle');
}

print $doc->toString(), "\n";

__END__

which produces this output:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
    <span>this &amp; that &amp; what?</span>
    <a title="newtitle" href="http://url.com/foo?a=1&amp;b=2">Link1</a>
    <a title="newtitle" href="http://url.com/foo?a=1&amp;b=2">Link2</a>
</div></body></html>

As you'll see XML::LibXML has altered the urls, and also the text inside the span tag!

Upvotes: 1

Views: 2546

Answers (1)

ikegami

Reputation: 385496

"As you'll see XML::LibXML has altered the urls, and also the text inside the span tag!"

You are mistaken. The URL did not change. Both the original HTML and the generated HTML produce the same URL (http://url.com/foo?a=1&b=2). The HTML is different, but the URL is not.

The same goes for the text in the span. Both the original HTML and the generated HTML produce the same text (this & that & what?). The HTML is different, but the text displayed is not.

To my knowledge, there's no way to control which characters XML::LibXML's toString escapes. Apparently, it chooses to escape & as &amp; even when that's not technically required in HTML.

And why not? There's no harm in having "&" escaped.

«this & that &amp; what?» and «this &amp; that &amp; what?» mean the same in HTML.

«href="http://url.com/foo?a=1&amp;b=2"» and «href="http://url.com/foo?a=1&b=2"» mean the same in HTML.
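One way to check this yourself is to parse both forms and compare the decoded attribute values. The following is a minimal sketch, assuming XML::LibXML is installed; href_value is a hypothetical helper made up for this illustration, not part of the module.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse an HTML fragment and return the decoded
# href value of its first <a> element.
sub href_value {
    my ($html) = @_;
    my $doc = XML::LibXML->new(recover => 2)->load_html(string => $html);
    my ($link) = $doc->findnodes('//a');
    return $link->getAttribute('href');
}

# The same link, once with a bare "&" and once with "&amp;".
my $raw     = href_value('<a href="http://url.com/foo?a=1&b=2">Link1</a>');
my $escaped = href_value('<a href="http://url.com/foo?a=1&amp;b=2">Link1</a>');

print "raw:     $raw\n";
print "escaped: $escaped\n";
print $raw eq $escaped ? "same URL\n" : "different URLs\n";
```

If the two values compare equal, the escaping only affects how the markup is written, not the URL a browser would follow.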

PS: If you want to produce HTML, you should be using ->toStringHTML(), not ->toString(). The latter produces XML.
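As a rough sketch of that change, here is a cut-down version of the script from the question that serializes with toStringHTML() instead. The retitle sub is a hypothetical wrapper added only for this example. No output is shown because the exact escaping may depend on the libxml2 version in use.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical wrapper: set every title attribute to "newtitle" and
# return the document serialized as HTML rather than XML.
sub retitle {
    my ($html) = @_;
    my $doc = XML::LibXML->new(recover => 2)->load_html(string => $html);

    for my $node ($doc->findnodes('//*[@title]')) {
        $node->setAttribute('title', 'newtitle');
    }

    return $doc->toStringHTML();
}

print retitle('<div><a title="link1" href="http://url.com/foo?a=1&b=2">Link1</a></div>');
```

Unlike toString(), this should not emit the <?xml ...?> declaration seen in the output above.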

Upvotes: 2
