Reputation: 929
I am trying to add an attribute to an existing XML node using XML::LibXML. When I do this, all entities such as &dagger; and &para; are converted into plain UTF-8 characters. How can I avoid this conversion and keep the original entities?
XML:
<?xml version="1.0"?>
<!DOCTYPE test SYSTEM "test.dtd">
<test>
<name>
<firstName>firstname&Dagger;</firstName>
<lastName>last name</lastName>
</name>
<name>
<firstName>first name</firstName>
<lastName>last name</lastName>
</name>
</test>
Code:
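use strict;
use warnings;
use XML::LibXML;

# Path to the XML file shown above (assumed name; adjust as needed)
my $instance = 'test.xml';

my $parser = XML::LibXML->new;
$parser->validation(1);
$parser->load_ext_dtd(1);
my $doc = $parser->parse_file($instance);

foreach my $new ($doc->findnodes('test'))
{
    # '//firstName' searches the whole document, so this picks the first firstName element
    my ($name) = $new->findnodes('//firstName');
    print $name."\n";    # the node stringifies to its XML markup
}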
The output is <firstName>firstname‡</firstName>, with the entity converted to a character, along with the warning "Wide character in print at perlfile.pl".
If I load HTML::Entities and call print encode_entities($name)."\n"; I get entities back, but I don't want to do that, because my text may also contain literal UTF-8 characters that must not be turned into entities. I want the text in the output to stay exactly as it is in the source. Is there any way to do this?
Upvotes: 1
Views: 444
Reputation: 2589
This can be done by turning off entity expansion with expand_entities(0):
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;

# The output may contain non-ASCII characters, so write STDOUT as UTF-8
binmode STDOUT, ':utf8';

$parser->validation(1);
$parser->load_ext_dtd(1);

# Turn off entity expansion so that entity references are retained
$parser->expand_entities(0);

my $doc = $parser->parse_file("test.xml");

foreach my $new ($doc->findnodes('test'))
{
    my ($name) = $new->findnodes('//firstName');
    print $name."\n";
}
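To try this without test.dtd, here is a minimal sketch that declares the entity in an internal DTD subset and parses from a string. The Dagger entity declaration, the sample markup, and the helper name first_name_markup are assumptions made for illustration and are not part of the original files.

```perl
use strict;
use warnings;
use XML::LibXML;

binmode STDOUT, ':utf8';

# Assumed sample input: the entity is declared inline instead of in test.dtd
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE test [
  <!ENTITY Dagger "&#x2021;">
]>
<test><name><firstName>firstname&Dagger;</firstName></name></test>
XML

# Hypothetical helper: parse the sample with entity expansion on (1) or off (0)
# and return the serialized markup of the first firstName element
sub first_name_markup {
    my ($expand) = @_;
    my $parser = XML::LibXML->new;
    $parser->expand_entities($expand);
    my $doc = $parser->parse_string($xml);
    my ($node) = $doc->findnodes('//firstName');
    return $node->toString;
}

print "expanded:  ", first_name_markup(1), "\n";    # entity replaced by its character
print "preserved: ", first_name_markup(0), "\n";    # entity reference kept in the markup
```

Comparing the two printed lines should make it easy to confirm whether the installed XML::LibXML version behaves this way before applying the option to the real file.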
Upvotes: 3
Reputation: 86774
This will probably require tweaking the serializer, if it's possible at all.
Entities are syntactic sugar and get replaced with the 'real' characters while parsing. The entity strings &[entity-name]; do not exist in the DOM representation.
If the output encoding (UTF-8 in your case) supports the characters natively, that is what the serializer will write, because it has no idea what the characters looked like in the source document.
I took a quick look at the documentation and didn't see anything of use for controlling entity output.
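To check what the DOM actually holds, a small sketch along these lines counts entity reference nodes under firstName with entity expansion on and off. It assumes the expand_entities parser option used in the other answer, an internal DTD subset in place of test.dtd, and a hypothetical helper named count_entity_refs.

```perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);    # imports node type constants such as XML_ENTITY_REF_NODE

# Assumed sample input with an internal DTD subset (not the original test.dtd)
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE test [
  <!ENTITY Dagger "&#x2021;">
]>
<test><name><firstName>firstname&Dagger;</firstName></name></test>
XML

# Hypothetical helper: count entity reference children of the first firstName
sub count_entity_refs {
    my ($expand) = @_;
    my $parser = XML::LibXML->new;
    $parser->expand_entities($expand);
    my $doc = $parser->parse_string($xml);
    my ($node) = $doc->findnodes('//firstName');
    return scalar grep { $_->nodeType == XML_ENTITY_REF_NODE } $node->childNodes;
}

printf "expansion on:  %d entity reference node(s)\n", count_entity_refs(1);
printf "expansion off: %d entity reference node(s)\n", count_entity_refs(0);
```

With expansion on, the text node already contains the replaced character, which matches the description above. A non-zero count with expansion off would mean the reference is kept as its own node, so the serializer can write it back unchanged.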
Upvotes: 3