VSe

Reputation: 929

Do not disturb encoded entities (wide characters) in XML::LibXML

I am trying to add an attribute to an existing XML node using XML::LibXML. When I do this, all encoded entities such as &dagger; and &para; are converted into plain UTF-8 characters. How can I avoid this conversion and retain the original encoding?

XML:

    <?xml version="1.0"?>
    <!DOCTYPE test SYSTEM "test.dtd">
    <test>
      <name>
        <firstName>firstname&Dagger;</firstName>
        <lastName>last name</lastName>
      </name>
      <name>
        <firstName>first name</firstName>
        <lastName>last name</lastName>
      </name>
    </test>

Code:

    use strict;
    use warnings;
    use XML::LibXML;

    my $instance = 'test.xml';    # path to the XML file shown above
    my $parser   = XML::LibXML->new;
    $parser->validation(1);
    $parser->load_ext_dtd(1);
    my $doc = $parser->parse_file($instance);
    foreach my $new ($doc->findnodes('test')) {
        my ($name) = $new->findnodes('//firstName');
        print $name . "\n";
    }

The output is <firstName>firstname‡</firstName>, with the entity converted to the literal character, along with the warning "Wide character in print at perlfile.pl".

If I use print encode_entities($name)."\n"; from HTML::Entities, I get the entities back. I don't want to do this, though, because my text may also contain literal UTF-8 characters that were never entities. I want the output to retain the text exactly as it is in the source. Is there any way to do this?

Upvotes: 1

Views: 444

Answers (2)

ssr1012

Reputation: 2589

This can be done with the parser's expand_entities() option:

    use strict;
    use warnings;
    use XML::LibXML;

    my $parser = XML::LibXML->new;

    # Encode output as UTF-8 so wide characters print without a warning
    binmode STDOUT, ':encoding(UTF-8)';

    $parser->validation(1);
    $parser->load_ext_dtd(1);

    # Turn off entity expansion so references such as &Dagger; are retained
    $parser->expand_entities(0);
    my $doc = $parser->parse_file("test.xml");
    foreach my $new ($doc->findnodes('test')) {
        my ($name) = $new->findnodes('//firstName');
        print $name . "\n";
    }

See the XML::LibXML::Parser documentation for more information.
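
To see the effect in isolation, here is a minimal sketch that parses the same input twice, once with each setting. It is an illustration under assumptions: test.dtd is not shown in the question, so the Dagger entity is declared in an internal DTD subset as a stand-in, the input is parsed from a string instead of a file, and first_name_markup is a helper name made up for this example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in for test.xml: the entity is declared inline (assumption),
# because the real test.dtd is not part of the question.
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE test [ <!ENTITY Dagger "&#x2021;"> ]>
<test><name><firstName>firstname&Dagger;</firstName></name></test>
XML

# Hypothetical helper: parse $xml with the given expand_entities setting
# and return the serialized <firstName> element.
sub first_name_markup {
    my ($expand) = @_;
    my $parser = XML::LibXML->new(expand_entities => $expand);
    my $doc    = $parser->parse_string($xml);
    my ($node) = $doc->findnodes('//firstName');
    return $node->toString;    # character string, may hold wide characters
}

my $expanded = first_name_markup(1);    # reference replaced by the character
my $kept     = first_name_markup(0);    # reference left in the tree

binmode STDOUT, ':encoding(UTF-8)';
print "expand_entities(1): $expanded\n";
print "expand_entities(0): $kept\n";
```

With expansion turned off, the reference stays in the tree as its own node, so literal UTF-8 characters elsewhere in the text are not affected, which is the distinction the question asks for.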

Upvotes: 3

Jim Garrison

Reputation: 86774

This will probably require tweaking the serializer, if it's possible at all.

Entities are syntactic sugar and get replaced with the 'real' characters while parsing. The entity strings &[entity-name]; do not exist in the DOM representation.

If the output encoding (UTF-8 in your case) supports the characters natively, that's what the serializer is going to write, as it has no idea what the characters looked like in the source document.

I took a quick look at the documentation and didn't see anything of use for controlling entity output.
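
One possible workaround along these lines, offered as a sketch and not as something the documentation recommends: when the output encoding cannot represent a character, the serializer has to fall back to a numeric character reference. The example below assumes US-ASCII output is acceptable and, since test.dtd is not shown, declares Dagger in an internal DTD subset as a stand-in. It produces numeric references, not the named &Dagger;, and it also escapes literal non-ASCII characters that were never entities, so it does not fully meet the question's requirement.

```perl
use strict;
use warnings;
use XML::LibXML;

# Stand-in input: the entity is declared inline (assumption),
# because the real test.dtd is not part of the question.
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE test [ <!ENTITY Dagger "&#x2021;"> ]>
<test><firstName>firstname&Dagger;</firstName></test>
XML

# Parse with entities expanded, so the DOM holds the real character.
my $parser = XML::LibXML->new(expand_entities => 1);
my $doc    = $parser->parse_string($xml);

# Ask for an output encoding that cannot represent the character.
$doc->setEncoding('US-ASCII');

# For documents, toString returns bytes in the document's encoding.
my $bytes = $doc->toString;
print $bytes;
```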

Upvotes: 3
