dgate

Reputation: 61

XML::LibXML findnodes() does not return results when xmlns is present

I'm using XML::LibXML::Reader to parse a large document and have run into an issue where the xmlns attribute causes findnodes() to return nothing. I worked around it by adding a regex that removes the xmlns attribute, but I was wondering whether there is a more elegant solution that doesn't involve regexes. If you remove the regex line ($xml =~ s{xmlns...) you'll see that say "Loc = $loc" produces no results.

Here's the code:

use strict;
use warnings;
use feature qw( say );
use XML::LibXML::Reader qw( XML_READER_TYPE_ELEMENT );

my $xml = <<'__EOI__';
<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <loc>http://example.com</loc>
    <lastmod>2018-10-19</lastmod>
</url>
__EOI__


$xml =~ s{xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"}{};

my $reader = XML::LibXML::Reader->new( string => $xml);
while ( $reader->read ) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    next unless $reader->name eq 'url';
    my $xml = $reader->readOuterXml;
    my $doc = XML::LibXML->load_xml(string => $xml);
    say "Doc = $doc";
    my ($loc) = $doc->findnodes('//loc');
    say "Loc = $loc";
}

Upvotes: 3

Views: 1287

Answers (2)

Grant McLean

Reputation: 6998

Your code starts by using the XML::LibXML::Reader API and then later uses XML::LibXML->load_xml to create a DOM from part of the document. The XML::LibXML::Reader API is usually only used with huge XML documents that would consume large amounts of memory when loaded as a DOM. If your XML document is not huge, then it's much simpler to use an approach like ikegami's answer which just uses the DOM API to load the whole document and then query it with XPath.

However, if you really do have a huge XML document then you might be interested in solving the problem using the Reader API:

my $sitemap_uri = 'http://www.sitemaps.org/schemas/sitemap/0.9';
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs(sm => $sitemap_uri);

my $reader = XML::LibXML::Reader->new(location => './sitemap.xml');
while ($reader->read) {
    $reader->nextElement('url', $sitemap_uri) or last;
    my $doc = $reader->copyCurrentNode(1);
    say "Doc = $doc";
    my ($loc) = $xpc->findnodes('//sm:loc', $doc);
    say "Loc = $loc";
}

The call to $reader->nextElement is a quick way to skip forward to the next occurrence of a particular element. In this example I matched on both the element name and its namespace.

The call to $reader->copyCurrentNode(1) is a convenience method that returns that node and all its child nodes as a DOM fragment. You'll need to use XML::LibXML::XPathContext to query that DOM using namespace-aware XPath statements.
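For comparison, here is a minimal sketch that keeps the question's own loop (readOuterXml plus load_xml) and changes only the query, so the regex is no longer needed. It assumes the sample document from the question; the prefix sm and the @locs array are illustrative choices, not part of either API.

```perl
use strict;
use warnings;
use feature qw( say );
use XML::LibXML;
use XML::LibXML::Reader qw( XML_READER_TYPE_ELEMENT );

my $sitemap_uri = 'http://www.sitemaps.org/schemas/sitemap/0.9';

# Sample document from the question, with the xmlns attribute left in place.
my $xml = <<'__EOI__';
<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <loc>http://example.com</loc>
    <lastmod>2018-10-19</lastmod>
</url>
__EOI__

# The prefix "sm" is arbitrary; only the URI has to match the document.
my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( sm => $sitemap_uri );

my @locs;    # illustrative: collects the text of each <loc> found
my $reader = XML::LibXML::Reader->new( string => $xml );
while ( $reader->read ) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    next unless $reader->name eq 'url';

    # Parse just this <url> element into its own small DOM.
    my $doc = XML::LibXML->load_xml( string => $reader->readOuterXml );

    # Query with the namespace-aware context instead of $doc->findnodes.
    my ($loc) = $xpc->findnodes( '//sm:loc', $doc );
    push @locs, $loc->textContent if $loc;
}

say "Loc = $_" for @locs;
```

This still serializes and re-parses each url element, so copyCurrentNode is likely the cheaper route for very large files.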

My XML::LibXML tutorial includes coverage of working with XML namespaces as well as working with large documents.

Upvotes: 2

ikegami

Reputation: 385496

The XPath //loc asks for nodes named loc in the null namespace. There are no such nodes in the document, because the xmlns attribute puts every element in the sitemap namespace, so findnodes correctly returns nothing.

You want to find the nodes with namespace http://www.sitemaps.org/schemas/sitemap/0.9 and with name loc. You can use the following to achieve that:

my $doc = XML::LibXML->load_xml( string => $xml );

my $xpc = XML::LibXML::XPathContext->new();
$xpc->registerNs( sm => 'http://www.sitemaps.org/schemas/sitemap/0.9' );

my ($loc) = $xpc->findnodes('//sm:loc', $doc);
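As a self-contained illustration of the point above, the following sketch runs both queries against the sample document from the question. The prefix sm is an arbitrary choice and @plain is an illustrative variable name; passing $doc to the XPathContext constructor simply sets the default context node.

```perl
use strict;
use warnings;
use feature qw( say );
use XML::LibXML;

# Sample document from the question, namespace declaration untouched.
my $xml = <<'__EOI__';
<url xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <loc>http://example.com</loc>
    <lastmod>2018-10-19</lastmod>
</url>
__EOI__

my $doc = XML::LibXML->load_xml( string => $xml );

# An unprefixed name in XPath 1.0 matches only elements in no namespace,
# so this list comes back empty.
my @plain = $doc->findnodes('//loc');

# Bind a prefix to the sitemap namespace and use it in the query.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( sm => 'http://www.sitemaps.org/schemas/sitemap/0.9' );
my ($loc) = $xpc->findnodes('//sm:loc');

say "Unprefixed matches: ", scalar @plain;
say "Loc text: ", $loc->textContent;
```

The prefix used in the XPath expression does not have to match anything in the document; only the namespace URI is compared.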

Upvotes: 5
