Reputation: 14726
I have multiple XML documents (like the following) in which an optional tag appears. This tag is in the namespace mynamespace:
xml = %{<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:mynamespace="http://example.com/ns/1.0">
<channel>
<item>
<title>bar</title>
<mynamespace:custom_tag>some text</mynamespace:custom_tag>
</item>
<item>
<title>foo</title>
</item>
</channel>
</rss>}
Nokogiri::XML::Reader(xml).each do |node|
  next if node.name != 'item' || node.node_type != Nokogiri::XML::Reader::TYPE_ELEMENT
  node = Nokogiri::XML.parse(node.outer_xml)
  puts "-> node"
  puts node.namespaces
  puts node.xpath("//mynamespace:custom_tag").text
end
When Nokogiri::XML::Reader(xml) iterates over every <item>, the first run outputs some text. But when the second item, which doesn't contain an element in my mynamespace namespace, is parsed, it throws an error.
The output is:
-> node
{"xmlns:mynamespace"=>"http://example.com/ns/1.0"}
some text
-> node
{}
Nokogiri::XML::XPath::SyntaxError: Undefined namespace prefix: //mynamespace:custom_tag
- Why does Nokogiri include the namespace in the first item but not in the second item? Only because the first uses the namespace, and the second doesn't?
- What would be a workaround to search for tags with namespaces, even when this namespace doesn't occur in the current node?
Upvotes: 1
Views: 428
Reputation: 106027
- Why does Nokogiri include the namespace in the first item but not in the second item? Only because the first uses the namespace, and the second doesn't?
To understand the difference, look at what node.outer_xml returns for the first <item>:
<item xmlns:mynamespace="http://example.com/ns/1.0">
<title>bar</title>
<mynamespace:custom_tag>some text</mynamespace:custom_tag>
</item>
...versus the second:
<item>
<title>foo</title>
</item>
You'll notice that in the first case outer_xml isn't identical to the input XML: Nokogiri helpfully adds to the <item> element the namespace declarations that its child elements need. In the second case none of the elements uses a namespace, so no declaration is added. The document you parse from that string therefore has no binding for the mynamespace prefix, and the XPath query fails.
- What would be a workaround to search for tags with namespaces, even when this namespace doesn't occur in the current node?
A simple solution would be to use a conditional to skip elements that don't include the namespace:
Nokogiri::XML::Reader(xml).each do |node|
  next unless node.name == 'item' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
  item_doc = Nokogiri::XML.parse(node.outer_xml)
  puts "-> node"
  unless item_doc.namespaces.key?("xmlns:mynamespace")
    puts "Does not include namespace; skipping"
    next
  end
  puts item_doc.xpath("//mynamespace:custom_tag").text
end
# => -> node
# some text
# -> node
# Does not include namespace; skipping
You'll notice that I also renamed the variable node inside the block to item_doc. Nokogiri::XML.parse returns a Nokogiri::XML::Document, not a Node, so reusing the name node was pretty confusing.
A simpler solution would be to use Nokogiri's in-memory parser instead of XML::Reader:
doc = Nokogiri::XML(xml)
doc.xpath("//rss/channel/item/mynamespace:custom_tag").each do |node|
puts node.text
end
# => some text
You may be using XML::Reader because the XML document is large, but unless you're experiencing actual memory or performance problems I recommend using this approach instead.
Upvotes: 1