Reputation: 23630
I'm trying to parse an XML document to return the <input> nodes that contain a ref attribute. A toy example works, but the real document returns an empty list when it should show a match.
toy example
from lxml import etree
tree = etree.XML('<body><input ref="blabla"><label>Cats</label></input><input ref="blabla"><label>Dogs</label></input><input ref="blabla"><label>Birds</label></input></body>')
# I can return the relevant input nodes with:
print(len(tree.findall(".//input[@ref]")))
3
But working with the following (reduced) file for some reason fails:
example.xml
<?xml version="1.0"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<h:head>
<h:title>A title</h:title>
</h:head>
<h:body>
<group ref="blabla">
<label>Group 1</label>
<input ref="blabla">
<label>Field 1</label>
</input>
</group>
</h:body>
</h:html>
script
from lxml import etree
with open("example.xml", "r") as myfile:
    xml = myfile.read()
tree = etree.XML(xml)
print(len(tree.findall(".//input[@ref]")))
0
Any idea why this fails, and how to work around it? I think it may have something to do with the XML header. Very grateful for any assistance.
Upvotes: 0
Views: 1223
Reputation: 87984
I think the problem is that every element in your document is in a namespace, so the un-namespaced .findall(".//input[@ref]") expression doesn't match the input element in the document, which is actually a namespaced input element in the http://www.w3.org/2002/xforms namespace.
So maybe try this:
.findall(".//{http://www.w3.org/2002/xforms}input[@ref]")
Updated after my original answer, to use the xforms namespace instead of the xhtml namespace (as had been noted in another answer).
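If writing the full URI inside the path gets unwieldy, a prefix map is an alternative. The following is a minimal sketch using the standard library's xml.etree.ElementTree on a trimmed copy of the document from the question; lxml's findall accepts the same namespaces argument. The prefix name "xf" is an arbitrary choice made for this example, and only the URI has to match the document.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of example.xml from the question.
doc = """<?xml version="1.0"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:h="http://www.w3.org/1999/xhtml">
  <h:body>
    <group ref="blabla">
      <label>Group 1</label>
      <input ref="blabla">
        <label>Field 1</label>
      </input>
    </group>
  </h:body>
</h:html>"""

root = ET.fromstring(doc)

# "xf" is a hypothetical prefix picked for this sketch; it does not need to
# appear in the document. Only the namespace URI is compared.
ns = {"xf": "http://www.w3.org/2002/xforms"}

matches = root.findall(".//xf:input[@ref]", ns)
print(len(matches))  # 1
```

The prefix in the query is resolved through the dictionary, so the expression is equivalent to the one with the URI in curly braces.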
Upvotes: 2
Reputation: 90859
As can be seen from your XML, the namespace for non-prefixed elements is "http://www.w3.org/2002/xforms". This is because it is declared as the default xmlns (without any prefix) on the root element h:html. Only the elements prefixed with h: are in the "http://www.w3.org/1999/xhtml" namespace.
So you need to use that namespace in your query as well. Example -
root.findall(".//{http://www.w3.org/2002/xforms}input[@ref]")
Example/Demo -
>>> s = """<?xml version="1.0"?>
... <h:html xmlns="http://www.w3.org/2002/xforms" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
... <h:head>
... <h:title>A title</h:title>
... </h:head>
... <h:body>
... <group ref="blabla">
... <label>Group 1</label>
... <input ref="blabla">
... <label>Field 1</label>
... </input>
... </group>
... </h:body>
... </h:html>"""
>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring(s)
>>> root.findall(".//{http://www.w3.org/1999/xhtml}input[@ref]")
[]
>>> root.findall(".//{http://www.w3.org/2002/xforms}input[@ref]")
[<Element '{http://www.w3.org/2002/xforms}input' at 0x02288EA0>]
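When it is unclear which namespace an element ended up in, printing the tags the parser produced is a quick check. This is a small sketch on a trimmed copy of the same document (the labels are left out for brevity); each tag comes back in {uri}localname form, which is the form the query has to use.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question, labels omitted.
doc = """<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:h="http://www.w3.org/1999/xhtml">
  <h:body>
    <group ref="blabla">
      <input ref="blabla"/>
    </group>
  </h:body>
</h:html>"""

root = ET.fromstring(doc)

# Collect every element's fully qualified tag, in document order.
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)

# {http://www.w3.org/1999/xhtml}html
# {http://www.w3.org/1999/xhtml}body
# {http://www.w3.org/2002/xforms}group
# {http://www.w3.org/2002/xforms}input
```

The h: elements carry the xhtml URI, while group and input inherit the default xforms namespace from the root.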
Upvotes: 2