Reputation: 33
I use lxml (Python 3.7.1) to parse an XML document containing XML entities.
I can't manage to get the right syntax to query an element containing XML entities (&quot;, &apos;, etc.).
See the following code:
from lxml import etree
root = etree.XML('''
<root>
<item name="abcd">
<ssitem att="efg"/>
</item>
<item name="hi&apos;jk">
<ssitem att="lmn"/>
</item>
</root>
''')
item = root.xpath(".//item[@name='abcd']") # 1
# item = root.xpath(".//item[@name='hi'jk']") # 2
# item = root.xpath(".//item[@name='hi'jk']") # 3
# item = root.xpath('.//item[@name="hi''jk"]') # 4
if len(item) != 0:
    print(len(item))
    print(item)
    name = item[0].xpath(".//@name")
    print(name)
else:
    print("Nothing")
When line 1 is uncommented, the code works fine.
When line 2 (or 3, or 4) is uncommented (and other lines are commented), nothing is found.
Why is this the case?
Thanks.
Upvotes: 1
Views: 143
Reputation: 107587
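Consider your option #4, but escape the apostrophe with a backslash instead of doubling it. The backslash is consumed by Python, so the XPath expression ends up with a double-quoted literal containing a plain apostrophe: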
item = root.xpath('.//item[@name="hi\'jk"]') # 4
item
# [<Element item at 0xbe25608>]
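To make the point above concrete, here is a minimal sketch showing that the backslash is only Python string syntax and never reaches the XPath engine. The trimmed-down document and the variable names (escaped, triple, same_expression) are made up for illustration; they are not from the question.

```python
from lxml import etree

# Trimmed-down version of the document from the question (illustrative).
root = etree.XML('<root><item name="abcd"/><item name="hi&apos;jk"/></root>')

# Two Python spellings of the same XPath expression.
escaped = './/item[@name="hi\'jk"]'      # backslash-escaped apostrophe
triple = """.//item[@name="hi'jk"]"""    # triple-quoted, no escaping needed

# Python removes the backslash, so both strings are identical.
same_expression = escaped == triple

matches = root.xpath(escaped)
print(same_expression, len(matches))
```

Either spelling works; which one to use is a matter of which Python quoting style reads best.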
Upvotes: 0
Reputation: 476604
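Here &apos; is only how the XML file encodes the apostrophe. The parser decodes it, so the attribute value in the tree is hi'jk, and XPath itself knows nothing about XML entities.
In the XPath query, you should use the literal character: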
>>> root.xpath(""".//item[@name="hi'jk"]""")
[<Element item at 0x7f91b2b9ae88>]
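As a rough sketch of the same idea, the snippet below checks what the parser stores for each attribute and then passes the value as an XPath variable, which lxml accepts as a keyword argument to xpath(). That avoids quoting problems when a value contains both quote characters. The helper find_by_name and the third item are hypothetical additions for illustration, not part of the question.

```python
from lxml import etree

root = etree.XML('''
<root>
  <item name="abcd"/>
  <item name="hi&apos;jk"/>
  <item name="say &quot;hi&apos;jk&quot;"/>
</root>
''')

# The parser has already decoded the entities into plain characters.
names = [item.get("name") for item in root.iter("item")]

def find_by_name(node, wanted):
    """Hypothetical helper: match @name against an XPath variable."""
    # $wanted is bound from the keyword argument, so no quoting is needed.
    return node.xpath(".//item[@name=$wanted]", wanted=wanted)

apostrophe_hits = find_by_name(root, "hi'jk")
mixed_hits = find_by_name(root, 'say "hi\'jk"')
# Searching for the entity text itself finds nothing: it is not in the tree.
entity_hits = find_by_name(root, "hi&apos;jk")

print(names)
print(len(apostrophe_hits), len(mixed_hits), len(entity_hits))
```

The variable form is mainly useful when the value comes from user input or contains both ' and ", since an XPath 1.0 string literal has no escape mechanism for its own delimiter.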
Upvotes: 1