grabo44

Reputation: 33

lxml XPath: how to deal with XML entities

I use lxml (Python 3.7.1) to parse an XML document containing XML entities. I can't work out the right syntax to query an element whose attribute value contains an XML entity (&quot;, &apos;, etc.).
See the following code:

from lxml import etree

root = etree.XML('''
<root>
    <item name="abcd">
        <ssitem att="efg"/>
    </item>
    <item name="hi&apos;jk">
        <ssitem att="lmn"/>
    </item>
</root>
''')

item = root.xpath(".//item[@name='abcd']") # 1
# item = root.xpath(".//item[@name='hi&apos;jk']") # 2
# item = root.xpath(".//item[@name='hi'jk']") # 3
# item = root.xpath('.//item[@name="hi''jk"]') # 4
if len(item) != 0:
    print(len(item))
    print(item)
    name = item[0].xpath(".//@name")
    print(name)
else:
    print("Nothing")  

When line 1 is uncommented, the code works fine.

When line 2 (or 3, or 4) is uncommented instead (with the other lines commented out), nothing is found.

Why is this the case?

Thanks.

Upvotes: 1

Views: 143

Answers (2)

Parfait

Reputation: 107587

Consider escaping the apostrophe with a backslash in the Python string, as a variant of option #4:

item = root.xpath('.//item[@name="hi\'jk"]') # 4
item

# [<Element item at 0xbe25608>]
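As a side note on why the original option #4 finds nothing: Python joins adjacent string literals, so the doubled `''` ends one literal and starts another instead of producing an apostrophe. A minimal sketch that only prints the two query strings (the variable names `original` and `escaped` are just for illustration):

```python
# Option #4 as written in the question: two adjacent string literals,
# './/item[@name="hi' and 'jk"]', which Python concatenates.
original = './/item[@name="hi''jk"]'

# Backslash-escaped variant: a single literal containing an apostrophe.
escaped = './/item[@name="hi\'jk"]'

print(original)  # .//item[@name="hijk"]
print(escaped)   # .//item[@name="hi'jk"]
```

So the original query was looking for name="hijk", which does not exist in the document.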

Upvotes: 0

willeM_ Van Onsem

Reputation: 476604

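Here &apos; is only part of how the apostrophe is encoded in the XML file. Once the document is parsed, the attribute value is hi'jk, and XPath does not expand XML entities.

In the XPath query, you should therefore use a literal apostrophe: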

>>> root.xpath(""".//item[@name="hi'jk"]""")
[<Element item at 0x7f91b2b9ae88>]
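If the quoting gets awkward, one possible alternative is to pass the value as an XPath variable, which lxml's `xpath()` accepts as a keyword argument. The sketch below uses a trimmed copy of the document from the question; the names `wanted`, `names` and `matches` are made up for the example.

```python
from lxml import etree

root = etree.XML('''
<root>
    <item name="abcd"/>
    <item name="hi&apos;jk"/>
</root>
''')

# The parser has already decoded &apos;, so the stored attribute
# value contains a plain apostrophe.
names = [item.get("name") for item in root.iter("item")]
print(names)

# $name is an XPath variable; its value is supplied separately,
# so no quoting is needed inside the expression itself.
wanted = "hi'jk"
matches = root.xpath(".//item[@name=$name]", name=wanted)
print(len(matches))
```

This also covers values that contain both single and double quotes, which cannot be written as a single XPath 1.0 string literal.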

Upvotes: 1
