Adam

Reputation: 2552

Unable to retrieve comment from XML due to namespace prefix issue Python

I have the following "example.xml" document, and my goal is to retrieve the comments attached to each tag in it. Thanks to this answer, I can already retrieve the comments when the document has no namespace prefixes, but with the prefixes shown here I get the errors below.

<?xml version="1.0" encoding="UTF-8"?>
<abc:root xmlns:abc="http://com/example/URL" xmlns:abcdef="http://com/another/example/URL" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <tag1>
        <tag2>
            <tag3>tag3<!-- comment = “this is the tag3.1 comment”--></tag3>
            <tag4>tag4<!-- comment = “this is the tag4.1 comment”--></tag4>
        </tag2>
    </tag1>
    <tag1>
        <tag2>
            <tag3>tag3<!-- comment = “this is the tag3.2 comment”--></tag3>
            <tag4>tag4<!-- comment = “this is the tag4.2 comment”--></tag4>
        </tag2>
    </tag1>
</abc:root>

I've tried two approaches, and both result in errors.

In the first, I iterate over every node in the document and look up the comment associated with it. The code is as follows:

from lxml import etree
import os

tree = etree.parse("example.xml")
rootXML = tree.getroot()

print(rootXML.nsmap)

for Node in tree.xpath('//*'):
    elements = tree.xpath(tree.getpath(Node), rootXML.nsmap)
    basename = os.path.basename(tree.getpath(Node))
    for tag in elements:
        comment = tag.xpath('{0}/comment()'.format(tree.getpath(Node)))
        print(tree.getpath(Node))
        print(comment)

Executing this code, however, gives me the following error:

TypeError: xpath() takes exactly 1 positional argument (2 given)

I've also tried to follow this answer and define the namespace within the xpath. In doing so, my code becomes:

from lxml import etree
import os

tree = etree.parse("example.xml")
rootXML = tree.getroot()

print(rootXML.nsmap)

for Node in tree.xpath('//*'):
    elements = tree.xpath(tree.getpath(Node), namespaces={rootXML.nsmap})
    basename = os.path.basename(tree.getpath(Node))
    for tag in elements:
        comment = tag.xpath('{0}/comment()'.format(tree.getpath(Node)))
        print(tree.getpath(Node))
        print(comment)

where the only change is replacing elements = tree.xpath(tree.getpath(Node), rootXML.nsmap) with elements = tree.xpath(tree.getpath(Node), namespaces={rootXML.nsmap}). However, this then results in the following error at the modified line.

TypeError: unhashable type: 'dict'

EDIT: added a missing closing bracket, as pointed out in one of the answers.

Upvotes: 0

Views: 213

Answers (1)

Acorn

Reputation: 50557

You are missing a closing bracket at the end of this line:

comment = tag.xpath('{0}/comment()'.format(tree.getpath(Node))


Update

Here's a working example:

from lxml import etree
import os

xml = """<?xml version="1.0" encoding="UTF-8"?>
<abc:root xmlns:abc="http://com/example/URL" xmlns:abcdef="http://com/another/example/URL" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <tag1>
        <tag2>
            <tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
            <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
        </tag2>
    </tag1>
    <tag1>
        <tag2>
            <tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
            <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
        </tag2>
    </tag1>
</abc:root>""".encode('utf-8')

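# fromstring() returns the root _Element; getpath() lives on the _ElementTree.
rootElement = etree.fromstring(xml)
rootTree = rootElement.getroottree()

# Prefix-to-URI map declared on the root element.
print(rootElement.nsmap)

for Node in rootTree.xpath('//*'):
    # The namespace map must be passed with the namespaces keyword.
    elements = rootTree.xpath(rootTree.getpath(Node), namespaces=rootElement.nsmap)
    basename = os.path.basename(rootTree.getpath(Node))
    for tag in elements:
        # Select the comment nodes that are direct children of this element.
        comment = tag.xpath('{0}/comment()'.format(rootTree.getpath(Node)), namespaces=rootElement.nsmap)
        print(rootTree.getpath(Node))
        print(comment)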

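The main issue was passing the namespaces to xpath() as a positional argument; they need to be given using the namespaces keyword argument, which is why the first attempt fails with "takes exactly 1 positional argument (2 given)". The other thing to keep straight is which object each call belongs to: getpath() is a method of _ElementTree, while nsmap is a property of _Element, so the example above gets the tree from the root element with getroottree().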
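
To make the keyword-argument point concrete, here is a minimal sketch on a shortened stand-in for example.xml (the sample XML and the variable names positional_rejected, matches and comments are illustrative, not from the question). It also uses a relative comment() expression, which is one possible way to avoid rebuilding the absolute path for every node; treat it as an assumption-laden variant rather than the required approach.

```python
from lxml import etree

# Shortened, hypothetical stand-in for example.xml.
XML = b"""<abc:root xmlns:abc="http://com/example/URL">
  <tag1>
    <tag2>
      <tag3>tag3<!-- tag3 comment --></tag3>
      <tag4>tag4<!-- tag4 comment --></tag4>
    </tag2>
  </tag1>
</abc:root>"""

root = etree.fromstring(XML)   # _Element: has .nsmap
tree = root.getroottree()      # _ElementTree: has .getpath()

# Passing the prefix map positionally is rejected with a TypeError,
# as in the first attempt in the question.
try:
    tree.xpath('/abc:root', root.nsmap)
    positional_rejected = False
except TypeError:
    positional_rejected = True

# Passing it with the namespaces keyword resolves the abc: prefix.
matches = tree.xpath('/abc:root', namespaces=root.nsmap)

# Collect the direct child comments of every element, keyed by its path.
comments = {}
for node in root.iter(tag=etree.Element):
    found = node.xpath('comment()')  # relative to node, no prefixes involved
    if found:
        comments[tree.getpath(node)] = [c.text for c in found]

for path, texts in comments.items():
    print(path, texts)
```

Because comment() is evaluated relative to each node and contains no prefixed names, that inner call does not need a namespace map at all.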

Also, in your second example you pass namespaces={rootXML.nsmap}. rootXML.nsmap is already a dictionary, so it doesn't need any curly braces around it. That syntax would not create a dictionary anyway; it creates a set, which is why Python complains that the thing you're trying to put in it (a dict) is not hashable.
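
The set-versus-dict behaviour can be reproduced without lxml. In this small sketch, nsmap is a hand-written stand-in for rootXML.nsmap, and the names error and namespaces are only for illustration.

```python
# Hand-written stand-in for rootXML.nsmap (assumed shape: prefix -> URI).
nsmap = {"abc": "http://com/example/URL"}

# Braces around a single value build a set, not a dict.
assert type({1}) is set

# A set can only hold hashable items, and a dict is not hashable,
# so wrapping the map in braces fails before xpath() is even called.
try:
    wrapped = {nsmap}
    error = None
except TypeError as exc:
    error = str(exc)

print(error)

# The map is already a dict, so it can be used as-is:
# tree.xpath(path, namespaces=nsmap)
namespaces = nsmap
```

The printed message contains "unhashable type: 'dict'", the same text as the error reported in the question.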

Upvotes: 1
