Adam

Reputation: 2552

Retrieve comment from specific XML node in Python

I have the following "example.xml" file:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <tag1>
    <tag2>tag2<!-- comment = “this is the tag1 comment”--></tag2>
    <tag3>
      <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
    </tag3>
  </tag1>
</root>

I'd like to retrieve the comment that belongs to a specific node. For now, I'm only able to retrieve all comments from the file, using the following:

from lxml import etree

tree = etree.parse("example.xml")
comments = tree.xpath('//comment()')
print(comments)

As expected, this returns all the above comments from the file in a list:

[<!-- comment = \u201cthis is the tag1 comment\u201d-->, <!-- comment = \u201cthis is the tag4 comment\u201d-->]

However, how and where do I explicitly specify the node whose comment I want to retrieve? For example, how can I specify tag2 so that only <!-- comment = \u201cthis is the tag1 comment\u201d--> is returned?

EDIT

I have a use case where I need to iterate over each node of the XML file. When the iterator reaches a node that has more than one child with a comment, my query returns the comments of all of its children. For example, consider the following "example2.xml" file:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <tag1>
    <tag2>
      <tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
      <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
      <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
    </tag2>
  </tag1>
</root>

If I follow the same steps as above, then when the loop reaches tag1/tag2, the query returns the comments of both tag3 and tag4.

For example:

from lxml import etree

tree = etree.parse("example2.xml")
comments = tree.xpath('tag1[1]/tag2//comment()')
print(comments)

returns

[<!-- comment = \u201cthis is the tag3 comment\u201d-->, <!-- comment = \u201cthis is the tag4 comment\u201d-->]

My two questions are therefore:

  1. How can I just return the comment of the direct node rather than including any of its children?
  2. As the result is returned in the form of a list, how can I retrieve the value/text of the comment from said list?

Upvotes: 1

Views: 1005

Answers (3)

Maurice Meyer

Reputation: 18116

You need to specify the node:

from lxml import etree

tree = etree.parse("example.xml")
# Child axis: only comments that sit directly inside a <tag2> element
comments = tree.xpath('//tag2/comment()')
print(comments)

Output:

[<!-- comment = “this is the tag1 comment”-->]

Edit:

For your nested structure, you need to iterate over the repeating tags:

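# Parse the nested file from the question's edit
tree = etree.parse("example2.xml")
tag2Elements = tree.xpath('//tag1/tag2')
for t2 in tag2Elements:
    # Relative path: only comments directly inside this tag2's <tag3> child
    t3Comment = t2.xpath('tag3/comment()')
    print(t2, t3Comment)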

Output:

<Element tag2 at 0x1066b69b0> [<!-- comment = “this is the tag3 comment”-->]
<Element tag2 at 0x1066b6960> [<!-- comment = “this is the tag3 comment”-->]
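
To cover both follow-up questions at once, here is a minimal sketch, assuming the first branch of example2.xml is pasted inline as a string rather than read from disk. The helper name direct_comments is made up for illustration. It relies on the relative XPath comment(), which only looks at direct children, and on the .text attribute that lxml comment nodes expose.

```python
from lxml import etree

# Inline copy of the first <tag1> branch of example2.xml, so the sketch runs on its own.
XML = """<root>
  <tag1>
    <tag2>
      <tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
      <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
    </tag2>
  </tag1>
</root>"""

root = etree.fromstring(XML)


def direct_comments(element):
    """Return the text of comments that are direct children of `element`."""
    # 'comment()' uses the child axis, so it looks one level down only;
    # './/comment()' would also pick up comments of descendants.
    return [c.text.strip() for c in element.xpath("comment()")]


# Passing etree.Element restricts the iteration to elements (no comment nodes).
for element in root.iter(etree.Element):
    print(element.tag, direct_comments(element))
```

With this approach tag2 should yield an empty list, because its comments live one level further down inside tag3 and tag4, while tag3 and tag4 each yield their own comment text.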

Upvotes: 1

larsks

Reputation: 311750

You can get the first comment like this:

>>> from lxml import etree
>>> with open('data.xml') as fd:
...  doc = etree.parse(fd)
...
>>> doc.xpath('/root/tag1/tag2/comment()')
[<!-- comment = “this is the tag1 comment”-->]

And for the last comment:

>>> doc.xpath('/root/tag1/tag3/tag4/comment()')
[<!-- comment = “this is the tag4 comment”-->]

...and of course you can use //tag2 or //tag4 if those elements are unique and you don't want to use the full path.

Upvotes: 1

Rúben

Reputation: 435

Change your XPath expression to //tag2/comment().

With //comment() alone, the expression matches comments under any element, not just tag2.
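
A small side-by-side sketch of the two expressions, assuming the content of example.xml from the question is inlined as a string (the variable names are only illustrative):

```python
from lxml import etree

# Inline copy of example.xml from the question.
XML = """<root>
  <tag1>
    <tag2>tag2<!-- comment = “this is the tag1 comment”--></tag2>
    <tag3>
      <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
    </tag3>
  </tag1>
</root>"""

root = etree.fromstring(XML)

# Every comment anywhere in the document
all_comments = root.xpath("//comment()")
# Only comments that are direct children of a <tag2> element
tag2_comments = root.xpath("//tag2/comment()")

print(len(all_comments), len(tag2_comments))
# .text holds the comment body without the <!-- --> delimiters
print(tag2_comments[0].text.strip())
```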

Upvotes: 1
