Extract value with XPath, etree and Python

Question

I'm trying to extract a value using XPath with Python and etree. I have no control over the .xml file I receive, and I suspect it may be invalid.

My method already extracts the text node object I want to examine.

# This is the tag. xpath() returns a list of matches, so take the first one.
textTag = lastExportTree.xpath("//TEXT_NODE[@PROPERTY = '%s']/TEXT[@ID = '%s']" % (key, id[1]))[0]

# This is part of the XML. I already have the text node I want to examine.
CALBUYHiho

Stack
Overflow

To access the content of the tag with ID="1003", I only have to type:

print(textTag.text)  # Will print 'Stack'

But the tag with ID="1001" also contains the SYSTEMMESSAGE tag. How can I access the content 'Hiho'? (textTag.text won't work!) Is the XML I receive invalid?

Thanks a lot for your answer!

Dan Lecocq · Accepted Answer

I've encountered this problem before as well, and this is what we ended up with. In our case we were interested in finding the text in all the non-script and non-style children of an element.

from lxml import etree

# Pre-compile the XPath. It collects the text nodes of this element and of
# every descendant element that isn't 'script' or 'style'.
textXpath = etree.XPath(
    '(.|.//*[not(name()="script")][not(name()="style")])/text()')

# If you don't want to include the current element's own text:
# textXpath = etree.XPath(
#   './/*[not(name()="script")][not(name()="style")]/text()')

# textTag must be a single element, not the list that .xpath() returns.
results = ''.join(textXpath(textTag))

It might not be the prettiest chunk of code, but it's what we've resorted to.
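
For the specific case in the question, it may help to see why textTag.text comes back empty. In the ElementTree model, text that follows a child's closing tag is stored on that child as .tail, not on the parent's .text. Mixed content like this is well-formed XML. The sketch below is only an illustration under assumptions: the sample document is a hypothetical reconstruction (the PROPERTY value, the ID of the 'Overflow' tag, and the TYPE and CODE tag names are made up), own_text is a helper name invented here, and it uses the standard library's xml.etree.ElementTree as a stand-in. Swapping the import for `from lxml import etree` should behave the same for the calls used.

```python
# Assumption: stdlib stand-in; lxml.etree offers the same calls used below.
import xml.etree.ElementTree as etree

# Hypothetical reconstruction of the sample. PROPERTY="demo", ID="1004",
# TYPE and CODE are invented for illustration only.
SAMPLE = """
<TEXT_NODE PROPERTY="demo">
  <TEXT ID="1001"><SYSTEMMESSAGE><TYPE>CAL</TYPE><CODE>BUY</CODE></SYSTEMMESSAGE>Hiho</TEXT>
  <TEXT ID="1003">Stack</TEXT>
  <TEXT ID="1004">Overflow</TEXT>
</TEXT_NODE>
"""

root = etree.fromstring(SAMPLE)


def own_text(element):
    """Return the text directly inside `element`, skipping text nested in children."""
    parts = [element.text or ""]
    # Text after a child's closing tag lives on that child as .tail.
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()


plain = root.find("TEXT[@ID='1003']")
mixed = root.find("TEXT[@ID='1001']")

print(plain.text)                 # Stack
print(mixed.text)                 # None: the element starts with a child tag
print(mixed[0].tail)              # Hiho
print(own_text(mixed))            # Hiho
print("".join(mixed.itertext()))  # CALBUYHiho (all nested text, in order)
```

With lxml, evaluating the relative XPath "text()" on textTag should return the same direct text nodes without a helper, whereas the pre-compiled expression above also pulls in the text nested inside SYSTEMMESSAGE.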
