Robert Weindl

Reputation: 1092

Extract value with XPath, etree and Python

I am trying to extract a value with XPath, Python and etree. I have no control over the .xml file I receive, and I suspect it may be invalid.

My method already extracts the text node object I want to examine.

# This is the tag.
textTag = lastExportTree.xpath("//TEXT_NODE[@PROPERTY = '%s']/TEXT[@ID = '%s']" % (key, id[1]))

# This is a part of the xml. I already have the text node I want to examine.
<TEXT ID="1001" STATE="5" LOCKED="false"><SYSTEMMESSAGE>CALBUY</SYSTEMMESSAGE>Hiho</TEXT>
<TEXT ID="1002" STATE="1" LOCKED="false"/>
<TEXT ID="1003" STATE="5" LOCKED="false">Stack</TEXT>
<TEXT ID="1004" STATE="1" LOCKED="false">Overflow</TEXT>

If I want to access the content of ID="1003" I only have to type:

print(textTag.text)  # Will print 'Stack'

But the tag with ID="1001" also contains the SYSTEMMESSAGE tag. How can I access the content 'Hiho'? (textTag.text won't work!) Is the XML I receive invalid?

Thank you a lot for your answer!

Upvotes: 0

Views: 2434

Answers (2)

Dan Lecocq

Reputation: 3493

I've encountered this problem before as well, and this is what we ended up with. In our case we were interested in finding the text in all the non-script and non-style children of an element.

from lxml import etree

# Pre-compile the XPath. It collects the text nodes of this element and of
# every descendant element that isn't 'script' or 'style'.
textXpath = etree.XPath(
    '(.|.//*[not(name()="script")][not(name()="style")])/text()')

# If instead you don't want to include the current element:
# textXpath = etree.XPath(
#   './/*[not(name()="script")][not(name()="style")]/text()')

# textTag is the element found in the question.
results = ''.join(textXpath(textTag))

It might not be the prettiest chunk of code, but it's what we've resorted to.
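For comparison, here is a minimal sketch of the same "gather all the text" idea without XPath. It is not the code from this answer: it assumes the standard library's xml.etree.ElementTree (lxml elements offer itertext() as well) and it has no script/style filter. It mainly shows that collecting text from descendants also picks up the SYSTEMMESSAGE content, which may or may not be what you want.

```python
import xml.etree.ElementTree as ET

# The ID="1001" element from the question, parsed on its own.
elem = ET.fromstring(
    '<TEXT ID="1001" STATE="5" LOCKED="false">'
    '<SYSTEMMESSAGE>CALBUY</SYSTEMMESSAGE>Hiho</TEXT>'
)

# itertext() walks the element and all of its descendants in document order,
# so the SYSTEMMESSAGE text comes first, followed by the trailing 'Hiho'.
all_text = "".join(elem.itertext())
print(all_text)  # CALBUYHiho
```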

Upvotes: 1

tdelaney

Reputation: 77347

Assuming you are showing us the nodes under lastExportTree, this should do it:

lastExportTree.xpath('TEXT[@STATE="5" and @LOCKED="false" and SYSTEMMESSAGE]/text()')[0]

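That selects the text nodes directly inside every child element named TEXT that has the given STATE and LOCKED attributes and a SYSTEMMESSAGE child element. The [0] then takes the first match, which is 'Hiho' for the sample in the question.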
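As for why textTag.text comes back empty: the XML is well-formed, it just uses mixed content. In the etree data model, text that follows a child element is stored on that child's .tail, not on the parent's .text. Below is a small sketch of reading it directly. It assumes the standard library's xml.etree.ElementTree so it runs without extra packages (lxml.etree exposes the same .text and .tail attributes); the TEXT_NODE wrapper with PROPERTY="demo" and the own_text helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample rows copied from the question, wrapped in a hypothetical parent element.
SAMPLE = """
<TEXT_NODE PROPERTY="demo">
  <TEXT ID="1001" STATE="5" LOCKED="false"><SYSTEMMESSAGE>CALBUY</SYSTEMMESSAGE>Hiho</TEXT>
  <TEXT ID="1002" STATE="1" LOCKED="false"/>
  <TEXT ID="1003" STATE="5" LOCKED="false">Stack</TEXT>
  <TEXT ID="1004" STATE="1" LOCKED="false">Overflow</TEXT>
</TEXT_NODE>
"""


def own_text(elem):
    """Return the text directly inside elem, ignoring text inside its children."""
    parts = [elem.text or ""]
    # Text that follows a child element lives on that child's .tail.
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts)


root = ET.fromstring(SAMPLE)
by_id = {t.get("ID"): t for t in root.findall("TEXT")}

print(by_id["1001"].text)       # None: nothing precedes SYSTEMMESSAGE
print(by_id["1001"][0].tail)    # Hiho
print(own_text(by_id["1001"]))  # Hiho
print(own_text(by_id["1003"]))  # Stack
```

The helper also copes with the empty ID="1002" element, for which it returns an empty string.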

Upvotes: 0
