Zuriar
Zuriar

Reputation: 11734

Parsing XML with XPath in Python 3

I have the following xml:

<document>
  <internal-code code="201">
    <internal-desc>Biscuits Wrapped</internal-desc>
    <top-grouping>Finished</top-grouping>
    <web-category>Biscuits</web-category>
    <web-sub-category>Biscuits (Wrapped)</web-sub-category>
  </internal-code>
  <internal-code code="202">
    <internal-desc>Biscuits Sweet</internal-desc>
    <top-grouping>Finished</top-grouping>
    <web-category>Biscuits</web-category>
    <web-sub-category>Biscuits (Sweets)</web-sub-category>
  </internal-code>
  <internal-code code="221">
    <internal-desc>Biscuits Savoury</internal-desc>
    <top-grouping>Finished</top-grouping>
    <web-category>Biscuits</web-category>
    <web-sub-category>Biscuits For Cheese</web-sub-category>
  </internal-code>
  ....
</document>

I have loaded it into a tree using this code:

try:
  groups = etree.parse(PRODUCT_GROUPS_XML_FILEPATH)
  root = groups.getroot()
  internalGroup = root.findall("./internal-code")
  LOG.append("[INFO] product groupings file loaded and parsed ok")
except Exception as e:
  LOG.append("[ERROR] PRODUCT GROUPINGS XML FILE ACCESS PROBLEM")
  LOG.append("[***TERMINATED***]")
  writelog()
  exit()

I would like to use XPath to find the correct and then be able to access the child nodes of that group. So if I am searching for internal-code 221 and want web-category I would do something like:

internalGroup.find("internal-code", 221).get("web-category").text

I am not experienced with XML and Python and I have been staring at this for ages. All help very gratefully received. Thanks

Upvotes: 7

Views: 12960

Answers (2)

Steven
Steven

Reputation: 28666

While I'm a big fan of lxml (see falsetru's answer), which you would need for full xpath support, the standard library's elementtree implementation does support enough to get what you need:

root.findtext('.//internal-code[@code="221]/web-category')

This returns the text property of the first matching element, which is enough if you are sure that code 221 will only occur once. If there could be more and you need a list:

[i.text for i in root.findall('.//internal-code[@code="221"]/web-category')]

(note that these examples would also work in lxml)

Upvotes: 2

falsetru
falsetru

Reputation: 369074

According to xml.etree.ElementTree documentation:

XPath support

This module provides limited support for XPath expressions for locating elements in a tree. The goal is to support a small subset of the abbreviated syntax; a full XPath engine is outside the scope of the module.

Use lxml:

>>> import lxml.etree as ET
>>>
>>> s = '''
... <document>
...   <internal-code code="201">
...     <internal-desc>Biscuits Wrapped</internal-desc>
...     <top-grouping>Finished</top-grouping>
...     <web-category>Biscuits</web-category>
...     <web-sub-category>Biscuits (Wrapped)</web-sub-category>
...   </internal-code>
...   <internal-code code="202">
...     <internal-desc>Biscuits Sweet</internal-desc>
...     <top-grouping>Finished</top-grouping>
...     <web-category>Biscuits</web-category>
...     <web-sub-category>Biscuits (Sweets)</web-sub-category>
...   </internal-code>
...   <internal-code code="221">
...     <internal-desc>Biscuits Savoury</internal-desc>
...     <top-grouping>Finished</top-grouping>
...     <web-category>Biscuits</web-category>
...     <web-sub-category>Biscuits For Cheese</web-sub-category>
...   </internal-code>
... </document>
... '''
>>>
>>> root = ET.fromstring(s)
>>> for text in root.xpath('.//internal-code[@code="221"]/web-category/text()'):
...     print(text)
...
Biscuits

Upvotes: 2

Related Questions