mmmdreg

Reputation: 6618

Iterating multiple (parent,child) nodes using Python ElementTree

The standard implementation of ElementTree for Python (2.6) does not provide pointers to parents from child nodes. Therefore, if parents are needed, it is suggested to loop over parents rather than children.

Suppose my XML has the form:

<Content>
  <Para>first</Para>
  <Table><Para>second</Para></Table>
  <Para>third</Para>
</Content>

The following finds all "Para" nodes without considering parents:

(1) paras = [p for p in page.getiterator("Para")]

This (adapted from effbot) keeps the parent by looping over parent nodes rather than child nodes:

(2) paras = [(c,p) for p in page.getiterator() for c in p]

This makes perfect sense, and can be extended with a conditional to achieve the (supposedly) same result as (1), but with parent info added:

(3) paras = [(c,p) for p in page.getiterator() for c in p if c.tag == "Para"]

The ElementTree documentation suggests that the getiterator() method does a depth-first search. Running it without looking for the parent (1) yields:

first
second
third

However, extracting the text from paras in (3) yields:

first, Content>Para
third, Content>Para
second, Table>Para

This appears to be breadth-first.

This therefore raises two questions.

  1. Is this correct and expected behaviour?
  2. How do you extract (parent, child) tuples, in document order, when the child must be of a certain type but the parent can be anything? I do not think running two loops and mapping the (parent, child) pairs generated by (3) onto the order generated by (1) is ideal.

Upvotes: 4

Views: 6629

Answers (1)

John Machin

Reputation: 83032

Consider this:

>>> xml = """<Content>
...   <Para>first</Para>
...   <Table><Para>second</Para></Table>
...   <Para>third</Para>
... </Content>"""
>>> import xml.etree.cElementTree as et
>>> page = et.fromstring(xml)
>>> for p in page.getiterator():
...     print "ppp", p.tag, repr(p.text)
...     for c in p:
...         print "ccc", c.tag, repr(c.text), p.tag
...
ppp Content '\n  '
ccc Para 'first' Content
ccc Table None Content
ccc Para 'third' Content
ppp Para 'first'
ppp Table None
ccc Para 'second' Table
ppp Para 'second'
ppp Para 'third'
>>> 

Aside: list comprehensions are magnificent until you want to see exactly what is being iterated over :-)

getiterator is producing the "ppp" elements in the advertised order. However, you are plucking your elements of interest out of the subsidiary "ccc" elements, and all of a parent's children are emitted together when that parent is visited, so they are not in your desired order.
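To see the two orderings side by side outside the interpreter, here is a minimal sketch. It assumes Python 3, where `iter()` is the replacement for the deprecated `getiterator()`; the names `doc_order` and `grouped` are only for illustration.

```python
import xml.etree.ElementTree as ET

XML = """<Content>
  <Para>first</Para>
  <Table><Para>second</Para></Table>
  <Para>third</Para>
</Content>"""

page = ET.fromstring(XML)

# Comprehension (1): iter("Para") walks the tree depth-first,
# so the Para elements come out in document order.
doc_order = [p.text for p in page.iter("Para")]

# Comprehension (3): the parents are visited in document order, but every
# matching child of a parent is emitted as soon as that parent is visited,
# before the walk descends into any of its siblings.
grouped = [(c.text, p.tag) for p in page.iter() for c in p if c.tag == "Para"]

print(doc_order)
print(grouped)
```

The second list is not really breadth-first; it is the depth-first parent order with each parent's children grouped together.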

One solution is to do your own iteration:

>>> def process(elem, parent):
...    print elem.tag, repr(elem.text), parent.tag if parent is not None else None
...    for child in elem:
...       process(child, elem)
...
>>> process(page, None)
Content '\n  ' None
Para 'first' Content
Table None Content
Para 'second' Table
Para 'third' Content
>>>

Now you can snarf "Para" elements each with a reference to its parent (if any) as they stream past.
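As a rough sketch of that idea, the same recursion can be turned into a generator and filtered by tag. This assumes Python 3 (for `yield from`), and `walk_with_parent` is a hypothetical name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

XML = """<Content>
  <Para>first</Para>
  <Table><Para>second</Para></Table>
  <Para>third</Para>
</Content>"""


def walk_with_parent(elem, parent=None):
    """Yield (element, parent) pairs depth-first, in document order."""
    yield elem, parent
    for child in elem:
        yield from walk_with_parent(child, elem)


page = ET.fromstring(XML)

# Keep only the Para elements; the root's parent is None, so guard for it.
paras = [
    (e.text, p.tag if p is not None else None)
    for e, p in walk_with_parent(page)
    if e.tag == "Para"
]
print(paras)
```

Because the recursion uses one Python stack frame per nesting level, very deeply nested documents could hit the recursion limit.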

This can be wrapped up nicely in a generator gadget:

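>>> def iterate_with_parent(elem):
...     # Explicit-stack depth-first walk yielding (element, parent) pairs;
...     # the root itself is not yielded, since it has no parent.
...     stack = []
...     while True:
...         # Push children in reverse so the first child is popped first.
...         for child in reversed(elem):
...             stack.append((child, elem))
...         if not stack:
...             return
...         elem, parent = stack.pop()
...         yield elem, parent
...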
>>>
>>> showtag = lambda e: e.tag if e is not None else None
>>> showtext = lambda e: repr((e.text or '').rstrip())
>>> for e, p in iterate_with_parent(page):
...     print e.tag, showtext(e), showtag(p)
...
Para 'first' Content
Table '' Content
Para 'second' Table
Para 'third' Content
>>>
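A different approach, not the one used above, is to build a child-to-parent dictionary once and then walk the "Para" elements in document order. The sketch below assumes Python 3 and `iter()`; `parent_of` is an illustrative name.

```python
import xml.etree.ElementTree as ET

XML = """<Content>
  <Para>first</Para>
  <Table><Para>second</Para></Table>
  <Para>third</Para>
</Content>"""

page = ET.fromstring(XML)

# Map every child element to its parent in a single pass over the tree.
parent_of = {child: parent for parent in page.iter() for child in parent}

# iter("Para") preserves document order; a matching root would map to None.
paras = [(c, parent_of.get(c)) for c in page.iter("Para")]

for c, p in paras:
    print(c.text, p.tag if p is not None else None)
```

This walks the tree twice, but it leaves the filtering to `iter()` and makes the parent of any element available by lookup afterwards.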

Upvotes: 5
