Reputation: 6618
The standard implementation of ElementTree for Python (2.6) does not provide pointers to parents from child nodes. Therefore, if parents are needed, it is suggested to loop over parents rather than children.
Suppose my XML is of the form:
<Content>
<Para>first</Para>
<Table><Para>second</Para></Table>
<Para>third</Para>
</Content>
The following finds all "Para" nodes without considering parents:
(1) paras = [p for p in page.getiterator("Para")]
This (adapted from effbot) stores the parent by looping over them instead of the child nodes:
(2) paras = [(c,p) for p in page.getiterator() for c in p]
This makes perfect sense, and can be extended with a conditional to achieve the (supposedly) same result as (1), but with parent info added:
(3) paras = [(c,p) for p in page.getiterator() for c in p if c.tag == "Para"]
The ElementTree documentation suggests that the getiterator() method does a depth-first search. Running it without looking for the parent (1) yields:
first
second
third
However, extracting the text from paras in (3) yields:
first, Content>Para
third, Content>Para
second, Table>Para
This appears to be breadth-first.
This therefore raises two questions.
Upvotes: 4
Views: 6629
Reputation: 83032
Consider this:
>>> xml = """<Content>
... <Para>first</Para>
... <Table><Para>second</Para></Table>
... <Para>third</Para>
... </Content>"""
>>> import xml.etree.cElementTree as et
>>> page = et.fromstring(xml)
>>> for p in page.getiterator():
...     print "ppp", p.tag, repr(p.text)
...     for c in p:
...         print "ccc", c.tag, repr(c.text), p.tag
...
ppp Content '\n '
ccc Para 'first' Content
ccc Table None Content
ccc Para 'third' Content
ppp Para 'first'
ppp Table None
ccc Para 'second' Table
ppp Para 'second'
ppp Para 'third'
>>>
Aside: list comprehensions are magnificent until you want to see exactly what is being iterated over :-)
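getiterator is producing the "ppp" elements in the advertised order. However, you are plucking your elements of interest out of the subsidiary "ccc" elements, which are not in your desired order: all the direct children of an element are emitted when that element is visited, so both "Para" children of Content appear before the "Para" inside Table.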
One solution is to do your own iteration:
>>> def process(elem, parent):
...     print elem.tag, repr(elem.text), parent.tag if parent is not None else None
...     for child in elem:
...         process(child, elem)
...
>>> process(page, None)
Content '\n ' None
Para 'first' Content
Table None Content
Para 'second' Table
Para 'third' Content
>>>
Now you can snarf "Para" elements each with a reference to its parent (if any) as they stream past.
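As a rough sketch of that filtering step (written for Python 3, unlike the Python 2 sessions above), the recursion can collect only the matching elements. The helper name collect_with_parent is made up for illustration and is not part of ElementTree:

```python
import xml.etree.ElementTree as ET

XML = """<Content>
<Para>first</Para>
<Table><Para>second</Para></Table>
<Para>third</Para>
</Content>"""


def collect_with_parent(elem, tag, parent=None, found=None):
    """Return (element, parent) pairs for every `tag` element, in document order.

    Hypothetical helper for illustration; not an ElementTree API.
    """
    if found is None:
        found = []
    if elem.tag == tag:
        found.append((elem, parent))
    # Recurse into the children, passing the current element as their parent.
    for child in elem:
        collect_with_parent(child, tag, elem, found)
    return found


page = ET.fromstring(XML)
pairs = collect_with_parent(page, "Para")
summary = [(e.text, p.tag) for e, p in pairs]
print(summary)  # [('first', 'Content'), ('second', 'Table'), ('third', 'Content')]
```

Each element is tested when it is itself visited, not when its parent is, so the pairs come out in document order.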
This can be wrapped up nicely in a generator gadget:
>>> def iterate_with_parent(elem):
...     stack = []
...     while 1:
...         # push children in reverse so that they pop in document order
...         for child in reversed(elem):
...             stack.append((child, elem))
...         if not stack: return
...         elem, parent = stack.pop()
...         yield elem, parent
...
>>>
>>> showtag = lambda e: e.tag if e is not None else None
>>> showtext = lambda e: repr((e.text or '').rstrip())
>>> for e, p in iterate_with_parent(page):
... print e.tag, showtext(e), showtag(p)
...
Para 'first' Content
Table '' Content
Para 'second' Table
Para 'third' Content
>>>
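Another option, not shown above and offered only as a sketch, is to build a child-to-parent dictionary once and then iterate over the "Para" elements in document order. This assumes Python 3, where iter() is the newer spelling of getiterator(). The name parent_of is arbitrary:

```python
import xml.etree.ElementTree as ET

XML = """<Content>
<Para>first</Para>
<Table><Para>second</Para></Table>
<Para>third</Para>
</Content>"""

page = ET.fromstring(XML)

# Map each child to its parent. The order in which the pairs are produced
# does not matter here, because the dictionary is only used for lookups.
parent_of = {child: parent for parent in page.iter() for child in parent}

# iter("Para") visits the matching elements in document order.
paras = [(c.text, parent_of[c].tag) for c in page.iter("Para")]
print(paras)  # [('first', 'Content'), ('second', 'Table'), ('third', 'Content')]
```

This costs an extra pass over the tree, but the map can be reused for any later parent lookups. The root element has no entry in it, since it has no parent.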
Upvotes: 5