Python/ElementTree: Parsing inline elements w/ respect to surrounding text?

Question

I need to parse some XML containing inline elements. The XML looks, for example, like this:

<section> Fubar, I'm so fubar, fubar and even more <fref>fubare</fref>. And yet more fubar. </section>

Iterating over this structure with "for elem in list(parent): ..." only gives me access to fref. If I then process fref, the surrounding text is lost, since text isn't a real element.

Does anybody know of a way to properly address this issue?

Lev Levitsky · Accepted Answer

The following shows how to achieve this with lxml.

>>> from lxml.etree import fromstring
>>> tree = fromstring('''<section> Fubar, I'm so fubar, fubar and even more <fref>fubare</fref>. And yet more fubar. </section>''')
>>> elem = tree.xpath('/section/fref')[0]
>>> elem.text
'fubare'
>>> elem.tail
'. And yet more fubar. '
>>> elem.getparent().text
" Fubar, I'm so fubar, fubar and even more "
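The same text/tail model also exists in the standard library's xml.etree.ElementTree, which has no getparent(), so the parent has to be kept at hand while iterating. The following is only a minimal sketch: the helper name segments is made up for illustration, and the sample markup is an assumption reconstructed from the XPath and outputs shown above (any attributes on fref are omitted).

```python
import xml.etree.ElementTree as ET

# Assumed sample markup, reconstructed from the outputs shown in the answer.
XML = ("<section> Fubar, I'm so fubar, fubar and even more "
       "<fref>fubare</fref>. And yet more fubar. </section>")


def segments(parent):
    """Yield (kind, value) pairs for the direct content of parent, in document order.

    Hypothetical helper: text before the first child lives in parent.text,
    and text following each child lives in that child's .tail.
    """
    if parent.text:
        yield ("text", parent.text)
    for child in parent:
        yield ("element", child)
        if child.tail:
            yield ("text", child.tail)


section = ET.fromstring(XML)
for kind, value in segments(section):
    if kind == "text":
        print("text:   ", repr(value))
    else:
        print("element:", value.tag, repr(value.text))
```

Processing the pairs in order lets each fref be handled in the context of the text immediately before and after it.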

From the lxml.etree tutorial:

If you want to read only the text, i.e. without any intermediate tags, you have to recursively concatenate all text and tail attributes in the correct order. Again, the tostring() function comes to the rescue, this time using the method keyword:

>>> from lxml.etree import tostring
>>> tostring(tree, method="text", encoding="unicode")
" Fubar, I'm so fubar, fubar and even more fubare. And yet more fubar. "
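To make the "recursively concatenate all text and tail attributes" step concrete, here is a rough sketch using only the standard library. The function name all_text is hypothetical, and the sample markup is again an assumed reconstruction of the question's XML. The result is compared against itertext(), which ElementTree provides for the same purpose.

```python
import xml.etree.ElementTree as ET


def all_text(elem):
    """Concatenate elem.text, then each child's full text and tail, in document order.

    Hypothetical helper. The tail of elem itself is left out on purpose,
    because it belongs to the content of elem's parent.
    """
    parts = [elem.text or ""]
    for child in elem:
        parts.append(all_text(child))   # text inside the child, recursively
        parts.append(child.tail or "")  # text between this child and the next
    return "".join(parts)


# Assumed sample markup, reconstructed from the outputs shown in the answer.
section = ET.fromstring(
    "<section> Fubar, I'm so fubar, fubar and even more "
    "<fref>fubare</fref>. And yet more fubar. </section>"
)
plain = all_text(section)
# itertext() walks text and tail values in the same order.
assert plain == "".join(section.itertext())
print(repr(plain))
```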

There is also an XPath way to do this; it is described in the same tutorial.
