Reputation: 7486
I need to parse some XML containing inline elements. The XML looks, for example, like this:
<section>
Fubar, I'm so fubar, fubar and even more <fref bar="baz">fubare</fref>. And yet more fubar.
</section>
If I now iterate over this structure with for elem in list(parent): ..., I only get access to fref. When I then process fref, the surrounding text is lost, since text isn't a real element.
Does anybody know of a way to properly address this issue?
Upvotes: 5
Views: 1105
Reputation: 65841
The following shows how to achieve this with lxml.
>>> from lxml.etree import fromstring
>>> tree = fromstring('''<section> Fubar, I'm so fubar, fubar and even more <fref bar="baz">fubare</fref>. And yet more fubar. </section>''')
>>> elem = tree.xpath('/section/fref')[0]
>>> elem.text
'fubare'
>>> elem.tail
'. And yet more fubar. '
>>> elem.getparent().text
" Fubar, I'm so fubar, fubar and even more "
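The same text and tail attributes can be used to walk mixed content in document order. Below is a minimal sketch of that idea, not a definitive implementation. It uses the standard library's xml.etree.ElementTree, which exposes the same text and tail attributes as lxml.etree; the helper name iter_mixed and the shortened sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET


def iter_mixed(parent):
    """Yield the parent's text chunks and child elements in document order.

    Hypothetical helper: text before the first child lives in parent.text,
    text after each child lives in that child's tail.
    """
    if parent.text:
        yield parent.text
    for child in parent:
        yield child
        if child.tail:
            yield child.tail


# Shortened version of the sample from the question (an assumption for brevity).
section = ET.fromstring(
    '<section>Fubar and even more <fref bar="baz">fubare</fref>'
    '. And yet more fubar.</section>'
)

parts = list(iter_mixed(section))

for part in parts:
    if isinstance(part, str):
        print("text:", repr(part))
    else:
        print("element:", part.tag, part.attrib, repr(part.text))
```

With this approach the inline element can be processed where it occurs without losing the text on either side of it.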
From the lxml.etree tutorial:
If you want to read only the text, i.e. without any intermediate tags, you have to recursively concatenate all text and tail attributes in the correct order. Again, the tostring() function comes to the rescue, this time using the method keyword:
>>> from lxml.etree import tostring
>>> tostring(tree, method="text", encoding="unicode")
" Fubar, I'm so fubar, fubar and even more fubare. And yet more fubar. "
There's also an XPath way to do this; it's described in the linked page.
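A minimal sketch of the XPath approach, assuming lxml is installed: string() returns the concatenated text, while //text() returns the individual chunks. In lxml those chunks are "smart" strings that remember where they came from through getparent(), is_text and is_tail. The variable names here are illustrative only.

```python
from lxml import etree

tree = etree.fromstring(
    '<section> Fubar, I\'m so fubar, fubar and even more '
    '<fref bar="baz">fubare</fref>. And yet more fubar. </section>'
)

# XPath string(): all text below the context node, concatenated.
full_text = tree.xpath("string()")

# XPath //text(): one result per text node, in document order.
chunks = tree.xpath("//text()")

# For each chunk, record the element it belongs to and whether it is that
# element's tail (text following the element) rather than its own text.
origins = [(chunk.getparent().tag, chunk.is_tail) for chunk in chunks]

print(repr(full_text))
print(origins)
```

The is_tail flag makes it possible to tell text inside fref apart from the text that follows it, which plain string concatenation cannot do.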
Upvotes: 5