Reputation: 175
I need to parse an XML document that looks like this:
<tag>
text1 text2 text3
<some-tag/>
More text
<some-tag/>
Some more text
<some-tag/>
Even more text
</tag>
Using ElementTree's text and tail attributes, I can get to "text1 text2 text3" and "Even more text".
However, I can't find a way to reach the text in the middle ("More text" and "Some more text").
Due to the idiosyncrasies of the software generating the XML, I cannot predict the names of the stray tags, so I can't rely on find('some-tag').
Is there any way to parse this XML using Python?
Thanks
Upvotes: 3
Views: 363
Reputation: 11591
"More text" and "Some more text" are the tails of the some-tag elements. See the following:
>>> import xml.etree.ElementTree as et
>>> text = """<tag>
text1 text2 text3
<some-tag/>
More text
<some-tag/>
Some more text
<some-tag/>
Even more text
</tag>"""
>>> root = et.fromstring(text)
>>> for element in root:  # leaving aside the text and tail of root for the moment
    print(element.tag, ': text =>', element.text or '', 'tail =>', element.tail)
some-tag : text => tail => # the tail also has a newline character and white space at its beginning
More text
some-tag : text => tail =>
Some more text
some-tag : text => tail =>
Even more text
Thus, to collect all of the text, read the parent's text and then iterate over its children, checking whether each child has a tail.
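A minimal sketch of that iteration, assuming Python 3 and the sample document from the question. The helper name text_segments is made up for illustration; it only looks at the direct children of the element and never checks tag names, so it should not matter what the stray tags are called.

```python
import xml.etree.ElementTree as ET

def text_segments(element):
    """Yield the non-blank text pieces directly inside `element`, in document order.

    `text_segments` is a hypothetical helper, not part of ElementTree.
    """
    # Text between the opening tag and the first child lives in element.text.
    if element.text and element.text.strip():
        yield element.text.strip()
    # Text that follows each child (up to the next tag) lives in child.tail.
    for child in element:
        if child.tail and child.tail.strip():
            yield child.tail.strip()

xml_source = """<tag>
text1 text2 text3
<some-tag/>
More text
<some-tag/>
Some more text
<some-tag/>
Even more text
</tag>"""

root = ET.fromstring(xml_source)
segments = list(text_segments(root))
print(segments)
# ['text1 text2 text3', 'More text', 'Some more text', 'Even more text']

# Alternative: itertext() walks the whole subtree, so it also returns
# text nested inside the child elements, if they have any.
all_text = [piece.strip() for piece in root.itertext() if piece.strip()]
```

For this sample both approaches give the same list, because the stray tags are empty. If the stray tags can contain text of their own that you want to skip, the explicit text/tail loop is probably the safer choice.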
Upvotes: 3