Reputation: 615
I have the following XML source:
<a>
<b>
first
</b>
second
</a>
I am trying to parse it with Python to extract the text and combine it all into one string like firstsecond. For this I have the following script:
import xml.etree.ElementTree as ET
top = ET.fromstring(myXml)
for a in top.iter('a'):
    s = ''
    if a.text:
        s += a.text
    else:
        for b in a.iter('b'):
            if b.text:
                s += b.text
    print(s)
But the script just prints the first element, first. The second somehow seems to get lost. It works when I have both strings in <a></a> or both in <b></b>.
<a>
firstsecond
</a>
Prints firstsecond
<a>
<b>
first
</b>
<b>
second
</b>
</a>
Prints firstsecond
Am I missing something needed to get the second string out when it is in the same <a></a> as the <b></b>? Or is this just not possible with etree, so that I have to repack it? The XML is given, so I won't be able to change the source.
Thanks for any help.
Upvotes: 1
Views: 1303
Reputation: 2027
How about this? I tested it on your XML file:
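import xml.etree.ElementTree as ET
x = 'xml.xml'  # your xml file
tree = ET.parse(x)
root = tree.getroot()
string = ""
for c in root:
    # c.text is the text inside the child; c.tail is the text after its end tag
    string += (c.text or "").strip() + (c.tail or "").strip()
print(string)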
output:
firstsecond
Upvotes: 0
Reputation: 177674
b.tail will contain second in your first example. Text after an end tag is considered the tail of that element in ElementTree. Actually, it will contain the whitespace as well and be more like \n second\n.
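As a minimal sketch of the text/tail split described above: the XML string below is the question's first example written out with assumed line breaks, and collect_text is a hypothetical helper name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# The question's first example; the exact whitespace layout is an assumption.
my_xml = "<a>\n<b>\nfirst\n</b>\nsecond\n</a>"

def collect_text(elem):
    """Join an element's text, its children's text, and each child's tail."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(collect_text(child))  # text inside the child
        parts.append(child.tail or "")     # text after the child's end tag
    return "".join(p.strip() for p in parts)

a = ET.fromstring(my_xml)
b = a.find("b")
print(repr(b.tail))   # '\nsecond\n'
combined = collect_text(a)
print(combined)       # firstsecond
```

Stripping each piece drops the surrounding newlines; if the whitespace inside the text matters, the strip() call would have to go.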
Consider a nicely formatted data block of XML:
<a>
<b>first</b>
<b>second</b>
</a>
Here you will get the data fields in b.text and only whitespace formatting in b.tail, which can easily be ignored.
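A short sketch of that case, assuming the nicely formatted block is indented with two spaces (the indentation is an assumption): the data sits in b.text and every b.tail is whitespace only.

```python
import xml.etree.ElementTree as ET

# The nicely formatted example; the two-space indentation is an assumption.
my_xml = "<a>\n  <b>first</b>\n  <b>second</b>\n</a>"

a = ET.fromstring(my_xml)
fields = [b.text for b in a.iter("b")]

# Each tail holds only the line break and indentation after </b>.
tails_are_blank = all(not (b.tail or "").strip() for b in a.iter("b"))

print(fields)            # ['first', 'second']
print(tails_are_blank)   # True
print("".join(fields))   # firstsecond
```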
Upvotes: 3
Reputation: 615
I found a way to simplify it with the tostring function.
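import xml.etree.ElementTree as ET
top = ET.fromstring(myXml)
for a in top.iter('a'):
    # On Python 3, encoding='unicode' makes tostring return str instead of bytes
    s = ET.tostring(a, encoding='unicode', method='text')
    print(s)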
With method='text', this function just concatenates the text of the element and all of its subelements, whitespace included.
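A related sketch, in case the whitespace needs to be dropped: Element.itertext() yields the same text and tail pieces in document order, so each piece can be stripped before joining. The XML string is the question's first example with assumed line breaks.

```python
import xml.etree.ElementTree as ET

# The question's first example; the exact whitespace layout is an assumption.
my_xml = "<a>\n<b>\nfirst\n</b>\nsecond\n</a>"

a = ET.fromstring(my_xml)

# itertext() walks a.text, b.text, b.tail, ... in document order.
raw = "".join(a.itertext())
stripped = "".join(chunk.strip() for chunk in a.itertext())

print(repr(raw))
print(stripped)   # firstsecond
```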
Upvotes: 1