Rolf Lussi

Reputation: 615

Python XML parser doesn't get all text

I have the following XML source:

<a>
  <b>
     first
  </b>
  second
</a>

I am trying to parse it with Python to extract the text and combine it into a single string such as firstsecond. For this I have the following script:

import xml.etree.ElementTree as ET

top = ET.fromstring(myXml)
for a in top.iter('a'):
  s = ''
  if a.text:
    s += a.text
  else:
    for b in a.iter('b'):
      if b.text:
        s += b.text
  print(s)

But the script only prints the first string, first. The second one seems to get lost. It works when both strings are directly inside <a></a> or when both are inside <b></b> elements.

<a>
  firstsecond
</a>

Prints firstsecond

<a>
  <b>
     first
  </b>
  <b>
     second
  </b>
</a>

Prints firstsecond

Am I missing something needed to get the second string when it is in the same <a></a> as the <b></b>? Or is this simply not possible with etree, so that I have to repack the data? The XML is given, so I can't change the source.

Thanks for any help.

Upvotes: 1

Views: 1303

Answers (3)

aBiologist

Reputation: 2027

How about this one? I tested it on your XML file:

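import xml.etree.ElementTree as ET

x = 'xml.xml'  # path to your XML file
tree = ET.parse(x)
root = tree.getroot()
string = ""
for c in root:
    # c.text is the text inside <b>...</b>; c.tail is the text after </b>
    string += (c.text or "").strip() + (c.tail or "").strip()
print(string)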

output:

 firstsecond

Upvotes: 0

Mark Tolonen

Reputation: 177674

In your first example, b.tail contains second. In ElementTree, text that follows an element's end tag is stored in that element's tail attribute. Strictly speaking, the tail also includes the surrounding whitespace, so it looks more like '\n  second\n'.
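
To make the text/tail split concrete, here is a minimal sketch that parses the first XML from the question and prints where each piece of text ends up. The variable names xml_source and combined are only illustrative, and print() is written in Python 3 style.

```python
import xml.etree.ElementTree as ET

# The XML from the question, reproduced as a string literal.
xml_source = """<a>
  <b>
     first
  </b>
  second
</a>"""

root = ET.fromstring(xml_source)
b = root.find('b')

print(repr(root.text))  # whitespace between <a> and <b>
print(repr(b.text))     # text inside <b>...</b>
print(repr(b.tail))     # text between </b> and </a>

# Strip the formatting whitespace and join the two pieces.
combined = b.text.strip() + b.tail.strip()
print(combined)
```

The second string never shows up in a.text because a.text only covers the text before the first child element.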

Consider a nicely formatted data block of XML:

<a>
  <b>first</b>
  <b>second</b>
</a>

Here you will get data fields in b.text and whitespace formatting in tail, which can easily be ignored.
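
As a rough check of that claim, the following sketch parses the nicely formatted block and confirms that the data sits in b.text while every tail is whitespace only. The names fields and tails_are_blank are made up for this example.

```python
import xml.etree.ElementTree as ET

xml_source = """<a>
  <b>first</b>
  <b>second</b>
</a>"""

root = ET.fromstring(xml_source)

# The data fields live in the text of each <b> element.
fields = [b.text for b in root.iter('b')]

# The tails hold only the indentation and newlines between tags.
tails_are_blank = all((b.tail or '').strip() == '' for b in root.iter('b'))

print(fields)
print(tails_are_blank)
```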

Upvotes: 3

Rolf Lussi

Reputation: 615

I found a way to simplify it with the tostring function.

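import xml.etree.ElementTree as ET

top = ET.fromstring(myXml)
for a in top.iter('a'):
    # method='text' emits only the text content, without tags.
    # On Python 3, encoding='unicode' returns a str instead of bytes.
    s = ET.tostring(a, encoding='unicode', method='text')
    print(s)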

With method='text', tostring concatenates the text of the element and all of its subelements, including the tail text that follows each subelement.
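
Note that the tostring output keeps the indentation and newlines from the source. If the goal is exactly firstsecond, one possible variation is to walk the text fragments with Element.itertext() and strip each one. This is a sketch for Python 3; joined_text is a hypothetical helper name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

xml_source = """<a>
  <b>
     first
  </b>
  second
</a>"""


def joined_text(element):
    """Join every text fragment under element, stripping whitespace from each."""
    return ''.join(part.strip() for part in element.itertext())


top = ET.fromstring(xml_source)
for a in top.iter('a'):
    # Raw text output still contains the formatting whitespace.
    raw = ET.tostring(a, encoding='unicode', method='text')
    print(repr(raw))
    # Stripped and joined fragments.
    print(joined_text(a))
```

Stripping each fragment also removes whitespace inside a fragment's edges, so this only suits data where the separating whitespace is pure formatting.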

Upvotes: 1
