Get text from mixed element xml tags with ElementTree

Question

I'm using ElementTree to parse an XML document that I have. I am getting the text from the u tags. Some of them have mixed content that I need to filter out or keep as text. Two examples that I have are:


   
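<u>
   <vocal type="filler">
     <desc>eh</desc>
   </vocal>¿Sí?
</u>

<u>Pues...
   <vocal type="non-ling">
     <desc>laugh</desc>
   </vocal>A mí no me suena.
</u>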

I want to get the text within the vocal tag if its type is filler but not if its type is non-ling.

If I iterate through the children of u, somehow the last text bit is always lost. The only way that I can reach it is by using itertext(). But then the chance to check the type of the vocal tag is lost.

How can I parse it so that I get a result like this:

eh ¿Sí? 
Pues... A mí no me suena.

mzjn · Accepted Answer

The lost text bits, "¿Sí?" and "A mí no me suena.", are available as the tail property of each element (the text following the element's end tag).
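To make the text/tail split concrete, here is a minimal Python 3 sketch. It assumes the u / vocal / desc structure that the code in this answer relies on, and it inlines the second utterance as a string instead of reading a file:

```python
from xml.etree import ElementTree as ET

# The second utterance, inlined as a string for illustration
# (assumed structure: <u> containing <vocal type="..."><desc>).
u = ET.fromstring(
    '<u>Pues... <vocal type="non-ling"><desc>laugh</desc></vocal>'
    'A mí no me suena.</u>'
)
vocal = u.find("vocal")

print(repr(u.text))                  # 'Pues... '  (text before the first child)
print(repr(vocal.findtext("desc")))  # 'laugh'     (text inside <desc>)
print(repr(vocal.tail))              # 'A mí no me suena.'  (text after </vocal>)
```

Iterating over the children of u only visits vocal, so the trailing sentence is reachable only through vocal.tail, never through u.text.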

Here is one way to get the desired output (tested with Python 2.7).

Assume that vocal.xml looks like this:


  
    
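<root>
  <u>
    <vocal type="filler">
      <desc>eh</desc>
    </vocal>¿Sí?
  </u>

  <u>Pues...
     <vocal type="non-ling">
       <desc>laugh</desc>
     </vocal>A mí no me suena.
  </u>
</root>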

Code:

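from xml.etree import ElementTree as ET

# parse() returns an ElementTree; findall() searches from its root element.
root = ET.parse("vocal.xml")

for u in root.findall(".//u"):
    # Assumes each <u> has exactly one <vocal> child, as in the sample.
    v = u.find("vocal")

    if v.get("type") == "filler":
        # Keep the filler's description between the leading text and the tail.
        frags = [u.text, v.findtext("desc"), v.tail]
    else:
        # Skip non-linguistic vocals; keep only the surrounding text.
        frags = [u.text, v.tail]

    # Python 2: encode each fragment to UTF-8 bytes before printing.
    print " ".join(t.encode("utf-8").strip() for t in frags).strip()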

Output:

eh ¿Sí?
Pues... A mí no me suena.
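The code above targets Python 2.7 and assumes exactly one vocal per u. As a hedged sketch of the same idea for Python 3, the variant below walks every child of u, so it also copes with several vocal elements or none, and it skips fragments that are None. The helper name utterance_text and the inlined XML string are illustrative assumptions, not part of the original answer:

```python
from xml.etree import ElementTree as ET

# Sample document inlined for illustration (assumed structure).
XML = """<root>
  <u>
    <vocal type="filler">
      <desc>eh</desc>
    </vocal>¿Sí?
  </u>
  <u>Pues...
    <vocal type="non-ling">
      <desc>laugh</desc>
    </vocal>A mí no me suena.
  </u>
</root>"""


def utterance_text(u):
    """Join the text of a <u> element, keeping only filler vocals."""
    frags = [u.text]
    for child in u:
        # Keep the <desc> text only for filler vocals.
        if child.tag == "vocal" and child.get("type") == "filler":
            frags.append(child.findtext("desc"))
        # The tail belongs to the utterance whatever the child is.
        frags.append(child.tail)
    # Drop None and whitespace-only fragments, then normalise spacing.
    return " ".join(f.strip() for f in frags if f and f.strip())


root = ET.fromstring(XML)
lines = [utterance_text(u) for u in root.iter("u")]
for line in lines:
    print(line)
```

In Python 3, print is a function and strings are already Unicode, so the encode("utf-8") step is no longer needed.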
