Create raw text from XML tags

Question

I have some XML that runs through an NLP processor. I have to modify the output in a Python script, so no XSLT for me. I'm trying to extract all the raw text between <TXT> and </TXT> as a string from my XML, but I'm stuck on how to pull this out with ElementTree.

My code up to this point is

import xml.etree.ElementTree as ET

xml_doc = """

   
      112233
      
        George Washington and Thomas Jefferson were both founding fathers.
        Thomas Jefferson has a social security number of 222-22-2222.
      
   

"""

tree = ET.parse(xml_doc)  # xml_doc is actually a file, but for reproducibility it's the XML above

From there, I want to extract everything within TXT as a string stripped of tags. It must be a string for some other processes further down the line. I'd like it to look like output_txt below.

output_txt = "George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222."

I imagine this should be fairly easy and straightforward, but I just can't figure it out. I tried using this solution, but I got AttributeError: 'ElementTree' object has no attribute 'itertext', and it would strip all tags in the XML rather than just those between <TXT> and </TXT>.

Daniel Haley · Accepted Answer

Normally I'd use plain XPath to do this:

normalize-space(//TXT)

However, the XPath support in ElementTree is limited so you'd only be able to do this in lxml.

To do it in ElementTree, I'd do something similar to the answer you linked to in your question: force the element to plain text with tostring using method="text". You'd also want to normalize the whitespace.

Example...

import xml.etree.ElementTree as ET

xml_doc = """

   
      112233
      
        George Washington and Thomas Jefferson were both founding fathers.
        Thomas Jefferson has a social security number of 222-22-2222.
      
   

"""

tree = ET.fromstring(xml_doc)

txt = tree.find(".//TXT")
raw_text = ET.tostring(txt, encoding='utf8', method='text').decode()
normalized_text = " ".join(raw_text.split())
print(normalized_text)

Printed output...

George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222.
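As for the AttributeError in the question: itertext() is a method of Element, not of the ElementTree object that ET.parse returns, so it has to be called on an element such as the one returned by find. Here is a minimal, self-contained sketch of that route, reading from a file-like object the way the question describes. Only the TXT element name comes from the question; the other tag names (REPORT, DOC, DOCID, PERSON, SSN), the txt_as_string helper, and the use of io.StringIO as a stand-in for the real file are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical markup: only the TXT element name comes from the question;
# REPORT, DOC, DOCID, PERSON and SSN are stand-ins for illustration.
xml_doc = """<REPORT>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <PERSON>George Washington</PERSON> and <PERSON>Thomas Jefferson</PERSON> were both founding fathers.
        <PERSON>Thomas Jefferson</PERSON> has a social security number of <SSN>222-22-2222</SSN>.
      </TXT>
   </DOC>
</REPORT>"""


def txt_as_string(source):
    """Return the whitespace-normalized text of the first TXT element.

    `source` is a filename or file object, as accepted by ET.parse.
    """
    tree = ET.parse(source)  # returns an ElementTree, which has no itertext()
    # find() on the tree searches from the root and returns an Element,
    # which is the type that provides itertext().
    txt = tree.find(".//TXT")
    if txt is None:
        raise ValueError("no TXT element found")
    # itertext() yields the text of TXT and its descendants in document order.
    raw_text = "".join(txt.itertext())
    # split() with no arguments splits on any run of whitespace,
    # so joining with single spaces collapses newlines and indentation.
    return " ".join(raw_text.split())


# io.StringIO stands in for the real file here.
output_txt = txt_as_string(io.StringIO(xml_doc))
print(output_txt)
```

Because the search starts at the TXT element, text in sibling elements such as the document ID is left out, which addresses the concern about stripping every tag in the file.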
