carousallie

Reputation: 873

Create raw text from XML tags

I have some XML that runs through an NLP processor. I have to modify the output in a Python script, so no XSLT for me. I'm trying to extract all of the raw text between <TXT> and </TXT> as a string, but I'm stuck on how to pull this out of ElementTree.

My code up to this point is

import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""

tree = ET.parse(xml_doc) # xml_doc is actually a file, but for reproducibility it's the XML above

and from there I want to extract everything within TXT as a string stripped of tags. It must be a string for some other processes further down the line. I'd like it to look like output_txt below.

output_txt = "George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222."

I imagine this should be fairly easy and straightforward, but I just can't figure it out. I tried using this solution, but I got AttributeError: 'ElementTree' object has no attribute 'itertext', and it would strip all tags in the XML rather than just those between <TXT> and </TXT>.

Upvotes: 3

Views: 1163

Answers (1)

Daniel Haley

Reputation: 52888

Normally I'd use plain XPath to do this:

normalize-space(//TXT)

However, the XPath support in ElementTree is limited and does not include functions such as normalize-space(), so you'd only be able to do this in lxml.

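To do it in ElementTree, I'd do something similar to the answer you linked to in your question: select only the TXT element, then force it to plain text with tostring using method="text". You'd also want to normalize the whitespace, because the indentation and newlines inside TXT are part of the text.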

Example...

import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""

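# fromstring returns the root Element (NORMDOC), not an ElementTree
root = ET.fromstring(xml_doc)

# Serialize only the TXT subtree as text, which drops all tags
txt = root.find(".//TXT")
raw_text = ET.tostring(txt, encoding='utf8', method='text').decode()
# Collapse the newlines and indentation into single spaces
normalized_text = " ".join(raw_text.split())
print(normalized_text)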

Printed output...

George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222.
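
As for the AttributeError in the question: ET.parse returns an ElementTree wrapper, while itertext() is defined on Element. Calling it on the element returned by find() should work, and it also limits the text to TXT. Here is a rough sketch of that alternative, assuming the real file has the same structure as the sample. The helper name txt_as_string is made up for illustration, the sid/id attributes are shortened, and io.BytesIO only stands in for the actual file.

```python
import io
import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S sid="s1"><ENAMEX type="PERSON">George Washington</ENAMEX> and <ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S sid="s2"><ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""


def txt_as_string(source):
    """Hypothetical helper: `source` is a path or file object, as accepted by ET.parse."""
    tree = ET.parse(source)      # ElementTree wrapper: it has no itertext()
    txt = tree.find(".//TXT")    # Element: this is where itertext() lives
    if txt is None:
        raise ValueError("no <TXT> element found")
    # itertext() yields every text and tail fragment inside TXT, in document order
    joined = "".join(txt.itertext())
    # Collapse the indentation and newlines into single spaces
    return " ".join(joined.split())


# BytesIO stands in for the real file here; a filename would work the same way
output_txt = txt_as_string(io.BytesIO(xml_doc.encode("utf-8")))
print(output_txt)
```

With a real file you would pass its path instead, for example txt_as_string("doc.xml"), where doc.xml is a placeholder name.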

Upvotes: 3
