I have some XML that runs through an NLP processor. I have to modify the output in a Python script, so no XSLT for me. I'm trying to extract all of the raw text within <TXT> and </TXT> as a string from my XML, but I'm stuck on how to pull this from ElementTree.
My code up to this point is:
import xml.etree.ElementTree as ET
xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
<DOC>
<DOCID>112233</DOCID>
<TXT>
<S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
<S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
</TXT>
</DOC>
</NORMDOC>
"""
tree = ET.parse(xml_doc)  # xml_doc is actually a file, but for reproducibility it's the above XML
From there, I want to extract everything within TXT as a string stripped of tags. It must be a string for some other processes further down the line. I'd like it to look like output_txt below.
output_txt = "George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222."
I imagine this should be fairly easy and straightforward, but I just can't figure it out. I tried using this solution, but I got AttributeError: 'ElementTree' object has no attribute 'itertext', and it would strip all tags in the XML rather than just those between <TXT> and </TXT>.
Reputation: 52888
Normally I'd use plain XPath to do this:
normalize-space(//TXT)
However, the XPath support in ElementTree is limited so you'd only be able to do this in lxml.
To do it in ElementTree, I'd do it similarly to the answer you linked to in your question: force it to plain text with tostring using method="text". You'd also want to normalize the whitespace.
Example...
import xml.etree.ElementTree as ET
xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
<DOC>
<DOCID>112233</DOCID>
<TXT>
<S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
<S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
</TXT>
</DOC>
</NORMDOC>
"""
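# fromstring returns the root Element (NORMDOC), not an ElementTree
tree = ET.fromstring(xml_doc)
txt = tree.find(".//TXT")
# method='text' drops all tags and keeps only the text content under TXT
raw_text = ET.tostring(txt, encoding='utf8', method='text').decode()
# Collapse newlines and runs of spaces into single spaces
normalized_text = " ".join(raw_text.split())
print(normalized_text)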
Printed output...
George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222.
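As for the AttributeError in the question: ET.parse returns an ElementTree wrapper, while itertext is a method on Element, so it has to be called on the root (or on the TXT element) instead. Here is a rough sketch of that variant, assuming the input comes from a file as the question says. The helper name txt_as_string and the trimmed sample document are made up for illustration, and io.BytesIO just stands in for the real file.

```python
import io
import xml.etree.ElementTree as ET

# Trimmed stand-in for the question's document (attributes shortened).
xml_doc = """<NORMDOC>
<DOC>
<DOCID>112233</DOCID>
<TXT>
<S sid="1"><ENAMEX type="PERSON">George Washington</ENAMEX> and <ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
<S sid="2"><ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN">222-22-2222</IDEX>.</S>
</TXT>
</DOC>
</NORMDOC>
"""


def txt_as_string(source):
    """Return the tag-free, whitespace-normalized text inside <TXT>.

    `source` is a filename or a file object, as accepted by ET.parse.
    (Hypothetical helper, not part of ElementTree.)
    """
    tree = ET.parse(source)      # ElementTree: has no itertext()
    root = tree.getroot()        # Element: has find() and itertext()
    txt = root.find(".//TXT")
    if txt is None:
        raise ValueError("no <TXT> element found")
    # itertext() yields every text fragment under TXT in document order
    raw_text = "".join(txt.itertext())
    # Collapse newlines and repeated spaces into single spaces
    return " ".join(raw_text.split())


# BytesIO plays the role of the file on disk
result = txt_as_string(io.BytesIO(xml_doc.encode("utf-8")))
print(result)
```

Because find is called for TXT first, the DOCID text is left out and only the content between <TXT> and </TXT> ends up in the string.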