Reputation: 477
I am trying to read an XML file with the following content:
<tu creationdate="20100624T160543Z" creationid="SYSTEM" usagecount="0">
<prop type="x-source-tags">1=A,2=B</prop>
<prop type="x-target-tags">1=A,2=B</prop>
<tuv xml:lang="EN">
<seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg>
</tuv>
<tuv xml:lang="DE">
<seg>Modifizierter <ut x="1"/>Denver<ut x="2"/>-Score</seg>
</tuv>
</tu>
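using the following code:
import xml.etree.ElementTree as ET  # assumed import; lxml.etree behaves the same way here

tree = ET.parse(tmx)  # tmx is the path to the file
root = tree.getroot()
seg = root.findall('.//seg')
for n in seg:
    print(n.text)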
It gave the following output:
Modified
Modifizierter
What I expected was:
Modified Denver Score
Modifizierter Denver -Score
Can someone explain why only part of seg is displayed?
Upvotes: 1
Views: 264
Reputation: 50947
You need to be aware of the tail property, which is the text that follows an element's end tag. It is explained well here: http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/etree-view.html.
"Denver" is the tail of the first <ut> element and " Score" is the tail of the second <ut> element. These strings are not part of the text of the <seg> element.
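To make the text/tail split concrete, here is a minimal sketch using only the standard library. The helper name full_text is my own, not part of ElementTree, and it assumes the children of <seg> are empty elements like <ut>, so it does not descend into nested text.

```python
import xml.etree.ElementTree as ET

seg = ET.fromstring('<seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg>')

# .text holds only the text before the first child element.
print(repr(seg.text))

# Each child stores the text that follows its end tag in .tail.
for ut in seg:
    print(ut.get("x"), repr(ut.tail))


def full_text(elem):
    """Join elem.text with the tails of its direct children (hypothetical helper)."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(child.tail or "")
    return "".join(parts)


print(full_text(seg))
```

This is exactly why print(n.text) stops at "Modified ": the rest of the sentence lives on the <ut> children, not on <seg> itself.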
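In addition to the solution provided by kgbplus (which works with both ElementTree and lxml), you can also use the following methods to get the wanted output. The first one, xpath(), requires lxml; the second one, itertext(), is available in lxml and also in the standard library's ElementTree: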
for n in seg:
    print("".join(n.xpath("text()")))

for n in seg:
    print("".join(n.itertext()))
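For reference, here is a self-contained sketch of the itertext() variant run against the sample from the question, using the standard library instead of lxml. The variable names TMX_SNIPPET and segments are mine; the XML is parsed from a string rather than from the file, which is an assumption made only to keep the example runnable.

```python
import xml.etree.ElementTree as ET

# The <tu> element from the question, inlined as a string.
TMX_SNIPPET = """<tu creationdate="20100624T160543Z" creationid="SYSTEM" usagecount="0">
<prop type="x-source-tags">1=A,2=B</prop>
<prop type="x-target-tags">1=A,2=B</prop>
<tuv xml:lang="EN">
<seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg>
</tuv>
<tuv xml:lang="DE">
<seg>Modifizierter <ut x="1"/>Denver<ut x="2"/>-Score</seg>
</tuv>
</tu>"""

root = ET.fromstring(TMX_SNIPPET)

# itertext() yields the element's text plus the text and tail of every descendant,
# in document order, so joining the pieces gives the full segment.
segments = ["".join(seg.itertext()) for seg in root.findall(".//seg")]

for s in segments:
    print(s)
```

Note that the German segment comes out as "Modifizierter Denver-Score", without a space before the hyphen, because the source has no space between the second <ut> and "-Score".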
Upvotes: 2
Reputation: 842
You can use the tostring function:
tree = ET.parse(tmx)
root = tree.getroot()
seg = root.findall('.//seg')
for n in seg:
    print(ET.tostring(n, method="text"))
In your case, the resulting string may contain unwanted characters such as a trailing newline (tostring also includes the element's tail), so you can modify the last line like this:
    print(ET.tostring(n, method="text").strip())
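One caveat if you are on Python 3 with the standard library's ElementTree: tostring returns bytes unless you pass encoding="unicode". A small sketch of the difference, where the inline <tuv> string and the names raw, text and clean are mine, chosen for illustration:

```python
import xml.etree.ElementTree as ET

tuv = ET.fromstring(
    '<tuv xml:lang="EN">\n'
    '<seg>Modified <ut x="1"/>Denver<ut x="2"/> Score</seg>\n'
    '</tuv>'
)
seg = tuv.find("seg")

# Default encoding: the result is a bytes object on Python 3.
raw = ET.tostring(seg, method="text")

# encoding="unicode" returns a str instead.
text = ET.tostring(seg, encoding="unicode", method="text")

# The newline after </seg> is the element's tail and is included, hence strip().
clean = text.strip()

print(repr(raw))
print(repr(text))
print(clean)
```

With encoding="unicode" the printed result is a plain string rather than a b'...' literal.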
Upvotes: 1