Reputation: 1122
I'm working with XML files in Python. I have a dataset containing sentences in several languages, structured like this:
<corpus>
  <sentence id="0">
    <text lang="de">...</text>
    <text lang="en">...</text>
    <text lang="fr">...</text>
    <!-- Other languages -->
    <annotations>
      <annotation lang="de">...</annotation>
      <annotation lang="en">...</annotation>
      <annotation lang="fr">...</annotation>
      <!-- Other languages -->
    </annotations>
  </sentence>
  <sentence id="1">
    <!-- Other sentence -->
  </sentence>
  <!-- Other sentences -->
</corpus>
What I want is to derive, from this dataset, a new dataset containing only the sentences and annotations in English (the "en" value of the "lang" attribute). I tried this solution:
import xml.etree.ElementTree as ET

tree = ET.parse('samplefile2.xml')
root = tree.getroot()
for sentence in root:
    if sentence.tag == 'sentence':
        for txt in sentence:
            if txt.tag == 'text':
                if txt.attrib['lang'] != 'en':
                    sentence.remove(txt)
            if txt.tag == 'annotations':
                for annotation in txt:
                    if annotation.attrib['lang'] != 'en':
                        txt.remove(annotation)
tree.write('output.xml')
But it seems to work only at the level of the text elements, not at the level of the annotation elements. I even tried replacing the loop variables sentence, txt, annotation in the Python code with index-based access such as root[s], root[s][t], root[s][t][a], but it had no effect. Furthermore, the Python code I provided inserts strings like &#948;&#951;&#956;&#953;&#959;&#965;&#961;&#947;&#943;&#945; at seemingly random places in the XML file (honestly, I don't know whether this is helpful for solving the issue).
So, I strongly believe that the problem lies in the nested tags, but I can't figure it out. Any ideas?
Upvotes: 3
Views: 739
Reputation: 52848
If you're able to use lxml, I think this would be easier using XPath...
XML Input (input.xml)
<corpus>
  <sentence id="0">
    <text lang="de">...</text>
    <text lang="en">...</text>
    <text lang="fr">...</text>
    <!-- Other languages -->
    <annotations>
      <annotation lang="de">...</annotation>
      <annotation lang="en">...</annotation>
      <annotation lang="fr">...</annotation>
      <!-- Other languages -->
    </annotations>
  </sentence>
  <sentence id="1">
    <!-- Other sentence -->
  </sentence>
  <!-- Other sentences -->
</corpus>
Python
from lxml import etree

target_lang = "en"

tree = etree.parse("input.xml")

# Match any element that has a child that has a lang attribute with a value other than
# target_lang. We need this element so we can remove the child from it.
for parent in tree.xpath(f".//*[*[@lang != '{target_lang}']]"):
    # Match the children that have a lang attribute with a value other than target_lang.
    for child in parent.xpath(f"*[@lang != '{target_lang}']"):
        # Remove the child from the parent.
        parent.remove(child)

tree.write("output.xml")
XML Output (output.xml)
<corpus>
  <sentence id="0">
    <text lang="en">...</text>
    <!-- Other languages -->
    <annotations>
      <annotation lang="en">...</annotation>
      <!-- Other languages -->
    </annotations>
  </sentence>
  <sentence id="1">
    <!-- Other sentence -->
  </sentence>
  <!-- Other sentences -->
</corpus>
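If lxml isn't an option, here is a minimal sketch of the same idea using only the standard library's xml.etree.ElementTree. It assumes the loop in the question misbehaves because elements are removed from a parent while that same parent is being iterated, which can skip siblings; looping over list(parent) instead avoids that. The helper name keep_language and the inline sample document are made up for illustration. The &#...; strings are probably not random either: ElementTree serializes to ASCII by default and escapes non-ASCII characters as numeric character references, so asking for an explicit encoding should keep them readable.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped like the corpus in the question.
SAMPLE = """<corpus>
  <sentence id="0">
    <text lang="de">...</text>
    <text lang="en">café</text>
    <text lang="es">...</text>
    <text lang="fr">...</text>
    <annotations>
      <annotation lang="de">...</annotation>
      <annotation lang="en">...</annotation>
      <annotation lang="es">...</annotation>
      <annotation lang="fr">...</annotation>
    </annotations>
  </sentence>
  <sentence id="1"/>
</corpus>"""


def keep_language(root, target_lang="en"):
    """Hypothetical helper: drop every element whose lang differs from target_lang."""
    # Snapshot the traversal so removals cannot disturb it.
    for parent in list(root.iter()):
        # Loop over a copy of the children; removing from the live
        # element while iterating over it would skip siblings.
        for child in list(parent):
            lang = child.get("lang")
            if lang is not None and lang != target_lang:
                parent.remove(child)
    return root


root = keep_language(ET.fromstring(SAMPLE))

# encoding="unicode" returns a str and leaves non-ASCII characters as they are;
# the default (ASCII) would emit numeric character references instead.
result = ET.tostring(root, encoding="unicode")
print(result)
```

When writing to a file instead, tree.write("output.xml", encoding="utf-8", xml_declaration=True) should have the same effect on the character references.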
Upvotes: 1