Reputation: 201
I have a .tmx file and want to extract the text from the seg tag. Because of the inline tags inside it, such as bpt and ept, I cannot extract this text, so I would like to remove the bpt tag completely. I tried the .remove() method, but this also removes the text.
I cannot use BeautifulSoup because my original file is .tmx
Upvotes: 1
Views: 1827
Reputation: 338128
ElementTree does not keep parent references in the XML tree. That's inconvenient but not the end of the world.
But in order to delete any node in an XML document, you need to delete it from its parent, so you need a way to get the parent node.
The easiest approach with ElementTree is to iterate over all potential parents and check each one for children you want to delete.
Assuming <bpt> is always a child of <seg>, that means iterating over the <seg> elements:
for node in root.iter('seg'):
    prev = None
    for child in list(node):
        if child.tag == 'bpt':
            # retain child node's tail, if any
            if child.tail is not None:
                if prev is None:
                    node.text = (node.text if node.text else '') + child.tail
                else:
                    prev.tail = (prev.tail if prev.tail else '') + child.tail
            node.remove(child)
        else:
            prev = child
If <bpt> could be anywhere, change the first line of the loop above so that it iterates over all nodes:
for node in root.iter():
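For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same loop wrapped in a function. The function name remove_keeping_tail, its parameters, and the shortened sample string are my own assumptions for illustration; a real file would be loaded with ET.parse(...).getroot() instead.

```python
import xml.etree.ElementTree as ET

def remove_keeping_tail(root, parent_tag, child_tag):
    """Remove each <child_tag> found directly under a <parent_tag>,
    moving its tail text to the previous sibling or to the parent."""
    for node in root.iter(parent_tag):
        prev = None
        for child in list(node):  # iterate over a copy, since node is mutated
            if child.tag == child_tag:
                if child.tail is not None:
                    if prev is None:
                        node.text = (node.text or '') + child.tail
                    else:
                        prev.tail = (prev.tail or '') + child.tail
                node.remove(child)
            else:
                prev = child

# Hypothetical, shortened sample segment (not the asker's real data).
sample = ('<seg><bpt i="1">{</bpt>Cover page <ept i="1">}</ept>'
          '<bpt i="2">{</bpt>U1 - Insert graphic<ept i="2">}</ept></seg>')
root = ET.fromstring(sample)

remove_keeping_tail(root, 'seg', 'bpt')
result = ET.tostring(root, encoding='unicode')
print(result)
```

Calling the function a second time with 'ept' as the child tag should leave only plain text in the segment, available as root.text.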
Explanation
ElementTree subdivides the document tree in a rather idiosyncratic manner. One main drawback is that there are no "parent" references (relative navigation between nodes is very limited in general); another is that there are no text nodes.
Instead of being a stand-alone node, any text after an element (i.e. text directly following the closing </tag>) becomes a property of that element, called .tail:
<!-- <bpt> elements and their "tails" -->
<seg><bpt i="1">{\\f3 </bpt>Cover page <ept i="1">}</ept><bpt i="2">{\\f2 </bpt>U1 - Insert graphic<ept i="2">}</ept></seg>
<!--                        ^^^^^^^^^^^                                         ^^^^^^^^^^^^^^^^^^^ -->
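A quick way to see this for yourself is to inspect .text and .tail directly. This is only a sketch on a shortened, made-up segment; it also shows why a plain remove() takes the following text with it, which is what the question ran into.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-pair segment, shortened for illustration.
seg = ET.fromstring('<seg><bpt i="1">{</bpt>Cover page <ept i="1">}</ept></seg>')
bpt = seg.find('bpt')

print(repr(seg.text))   # None: nothing sits between <seg> and its first child
print(repr(bpt.text))   # '{'
print(repr(bpt.tail))   # 'Cover page '

# Removing the element without saving its tail drops "Cover page " as well.
seg.remove(bpt)
remaining = ''.join(seg.itertext())
print(repr(remaining))  # '}'
```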
Consequently, if we remove the <bpt> element, its tail is lost, too. To save it, we must append the tail's content to the preceding element's tail (as with "U1 - Insert graphic", which now belongs to the preceding <ept>), or, if there is no preceding element, to the parent element's text (as with "Cover page ", which now belongs to the <seg>):
<!-- <bpt> elements removed, "tails" moved one to the front -->
<seg>Cover page <ept i="1">}</ept>U1 - Insert graphic<ept i="2">}</ept></seg>
<!-- ^^^^^^^^^^^                  ^^^^^^^^^^^^^^^^^^^ -->
Repeating the same removal process with <ept> would lead to the following, with all "tails" now merged into the text of <seg>:
<seg>Cover page U1 - Insert graphic</seg>
<!-- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -->
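If the goal is only to read the text and the tree does not need to be modified, the same result can be assembled without removing anything. This is an alternative to the approach above, not part of it: a sketch that joins the parent's text with each direct child's tail and skips the children's own content. The helper name seg_text is made up, and it assumes the inline tags are direct children of <seg> and that none of their content should be kept.

```python
import xml.etree.ElementTree as ET

def seg_text(seg):
    """Return the text of a <seg>, ignoring the content of its direct children."""
    parts = [seg.text or '']
    parts.extend(child.tail or '' for child in seg)
    return ''.join(parts)

# Hypothetical, shortened sample segment.
sample = ('<seg><bpt i="1">{</bpt>Cover page <ept i="1">}</ept>'
          '<bpt i="2">{</bpt>U1 - Insert graphic<ept i="2">}</ept></seg>')
text = seg_text(ET.fromstring(sample))
print(text)  # Cover page U1 - Insert graphic
```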
Upvotes: 1