Pantagrool

Reputation: 63

Bookending a node with text with Python's elementtree

I'm trying to add text bookends to strings in an XML file. If a string has already been translated, I want to add @@@ to the beginning and ### to the end of the string for further processing. The end result would look like this:

<group>
    <seg-source>
        <mrk mid="1" mtype="seg">I have a <g id="157">red</g> pen.</mrk>
    </seg-source>
    <target>
        <mrk mid="1" mtype="seg">@@@J'ai un stylo <g id="157">rouge</g>.###</mrk>
    </target>
</group>

I previously tried xml.dom.minidom, where I created a generic text node such as start_tag = xmldoc.createTextNode(u'@@@') and was able to insert/append the nodes as child nodes. (I ultimately gave up on minidom for various reasons.)

I was able to convert my script from minidom to ElementTree rather quickly, but I'm getting stuck at this most crucial point. I've read and re-read the documentation but I cannot find anything specific to what I need to do, especially because a lot of the <mrk> elements have sub-elements, such as the <g> tag in the example. Also, sometimes the first thing in a <mrk> node may not be text, so I can't just replace the text.

The Python code is pretty basic and, as you can see, I have placeholders for the bookends.

for target in group.iter('target'):
    for mrk in target.iter('mrk'):

        # Adding "@@@" at front of <mrk>
        mrk.insert(0, <magical text-only element here>)

        # Adding "###" to end of <mrk>
        mrk.append(<magical text-only element here>)

Many thanks!

Upvotes: 3

Views: 213

Answers (1)

bjimba

Reputation: 928

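ElementTree treats text in a very non-XML way: there are no text nodes. Text that comes before an element's first child is stored in that element's text attribute, and text that follows an element's closing tag is stored in that element's tail attribute. There are a couple of tricks involved here. The first is that in <a>xxx<b>yyy</b>zzz<c>eee</c>rrr</a>, the way you get to "zzz" is via the tail of the <b> element. (I know, XSLT mavens are gnashing their teeth at this.)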
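To make the text/tail split concrete, here is a small sketch that parses the <a> example above and prints where each piece of text ends up. The variable names are just for illustration.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>xxx<b>yyy</b>zzz<c>eee</c>rrr</a>")
b = a.find("b")
c = a.find("c")

print(a.text)  # text before the first child of <a>
print(b.text)  # text inside <b>
print(b.tail)  # text right after </b>, still inside <a>
print(c.text)  # text inside <c>
print(c.tail)  # text right after </c>, still inside <a>
print(a.tail)  # nothing follows </a>, so this is None
```

So "the last piece of text inside <a>" is not a.text at all; it is the tail of <a>'s last child.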

Another trick to use is that you can treat ET Elements as if they were a list of child elements. So you can use len(mrk) to get how many children it has (text is not counted, since it is not stored as nodes), and index into it to get a particular child.
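A quick sketch of the list-like behavior, using two <mrk> shapes similar to the ones in the question (the variable names are made up for the example):

```python
import xml.etree.ElementTree as ET

with_child = ET.fromstring('<mrk>I have a <g id="157">red</g> pen.</mrk>')
text_only = ET.fromstring("<mrk>I have a red pen.</mrk>")

# len() counts child elements only; text is never counted
child_count = len(with_child)
leaf_count = len(text_only)

# Negative indexing works as on a list, so [-1] is the last child
last_child = with_child[-1]

print(child_count, leaf_count)
print(last_child.tag, repr(last_child.tail))
```

Checking len(elem) == 0 is therefore a simple way to tell whether the closing bookend belongs on elem.text or on the tail of the last child.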

Here's a quick sample program that seemed to run when I tried it. You will probably want to tweak it to your needs, but it should get you going.

import xml.etree.ElementTree as ET

xmlin="""
    <group>
        <mrk>I have a red pen.</mrk>
        <mrk>I have a <g id="157">red</g> pen.</mrk>
        <mrk><xyzzy>Hey!</xyzzy> I have a <g>red</g> pen.</mrk>
        <mrk>There is text <and>this</and></mrk>
    </group>
"""

root = ET.fromstring(xmlin)

for mrk in root:
    # Prepend "@@@" to the leading text; text is None when a child element comes first
    mrk.text = "@@@" + (mrk.text or "")

    # do we have children?
    if len(mrk) == 0:
        # no children: all of the text lives in mrk.text
        mrk.text += "###"
    else:
        # the trailing text lives in the tail of the last child
        last = mrk[-1]
        last.tail = (last.tail or "") + "###"

# encoding="unicode" returns a str instead of bytes on Python 3
print(ET.tostring(root, encoding="unicode"))
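For the structure in the question, the same idea can be wrapped in a function and applied only to the <mrk> elements under <target>. This is a sketch, not a drop-in solution: add_bookends is a hypothetical helper name, and the XML is the question's sample with the bookends removed.

```python
import xml.etree.ElementTree as ET

def add_bookends(elem, start="@@@", end="###"):
    """Hypothetical helper: wrap the content of elem in start/end markers."""
    # Leading text is stored on the element itself (None if a child comes first)
    elem.text = start + (elem.text or "")
    if len(elem) == 0:
        # No children: the element's text is also its trailing text
        elem.text += end
    else:
        # Trailing text is stored in the tail of the last child
        last = elem[-1]
        last.tail = (last.tail or "") + end

xml_in = """
<group>
    <seg-source>
        <mrk mid="1" mtype="seg">I have a <g id="157">red</g> pen.</mrk>
    </seg-source>
    <target>
        <mrk mid="1" mtype="seg">J'ai un stylo <g id="157">rouge</g>.</mrk>
    </target>
</group>
"""

group = ET.fromstring(xml_in)
for target in group.iter("target"):
    for mrk in target.iter("mrk"):
        add_bookends(mrk)

target_mrk = group.find("target/mrk")
source_mrk = group.find("seg-source/mrk")

# tostring() also emits the element's own tail (whitespace here), hence strip()
print(ET.tostring(target_mrk, encoding="unicode").strip())
```

Only the <target> segment is modified; the <seg-source> segment is left as it was.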

Upvotes: 2
