johan

Reputation: 824

Replacement of ampersands that are part of a numeric character reference by Python's elementtree

I'm using Python's elementtree module for writing some XML (I'm using Python 2.7 and 3.2). The text fields of some of my elements contain numeric character references.

However, once I use elementtree's tostring, all ampersands in the character references are replaced by &amp;. Apparently neither elementtree nor the underlying parser recognises that the ampersands here are part of a numeric character reference.

After some searching I found this: elementtree and entities

However, I'm not keen on this either, as in my current code I foresee that this may end up causing problems of its own. Other than that I found surprisingly little on this, so maybe I'm simply overlooking something obvious?

The following simple test code illustrates the problem (tested using Python 2.7 and 3.2):

import sys
import xml.etree.ElementTree as ET

def main():
    # Text string that contains numeric character reference
    someText = "Str&#246;m"

    # Create element object
    testElement = ET.Element('rubbish')

    # Add someText to element's text attribute
    testElement.text = someText

    # Convert element to xml-formatted text string
    testElementAsString = ET.tostring(testElement, 'ascii', 'xml')

    print(testElementAsString)

    # Result: ampersand replaced with '&amp;': <rubbish>Str&amp;#246;m</rubbish>

main()

If anyone has any ideas or suggestions that would be great!

Upvotes: 4

Views: 3387

Answers (2)

johan

Reputation: 824

Short update to the above: I just had another critical look at my code, and realised there's an even simpler solution (largely based on @Duncan's answer) that at least works for me.

In my original code I was using the character references in order to get an ASCII representation of some ISO 8859-15 (Latin-9) encoded text (which I was reading from a binary file). So the someText variable above actually started its life as a bytes object, which was subsequently decoded as ISO 8859-15 text, and finally transformed to ASCII.

Thanks to @Duncan and @Inerdial I now know that ElementTree can do the conversion to ASCII by itself: when serialising to ASCII, it writes each non-ASCII character as a numeric character reference. After some experimenting I managed to come up with a solution that is stupidly simple to the extent of being almost trivial. However, I imagine that it just might be useful to some, so I decided to share it here anyway:

import sys
import xml.etree.ElementTree as ET

def main():
    # Bytes object
    someBytes = b'Str\xf6m'

    # Decode from ISO 8859-15 (Latin-9) to text
    someText = someBytes.decode('iso-8859-15', 'strict')

    # Create element object
    testElement = ET.Element('rubbish')

    # Add someText to element's text attribute
    testElement.text = someText

    # Convert element to xml-formatted text string
    testElementAsString = ET.tostring(testElement, 'ascii', 'xml').decode('ascii')

    print(testElementAsString)

main()

Note that I added the final .decode("ascii") in order to make this work with Python 3 (where, unlike in Python 2.7, tostring returns a bytes object, so testElementAsString would otherwise print as b'...').
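As a quick check that nothing is lost this way, here is a minimal sketch (assuming Python 3) that serialises the element to ASCII and parses the result back. The variable names are only illustrative, and 'us-ascii' is used instead of 'ascii' purely to keep the XML declaration out of the output:

```python
import xml.etree.ElementTree as ET

# Element whose text contains a real non-ASCII character (ö, U+00F6)
elem = ET.Element('rubbish')
elem.text = b'Str\xf6m'.decode('iso-8859-15')

# Serialising to ASCII makes ElementTree write ö as a numeric character
# reference. With 'us-ascii' no XML declaration is emitted.
serialized = ET.tostring(elem, encoding='us-ascii', method='xml')
print(serialized.decode('ascii'))

# Parsing the serialised bytes resolves the reference back to the character
round_tripped = ET.fromstring(serialized)
print(round_tripped.text == elem.text)
```

So the character reference only exists in the serialised form; inside the element tree the text is always the plain character.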

Thanks again to @Duncan, @Inerdial and @Tomalak for pointing me in the right direction, and @Rik Poggi for correcting the formatting in my original post!

Upvotes: 2

Duncan

Reputation: 95712

You need to decode the character references in your input before handing the text to ElementTree. Here's a function that will decode both numeric character references and HTML named references; it accepts a byte string as input and returns a Unicode string. The code below works for Python 2.7 or 3.x.

import re
try:
    from htmlentitydefs import name2codepoint
except ImportError:
    # Must be Python 3.x
    from html.entities import name2codepoint
    unichr = chr

name2codepoint = name2codepoint.copy()
name2codepoint['apos']=ord("'")

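# Groups: 1 = decimal reference, 2 = hex reference, 3 = named reference
# (names may contain digits, e.g. frac12)
EntityPattern = re.compile(r'&(?:#(\d+)|#[xX]([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));')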

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        return match.group(0)

    return EntityPattern.sub(unescape, s.decode(encoding))

someText = decodeEntities(b"Str&#246;m")
print(someText)

Of course, if you can avoid getting the character references into the string to begin with, that will make your life somewhat easier.
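If you only need Python 3.4 or later, the standard library's html.unescape should cover much the same ground as the function above, though it works on text rather than bytes and follows HTML5 rules, so its handling of edge cases may differ. A minimal sketch, with a made-up input string and illustrative variable names:

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical input mixing numeric and named references
raw = "Str&#246;m &amp; S&oslash;n"

# Decode all references to real characters first
decoded = html.unescape(raw)
print(decoded)  # Ström & Søn

# ElementTree then escapes only what actually needs escaping
elem = ET.Element('rubbish')
elem.text = decoded
out = ET.tostring(elem, encoding='us-ascii').decode('ascii')
print(out)  # <rubbish>Str&#246;m &amp; S&#248;n</rubbish>
```

Decoding first and letting ElementTree do all of the escaping on output avoids the double escaping shown in the question.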

Upvotes: 3
