Replacement of ampersands that are part of a numeric character reference by Python's elementtree

Question

I'm using Python's elementtree module for writing some XML (I'm using Python 2.7 and 3.2). The text fields of some of my elements contain numeric character references.

However, once I use elementtree's tostring, all ampersands in the character references are replaced by &amp;amp;. Apparently neither elementtree nor the underlying parser recognises that the ampersands here are part of a numeric character reference.

After some searching I found this: "elementtree and entities".

However, I'm not keen on this either, as in my current code I foresee that this may end up causing problems of its own. Other than that I found surprisingly little on this, so maybe I'm simply overlooking something obvious?

The following simple test code illustrates the problem (tested using Python 2.7 and 3.2):

import sys
import xml.etree.ElementTree as ET

def main():
    # Text string that contains numeric character reference
    someText = "Str&#246;m"

    # Create element object
    testElement = ET.Element('rubbish')

    # Add someText to element's text attribute
    testElement.text = someText

    # Convert element to xml-formatted text string 
    testElementAsString = ET.tostring(testElement,'ascii', 'xml')

    print(testElementAsString)

    # Result: ampersand replaced with '&amp;': Str&amp;#246;m

main()

If anyone has any ideas or suggestions that would be great!

Duncan · Accepted Answer

You need to decode the character references in your input before handing the text to elementtree. Here's a function that decodes both numeric character references and HTML named entity references; it accepts a byte string as input and returns a unicode string. The code below works for Python 2.7 or 3.x.

import re
try:
    from htmlentitydefs import name2codepoint
except ImportError:
    # Must be Python 3.x
    from html.entities import name2codepoint
    unichr = chr

name2codepoint = name2codepoint.copy()
name2codepoint['apos']=ord("'")

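# Raw string avoids invalid-escape warnings; named entities may contain digits (e.g. frac12)
EntityPattern = re.compile(r'&(?:#(\d+)|#x([\da-fA-F]+)|([a-zA-Z][a-zA-Z0-9]*));')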

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        return match.group(0)

    return EntityPattern.sub(unescape, s.decode(encoding))

someText = decodeEntities(b"Str&#246;m")
print(someText)
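
As an aside, if you only need to support Python 3.4 or later (an assumption; the question mentions 3.2, where this is not available), the standard library's html.unescape does much the same job for text strings. This is a rough sketch of that alternative, not a drop-in replacement for the function above, since it takes a str rather than a byte string:

```python
import html  # html.unescape requires Python 3.4 or later

# Decimal, hexadecimal and named references all decode to the same character
decoded_decimal = html.unescape("Str&#246;m")
decoded_hex = html.unescape("Str&#xF6;m")
decoded_named = html.unescape("Str&ouml;m")

print(decoded_decimal == decoded_hex == decoded_named)
```

Note that html.unescape also decodes references such as &amp;amp; and &amp;lt;, which is harmless here because elementtree escapes those characters again when it serialises.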

Of course, if you can avoid getting the character reference in the string to begin with that will make your life somewhat easier.
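
To illustrate that last point, here is a minimal sketch reusing the element name from the question. If the text attribute holds the actual character rather than a reference, tostring with the 'ascii' encoding should emit the numeric character reference for you, with no escaped ampersand:

```python
import xml.etree.ElementTree as ET

# Store the real character (o with diaeresis, U+00F6), not a character reference
element = ET.Element('rubbish')
element.text = u"Str\u00f6m"

# The 'ascii' encoding cannot represent U+00F6 directly, so elementtree
# writes it out as a numeric character reference
serialized = ET.tostring(element, 'ascii', 'xml')
print(serialized)
```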
