Reputation: 824
I'm using Python's ElementTree module to write some XML (with Python 2.7 and 3.2). The text fields of some of my elements contain numeric character references.
However, once I use ElementTree's tostring, all ampersands in the character references are replaced by &amp;. Apparently ElementTree or the underlying parser does not recognise that the ampersands here are part of a numeric character reference.
After some searching I found this: elementtree and entities
However, I'm not keen on this either, as in my current code I foresee that this may end up causing problems of its own. Other than that I found surprisingly little on this, so maybe I'm simply overlooking something obvious?
The following simple test code illustrates the problem (tested using Python 2.7 and 3.2):
import sys
import xml.etree.ElementTree as ET

def main():
    # Text string that contains a numeric character reference
    someText = "Str&#246;m"
    # Create element object
    testElement = ET.Element('rubbish')
    # Add someText to element's text attribute
    testElement.text = someText
    # Convert element to xml-formatted text string
    testElementAsString = ET.tostring(testElement, 'ascii', 'xml')
    print(testElementAsString)
    # Result: ampersand replaced with '&amp;': <rubbish>Str&amp;#246;m</rubbish>

main()
If anyone has any ideas or suggestions that would be great!
Upvotes: 4
Views: 3387
Reputation: 824
Short update to the above: I just had another critical look at my code, and realised there's an even simpler solution (largely based on @Duncan's answer) that at least works for me.
In my original code I was using the numeric character references in order to get an ASCII representation of some ISO 8859-15 (Latin-9) encoded text, which I was reading from a binary file. So the someText variable above actually started its life as a bytes object, which was subsequently decoded as ISO 8859-15 text and finally transformed to ASCII.
Thanks to @Duncan and @Inerdial I now know that ElementTree can do the conversion from ISO 8859-15 text to ASCII by itself. After some experimenting I came up with a solution that is almost trivially simple. However, I imagine that it might be useful to some, so I decided to share it here anyway:
import sys
import xml.etree.ElementTree as ET

def main():
    # Bytes object
    someBytes = b'Str\xf6m'
    # Decode the ISO 8859-15 (Latin-9) bytes to text
    someText = someBytes.decode('iso-8859-15', 'strict')
    # Create element object
    testElement = ET.Element('rubbish')
    # Add someText to element's text attribute
    testElement.text = someText
    # Convert element to xml-formatted text string
    testElementAsString = ET.tostring(testElement, 'ascii', 'xml').decode('ascii')
    print(testElementAsString)

main()
Note that I added the final .decode("ascii") in order to make this work with Python 3, where ET.tostring (unlike in Python 2.7) returns a bytes object, so testElementAsString would otherwise not be a text string.
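As a quick sanity check on the approach above, here is a minimal sketch (Python 3 only; the variable names are mine, not from the original code) that parses the serialized string back in and confirms that the character survives the round trip. It assumes nothing beyond the standard library.

```python
import xml.etree.ElementTree as ET

# Text containing a non-ASCII character (U+00F6), decoded from ISO 8859-15 bytes
some_text = b'Str\xf6m'.decode('iso-8859-15')

element = ET.Element('rubbish')
element.text = some_text

# Serializing to ASCII makes ElementTree write the character as the
# numeric character reference &#246;
serialized = ET.tostring(element, encoding='ascii', method='xml').decode('ascii')

# Parsing the serialized text resolves the reference back to the character
round_tripped = ET.fromstring(serialized)

print(serialized)
print(round_tripped.text == some_text)
```

The point is that the element's text should hold the real character; the reference only appears at serialization time, when the target encoding cannot represent it.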
Thanks again to @Duncan, @Inerdial and @Tomalak for pointing me in the right direction, and @Rik Poggi for correcting the formatting in my original post!
Upvotes: 2
Reputation: 95712
You need to decode the character references in your input before handing the text to ElementTree. Here's a function that will decode both numeric character references and HTML named entity references; it accepts a byte string as input and returns a unicode string. The code below works for Python 2.7 or 3.x.
import re
try:
    from htmlentitydefs import name2codepoint
except ImportError:
    # Must be Python 3.x
    from html.entities import name2codepoint
    unichr = chr

# Work on a copy so the module-level table is left untouched
name2codepoint = name2codepoint.copy()
name2codepoint['apos'] = ord("'")

# Matches decimal (&#246;), hexadecimal (&#xf6;) and named (&ouml;) references
EntityPattern = re.compile(r'&(?:#(\d+)|(?:#x([\da-fA-F]+))|([a-zA-Z]+));')

def decodeEntities(s, encoding='utf-8'):
    def unescape(match):
        code = match.group(1)
        if code:
            return unichr(int(code, 10))
        else:
            code = match.group(2)
            if code:
                return unichr(int(code, 16))
            else:
                code = match.group(3)
                if code in name2codepoint:
                    return unichr(name2codepoint[code])
        # Unknown named reference: leave it unchanged
        return match.group(0)
    return EntityPattern.sub(unescape, s.decode(encoding))

someText = decodeEntities(b"Str&#246;m")
print(someText)
Of course, if you can avoid getting the character reference in the string to begin with that will make your life somewhat easier.
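If you only need Python 3.4 or later and your input is already a str rather than a byte string, the standard library's html.unescape covers roughly the same ground as the function above. A minimal sketch, with variable names of my own choosing, that decodes the reference first and then lets ElementTree write it back out on serialization:

```python
import html
import xml.etree.ElementTree as ET

# Input text that still contains a numeric character reference
raw_text = "Str&#246;m"

# Replace character and entity references with the characters they stand for
decoded = html.unescape(raw_text)

element = ET.Element('rubbish')
element.text = decoded

# With an ASCII target encoding, ElementTree emits &#246; for the
# non-ASCII character instead of escaping a literal ampersand
serialized = ET.tostring(element, encoding='ascii').decode('ascii')

print(decoded)
print(serialized)
```

Note that html.unescape follows HTML5 rules, so it is somewhat more lenient than the regular expression above (for example, it accepts some references without a trailing semicolon). Whether that matters depends on your input.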
Upvotes: 3