Reputation: 5409
I'm looking for a way to write and read byte data in xml files. I'd like the xml files to be human readable, so I'd like to avoid base64 encoding or anything similar. I figured I could do something like this. If I have a string b'abc < ABC\x04&'
that needs to go into a tag <node>
, then I'd write that as
<node>abc < ABC&</node>
Is there a way to make this kind of encoding work with any xml library in python3? I'd prefer lxml
, but that's not a requirement.
Clarification: When I write an xml file, the strings are initially of type bytes
, e.g. b'abc < ABC\x04&'
. In a lot of cases they only contain alphanumeric ascii characters, which I want written to xml as such. Other bytes I want to encode as hex values, so they can still be easily understood. And I'd like to encode characters like >
and &
as >
and &
(or else also as hex values) to avoid using <![CDATA[<]]>
. When I read the strings, I'd like them to be converted back to b'...'
, if possible.
Upvotes: 1
Views: 5811
Reputation: 114088
Im pretty sure there is no builtins that accomplish exactly what you ask
I think the best you could do is just iterate over the characters and "fix" each one (see example that i think is complete)
try: # python2
from htmlentitydefs import codepoint2name
except: # python3
from html.entities import codepoint2name
def encode_xml(c):
# return the character or its &#XX; or &entity; representation
ascii_val = ord(c)
known_entity = codepoint2name.get(ascii_val,None)
if known_entity: # this is a named codepoint
return "&%s;"%(known_entity,)
# printable characters are ascii values [32..127] inclusive
is_normal_character = 32 <= ascii_val <= 127
if is_normal_character:
return c
return hex(ascii_val).replace("0x","&#")+";"
def make_xml_entity_string(s):
return "".join(encode_xml(c) for c in s)
print("R:", make_xml_entity_string( 'abc < ABC\x14\xF2&'))
you could then go the otherway ... in roughly the same manner (taking advantage of regex this time though)
try: # python2
from htmlentitydefs import name2codepoint
except: # python3
from html.entities import name2codepoint
import re
def decode_xml_replacer(match):
name=match.group(1);
if(name.startswith("#")):
return chr(int(name[1:],16))
return chr(name2codepoint.get(name,'?'))
def decode_xml_string(s):
return re.sub("&(.*?);",decode_xml_replacer,s)
... note that this wont work for codepoints > 255 i think
Upvotes: 1