SU3

Reputation: 5409

encode and decode bytes in xml strings

I'm looking for a way to write and read byte data in xml files. I'd like the xml files to be human readable, so I'd like to avoid base64 encoding or anything similar. I figured I could do something like this. If I have a string b'abc < ABC\x04&' that needs to go into a tag <node>, then I'd write that as

<node>abc &lt; ABC&#x04;&amp;</node>

Is there a way to make this kind of encoding work with any xml library in python3? I'd prefer lxml, but that's not a requirement.

Clarification: When I write an xml file, the strings are initially of type bytes, e.g. b'abc < ABC\x04&'. In a lot of cases they only contain alphanumeric ascii characters, which I want written to xml as such. Other bytes I want to encode as hex values, so they can still be easily understood. And I'd like to encode characters like > and & as &gt; and &amp; (or else also as hex values) to avoid using <![CDATA[<]]>. When I read the strings, I'd like them to be converted back to b'...', if possible.

Upvotes: 1

Views: 5811

Answers (1)

Joran Beasley

Reputation: 114088

I'm pretty sure there are no built-ins that accomplish exactly what you ask.
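The closest standard-library helper I'm aware of is `xml.sax.saxutils.escape`, and it only covers part of the problem. A quick check, using a str version of the bytes from the question, suggests it replaces just `&`, `<` and `>` and passes a control character such as `\x04` through unchanged:

```python
from xml.sax.saxutils import escape, unescape

raw = 'abc < ABC\x04&'

# escape() only rewrites the markup characters &, < and >
escaped = escape(raw)

# unescape() reverses exactly those three replacements
restored = unescape(escaped)

print(repr(escaped))   # the \x04 character is still there, unescaped
print(repr(restored))
```

So it handles the `&lt;` / `&amp;` half of the request, but not the hex encoding of the remaining bytes.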

I think the best you could do is iterate over the characters and "fix" each one (see the example below, which I think is complete):

# XML itself predefines only these five named entities; HTML names such as
# &ograve; are undefined in plain XML, so everything else gets a hex reference
XML_ENTITIES = {"&": "amp", "<": "lt", ">": "gt", '"': "quot", "'": "apos"}

def encode_xml(c):
  # return the character or its &#xXX; or &entity; representation
  known_entity = XML_ENTITIES.get(c)
  if known_entity: # this is a predefined named entity
    return "&%s;" % (known_entity,)
  # printable ASCII characters are the values [32..126] inclusive (127 is DEL)
  if 32 <= ord(c) <= 126:
    return c
  # hex references need the "x": &#x14; (a bare &#14; would be read as decimal)
  return "&#x%02X;" % ord(c)


def make_xml_entity_string(s):
  return "".join(encode_xml(c) for c in s)

print("R:", make_xml_entity_string('abc < ABC\x14\xF2&'))

You could then go the other way in roughly the same manner (taking advantage of regex this time, though):

import re

XML_ENTITY_CHARS = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

def decode_xml_replacer(match):
  name = match.group(1)
  if name.startswith("#x"): # hexadecimal reference, e.g. &#x14;
    return chr(int(name[2:], 16))
  if name.startswith("#"): # decimal reference, e.g. &#20;
    return chr(int(name[1:]))
  # unknown entity names become "?"
  return XML_ENTITY_CHARS.get(name, "?")

def decode_xml_string(s):
  return re.sub(r"&(#x[0-9A-Fa-f]+|#[0-9]+|\w+);", decode_xml_replacer, s)

Note that under Python 2 this won't work for code points > 255, I think, since chr() only accepts values from 0 to 255 there.
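Since the question starts from `bytes` rather than `str`, here is a minimal Python 3 sketch of the same idea working on bytes directly. The names `bytes_to_xml_text` and `xml_text_to_bytes` are made up for illustration. It assumes only `&`, `<` and `>` need a named entity in element text, and that every other non-printable byte becomes a two-digit hex reference:

```python
import re

# Markup characters that must be escaped in element text (keyed by byte value)
_ESCAPES = {ord("&"): "&amp;", ord("<"): "&lt;", ord(">"): "&gt;"}

# The five entity names XML predefines, mapped back to byte values
_NAMED = {"amp": 38, "lt": 60, "gt": 62, "quot": 34, "apos": 39}

_REFERENCE = re.compile(r"&(#x[0-9A-Fa-f]+|#[0-9]+|[A-Za-z]+);")


def bytes_to_xml_text(data):
    """Escape bytes as ASCII text: entity, printable ASCII, or &#xNN;."""
    parts = []
    for byte in data:  # iterating over bytes yields ints in Python 3
        if byte in _ESCAPES:
            parts.append(_ESCAPES[byte])
        elif 32 <= byte <= 126:
            parts.append(chr(byte))
        else:
            parts.append("&#x%02X;" % byte)
    return "".join(parts)


def xml_text_to_bytes(text):
    """Inverse of bytes_to_xml_text; raises if a value does not fit in a byte."""
    def replace(match):
        name = match.group(1)
        if name.startswith("#x"):
            value = int(name[2:], 16)
        elif name.startswith("#"):
            value = int(name[1:])
        else:
            value = _NAMED[name]  # KeyError for unknown entity names
        if value > 255:
            raise ValueError("%s does not fit in one byte" % match.group(0))
        return chr(value)
    # Every character is now in 0..255, so latin-1 maps them 1:1 onto bytes
    return _REFERENCE.sub(replace, text).encode("latin-1")


encoded = bytes_to_xml_text(b'abc < ABC\x04&')
print(encoded)
print(xml_text_to_bytes(encoded))
```

One caveat, as far as I know: XML 1.0 does not allow most control characters (such as `\x04`) even as character references, so a conforming parser like lxml will probably reject `&#x04;` when it reads the file. These helpers work on the raw markup text, not on values that have already passed through a parser.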

Upvotes: 1
