Reputation: 343
I'm trying to port some code to Python 3 that passes a parser created by the xml.sax.make_parser function as the second argument to xml.dom.minidom.parseString to parse an XML document.
In Python 3 the parser seems to be unable to parse an XML document given as bytes, but I can't know the encoding of the XML document before parsing it. To demonstrate:
import xml.sax
import xml.dom.minidom

def try_parse(input, parser=None):
    try:
        xml.dom.minidom.parseString(input, parser)
    except Exception as ex:
        print(ex)
    else:
        print("OK")

euro = u"\u20AC"  # U+20AC EURO SIGN
xml_utf8 = b"<?xml version=\"1.0\" encoding=\"utf-8\"?>"
xml_cp1252 = b"<?xml version=\"1.0\" encoding=\"windows-1252\"?>"
test_cases = [
    b"<a>" + euro.encode("utf-8") + b"</a>",
    u"<a>" + euro + u"</a>",
    xml_utf8 + b"<a>" + euro.encode("utf-8") + b"</a>",
    xml_cp1252 + b"<a>" + euro.encode("cp1252") + b"</a>",
]

for i, case in enumerate(test_cases, 1):
    print("%d: %r" % (i, case))
    try_parse(case)                          # default parser
    try_parse(case, xml.sax.make_parser())   # explicit SAX parser
Python 2:
1: '<a>\xe2\x82\xac</a>'
OK
OK
2: u'<a>\u20ac</a>'
'ascii' codec can't encode character u'\u20ac' in position 3: ordinal not in range(128)
'ascii' codec can't encode character u'\u20ac' in position 3: ordinal not in range(128)
3: '<?xml version="1.0" encoding="utf-8"?><a>\xe2\x82\xac</a>'
OK
OK
4: '<?xml version="1.0" encoding="windows-1252"?><a>\x80</a>'
OK
OK
Python 3:
1: b'<a>\xe2\x82\xac</a>'
OK
initial_value must be str or None, not bytes
2: '<a>€</a>'
OK
OK
3: b'<?xml version="1.0" encoding="utf-8"?><a>\xe2\x82\xac</a>'
OK
initial_value must be str or None, not bytes
4: b'<?xml version="1.0" encoding="windows-1252"?><a>\x80</a>'
OK
initial_value must be str or None, not bytes
As you can see, the default parser handles the bytes just fine, but I need the SAX parser to handle parameter entities. Is there any solution to this problem (other than trying to guess the encoding of the bytes before parsing)?
Upvotes: 3
Views: 1917
Reputation: 343
I seem to have found the cause of the problem. When a parser is supplied, xml.dom.minidom.parseString calls xml.dom.pulldom.parseString (via _do_pulldom_parse), which then tries to construct a StringIO to hold the XML document while parsing. Swapping out that StringIO for a BytesIO solves the problem, so I guess I will use the following as a workaround:
from io import StringIO, BytesIO
import xml.dom.minidom
import xml.dom.pulldom

def parseMaybeBytes(string, parser):
    # Pick the in-memory stream type that matches the input type.
    bufsize = len(string)
    stream_class = BytesIO if isinstance(string, bytes) else StringIO
    buf = stream_class(string)
    return xml.dom.pulldom.DOMEventStream(buf, parser, bufsize)

def parseString(string, parser=None):
    """Parse a file into a DOM from a string."""
    if parser is None:
        return xml.dom.minidom.parseString(string)
    # Note: _do_pulldom_parse is a private minidom helper.
    return xml.dom.minidom._do_pulldom_parse(parseMaybeBytes, (string,),
                                             {'parser': parser})
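The cause can also be seen in isolation, and there may be a way to get the same effect without the private _do_pulldom_parse helper. The following is only a sketch under assumptions: parse_bytes_or_str is a hypothetical name, and it relies on xml.dom.minidom.parse passing a file-like object through to pulldom unchanged when a parser is given, which is how CPython 3 appears to behave.

```python
import io
import xml.dom.minidom
import xml.sax

# Cause in isolation: StringIO only accepts str, so bytes input is rejected.
try:
    io.StringIO(b"<a/>")
except TypeError as ex:
    stringio_error = str(ex)


def parse_bytes_or_str(data, parser=None):
    """Hypothetical helper: parse str or bytes with an optional SAX parser."""
    if parser is None or isinstance(data, str):
        return xml.dom.minidom.parseString(data, parser)
    # Assumption: minidom.parse() hands the file object to pulldom as-is,
    # so the SAX parser is fed raw bytes and can honour the XML
    # encoding declaration itself.
    return xml.dom.minidom.parse(io.BytesIO(data), parser)


doc = parse_bytes_or_str(
    b'<?xml version="1.0" encoding="utf-8"?><a>\xe2\x82\xac</a>',
    xml.sax.make_parser(),
)
# Join all text children in case the parser delivers the text in chunks.
text = "".join(node.data for node in doc.documentElement.childNodes)
```

Here text should hold the decoded euro sign, and stringio_error holds the same message seen in the Python 3 output of the question. I have only sketched this against the UTF-8 test case, so treat it as a starting point rather than a verified replacement for the workaround above.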
Upvotes: 1