Parsing XML using SAX Parser in Python 3

Question

I'm trying to port some code to Python 3 that passes a parser created by the xml.sax.make_parser function as a second argument to xml.dom.minidom.parseString to parse an XML document.

In Python 3 the parser seems unable to parse an XML document passed as bytes, but I can't know the encoding of the XML document before parsing it. To demonstrate:

import xml.sax
import xml.dom.minidom

def try_parse(input, parser=None):
    try:
        xml.dom.minidom.parseString(input, parser)
    except Exception as ex:
        print(ex)
    else:
        print("OK")

euro = u"\u20AC" # U+20AC EURO SIGN
xml_utf8 = b'<?xml version="1.0" encoding="utf-8"?>'
xml_cp1252 = b'<?xml version="1.0" encoding="cp1252"?>'

test_cases = [
    b"<a>" + euro.encode("utf-8") + b"</a>",
    u"<a>" + euro + u"</a>",
    xml_utf8 + b"<a>" + euro.encode("utf-8") + b"</a>",
    xml_cp1252 + b"<a>" + euro.encode("cp1252") + b"</a>",
]

for i, case in enumerate(test_cases, 1):
    print("%d: %r" % (i, case))
    try_parse(case)
    try_parse(case, xml.sax.make_parser())

Python 2:

1: '<a>\xe2\x82\xac</a>'
OK
OK
2: u'<a>\u20ac</a>'
'ascii' codec can't encode character u'\u20ac' in position 3: ordinal not in range(128)
'ascii' codec can't encode character u'\u20ac' in position 3: ordinal not in range(128)
3: '<?xml version="1.0" encoding="utf-8"?><a>\xe2\x82\xac</a>'
OK
OK
4: '<?xml version="1.0" encoding="cp1252"?><a>\x80</a>'
OK
OK

Python 3:

1: b'<a>\xe2\x82\xac</a>'
OK
initial_value must be str or None, not bytes
2: '<a>€</a>'
OK
OK
3: b'<?xml version="1.0" encoding="utf-8"?><a>\xe2\x82\xac</a>'
OK
initial_value must be str or None, not bytes
4: b'<?xml version="1.0" encoding="cp1252"?><a>\x80</a>'
OK
initial_value must be str or None, not bytes

As you can see, the default parser is able to handle the bytes just fine, but I need the SAX parser to handle parameter entities. Is there any solution to this problem (other than trying to guess the encoding of the bytes before parsing)?

hackedd · Accepted Answer

I seem to have found the cause of the problem. When a parser is supplied, xml.dom.minidom.parseString calls xml.dom.pulldom.parseString (via _do_pulldom_parse), which then tries to construct a StringIO to hold the XML document while parsing. StringIO only accepts str, so passing bytes raises the "initial_value must be str or None, not bytes" error. Swapping out that StringIO for a BytesIO solves the problem, so I guess I will use the following as a workaround:
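To see the failure in isolation, here is a minimal sketch that needs no XML parsing at all. The helper name open_buffer is made up for illustration and is not part of the standard library; it only shows the type check the workaround relies on.

```python
import io

xml_bytes = b"<a>\xe2\x82\xac</a>"

def open_buffer(data):
    """Hypothetical helper: pick the in-memory stream class matching the input type."""
    return io.BytesIO(data) if isinstance(data, bytes) else io.StringIO(data)

# StringIO refuses bytes outright; this is the TypeError that surfaces
# when bytes are passed to the SAX-backed minidom.parseString.
stringio_error = None
try:
    io.StringIO(xml_bytes)
except TypeError as ex:
    stringio_error = str(ex)

# Choosing the stream class by input type avoids the error for both kinds of input.
bytes_buffer = open_buffer(xml_bytes)
text_buffer = open_buffer("<a>text</a>")
```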

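import xml.dom.minidom
import xml.dom.pulldom  # not imported by xml.dom.minidom at module load
from io import StringIO, BytesIO

def parseMaybeBytes(string, parser):
    # Same as xml.dom.pulldom.parseString, but picks the buffer class
    # that matches the input type instead of always using StringIO.
    bufsize = len(string)
    stream_class = BytesIO if isinstance(string, bytes) else StringIO
    buf = stream_class(string)
    return xml.dom.pulldom.DOMEventStream(buf, parser, bufsize)

def parseString(string, parser=None):
    """Parse a file into a DOM from a string."""
    if parser is None:
        return xml.dom.minidom.parseString(string)

    # _do_pulldom_parse is a private minidom helper, so this may break
    # in future Python versions.
    return xml.dom.minidom._do_pulldom_parse(parseMaybeBytes, (string,),
                                             {'parser': parser})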
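If relying on the private _do_pulldom_parse helper is a concern, the same idea can probably be expressed through the public xml.dom.pulldom.parse function, which accepts a file-like object. The sketch below is an alternative to the code above, not the approach from this answer, and the name parse_bytes_with_sax is made up for illustration. It assumes the input is bytes and that the parser is an incremental SAX parser such as the one returned by xml.sax.make_parser.

```python
import io
import xml.sax
import xml.dom.pulldom

def parse_bytes_with_sax(data, parser=None):
    """Hypothetical helper: build a minidom Document from bytes via a SAX parser."""
    if parser is None:
        parser = xml.sax.make_parser()
    # pulldom.parse reads from any file-like object, so BytesIO lets the
    # parser see the raw bytes and honour the encoding in the XML declaration.
    events = xml.dom.pulldom.parse(io.BytesIO(data), parser)
    _event, document = events.getEvent()  # first event is START_DOCUMENT
    events.expandNode(document)           # pull the rest of the tree into the document
    events.clear()
    return document

doc = parse_bytes_with_sax(b'<?xml version="1.0" encoding="cp1252"?><a>\x80</a>')
root_xml = doc.documentElement.toxml()
```

The caller can pass its own configured parser as the second argument, for example one with extra SAX features enabled.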
