How to get around Unicode errors in xml.etree.ElementTree.iterparse()?

Question

I am reading a very large (multi-gigabyte) XML file with iterparse() from Python's xml.etree.ElementTree module. Some of the file's text occasionally triggers Unicode errors (or at least what Python 3 treats as Unicode errors). My loop is set up like this:

import xml.etree.ElementTree as etree

def foo():
    # ...
    f = open(filename, encoding='utf-8')
    xmlit = iter(etree.iterparse(f, events=('start', 'end')))
    (event, root) = next(xmlit)
    for (event, elem) in xmlit: # line 26
        if event != 'end':
            continue
        if elem.tag == 'foo':
            do_something()
            root.clear()
        elif elem.tag == 'bar':
            do_something_else()
            root.clear()
    # ...

When the element with the Unicode error is encountered, I get an error with the following traceback:

Traceback (most recent call last):
  File "", line 26, in foo
    for (event, elem) in xmlit:
  File "C:\Python32\lib\xml\etree\ElementTree.py", line 1314, in __next__
    self._parser.feed(data)
  File "C:\Python32\lib\xml\etree\ElementTree.py", line 1668, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 16383: surrogates not allowed

Because the exception is raised while the for statement fetches the next item, not inside the loop body, the only place I can put a try block is around the whole for loop, which means I cannot continue to the next XML element.
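To make the point about the try block concrete, here is a minimal sketch that unrolls the for statement into explicit next() calls so the try wraps each step. The helper name collect_until_error and the sample input are hypothetical and only for illustration. Note that this only localizes the failure: I am assuming the parser cannot reliably resume after an error, so the sketch stops at the first one.

```python
import io
import xml.etree.ElementTree as etree

def collect_until_error(source):
    """Hypothetical helper: pull events one at a time so try wraps each step."""
    tags = []
    error = None
    xmlit = iter(etree.iterparse(source, events=('end',)))
    while True:
        try:
            event, elem = next(xmlit)
        except StopIteration:
            # Normal end of the document.
            break
        except (etree.ParseError, UnicodeError) as exc:
            # Assumption: the parser state is unusable after an error, so stop here.
            error = exc
            break
        tags.append(elem.tag)
    return tags, error

# Junk after the root element, used only to force a parser error.
tags, error = collect_until_error(io.BytesIO(b'<root><foo>a</foo></root>junk'))
```

Everything parsed before the failure is still available in tags, and error holds the exception that ended the loop.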

My priorities for a solution are as follows:

1. Receive a not-necessarily-valid Unicode string as the element's text, instead of having an exception raised.
2. Receive a valid Unicode string with the invalid character replaced or removed.
3. Skip the element with the invalid character and move on to the next one.

How can I implement any of these solutions without modifying the ElementTree code myself?
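For reference, the second priority (invalid characters replaced) could look something like the sketch below. It assumes the file is meant to be UTF-8 and that the problem is bytes that do not decode cleanly. The class SanitizedUtf8Reader and the function element_texts are hypothetical names, not part of ElementTree. The idea is to open the file in binary mode, decode it leniently, and hand the parser clean UTF-8 bytes. It would not help with characters that decode fine but are illegal in XML, such as most control characters.

```python
import codecs
import io
import xml.etree.ElementTree as etree

class SanitizedUtf8Reader:
    """Hypothetical file-like wrapper whose read() returns valid UTF-8 bytes."""

    def __init__(self, raw):
        self._raw = raw  # a binary stream, e.g. open(filename, 'rb')
        # errors='replace' turns undecodable bytes into U+FFFD.
        self._decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')

    def read(self, size=-1):
        while True:
            chunk = self._raw.read(size)
            final = not chunk
            text = self._decoder.decode(chunk, final)
            # Keep reading if the chunk held only part of a multi-byte sequence.
            if text or final:
                return text.encode('utf-8')

def element_texts(raw):
    """Collect (tag, text) for the 'foo' and 'bar' elements of a binary stream."""
    texts = []
    for event, elem in etree.iterparse(SanitizedUtf8Reader(raw), events=('end',)):
        if elem.tag in ('foo', 'bar'):
            texts.append((elem.tag, elem.text))
            elem.clear()
    return texts

# Sample input with one byte (0xFF) that is never valid in UTF-8.
sample = io.BytesIO(b'<root><foo>ok</foo><bar>bad\xffbyte</bar></root>')
result = element_texts(sample)
```

With a real file, the wrapper would be given open(filename, 'rb') instead of the in-memory sample.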
