I am reading a ginormous (multi-gigabyte) XML file using Python's xml.etree.ElementTree module's iterparse() method. The problem is there are occasional Unicode errors (or at least what Python 3 thinks are Unicode errors) in some of the XML file's text. My loop is set up like this:
import xml.etree.ElementTree as etree

def foo():
    # ...
    f = open(filename, encoding='utf-8')
    xmlit = iter(etree.iterparse(f, events=('start', 'end')))
    (event, root) = next(xmlit)
    for (event, elem) in xmlit: # line 26
        if event != 'end':
            continue
        if elem.tag == 'foo':
            do_something()
            root.clear()
        elif elem.tag == 'bar':
            do_something_else()
            root.clear()
        # ...
When the element with the Unicode error is encountered, I get an error with the following traceback:
Traceback (most recent call last):
  File "<path to above file>", line 26, in foo
    for (event, elem) in xmlit:
  File "C:\Python32\lib\xml\etree\ElementTree.py", line 1314, in __next__
    self._parser.feed(data)
  File "C:\Python32\lib\xml\etree\ElementTree.py", line 1668, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 16383: surrogates not allowed
Since the error occurs in between for loop iterations, the only place I can wrap a try block is outside the for loop, which would mean I cannot continue to the next XML element.
My priorities for a solution are as follows:
How can I implement any of these solutions, without going and modifying the ElementTree
code myself?
Upvotes: 4
Views: 6410
Reputation: 365707
First, all the stuff about ElementTree is probably irrelevant here. Try just enumerating the file returned by f = open(filename, encoding='utf-8'), and you will probably get the same error.
If so, the solution is to override the default encoding error handler, as explained in the docs:
errors is an optional string that specifies how encoding and decoding errors are to be handled–this cannot be used in binary mode. Pass 'strict' to raise a ValueError exception if there is an encoding error (the default of None has the same effect), or pass 'ignore' to ignore errors. (Note that ignoring encoding errors can lead to data loss.) 'replace' causes a replacement marker (such as '?') to be inserted where there is malformed data. When writing, 'xmlcharrefreplace' (replace with the appropriate XML character reference) or 'backslashreplace' (replace with backslashed escape sequences) can be used. Any other error handling name that has been registered with codecs.register_error() is also valid.
So, you can do this:
f = open(filename, encoding='utf-8', errors='replace')
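This fits your second priority: the invalid characters will be replaced. (When decoding, the replacement marker is U+FFFD, the Unicode replacement character, rather than a literal '?'.)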
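The effect of the error handlers can be checked on an in-memory byte string before touching the real file. This is a minimal sketch; the sample bytes are made up for illustration, and bytes.decode() is used here because it takes the same errors argument as open().

```python
# Hypothetical sample data: valid ASCII plus one byte (0xFF) that can never
# appear in well-formed UTF-8.
raw = b'<foo>ab\xffcd</foo>'

# Strict decoding (the default) raises UnicodeDecodeError.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD for the undecodable byte.
replaced = raw.decode('utf-8', errors='replace')

# errors='ignore' silently drops the bad byte instead (data loss).
ignored = raw.decode('utf-8', errors='ignore')

print(strict_failed, ascii(replaced), ascii(ignored))
```

Whether 'replace' or 'ignore' is the better fit depends on whether you want the damage to stay visible in the parsed text.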
There is no way to fit your first priority, because there's no way to represent a "not-necessarily-valid Unicode string". A Unicode string is, by definition, a sequence of Unicode code points, and that's how Python treats the str type. If you have invalid UTF-8 and want to turn that into a string, you need to specify how it should be turned into a string, and that's what errors is for.
You could, alternatively, open the file in binary mode, and leave the UTF-8 alone as a bytes object instead of trying to turn it into a Unicode str object, but then you can only use APIs that work with bytes objects. (I believe the lxml implementation of ElementTree can actually do this, but the built-in one can't, but don't quote me on that.) But even if you did that, it wouldn't get you very far, because the XML code itself is going to try to interpret the invalid UTF-8, and then it needs to know what you want to do with errors, and that's usually going to be harder to specify because it's farther down.
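For what it's worth, here is a small sketch of the binary-mode route. It assumes a recent Python 3, where the built-in iterparse() does accept a binary file object; io.BytesIO stands in for open(filename, 'rb'), and the sample documents are made up. It also illustrates the point above: with invalid bytes in the input, the parser itself rejects the data.

```python
import io
import xml.etree.ElementTree as etree

def end_tags(data):
    """Return the tags of all 'end' events produced while parsing the bytes."""
    # io.BytesIO stands in for a file opened with open(filename, 'rb').
    source = io.BytesIO(data)
    return [elem.tag for _, elem in etree.iterparse(source, events=('end',))]

good = b'<root><foo>abcd</foo><bar/></root>'
bad = b'<root><foo>ab\xffcd</foo><bar/></root>'  # 0xFF is not valid UTF-8

print(end_tags(good))

try:
    end_tags(bad)
    parse_failed = False
except etree.ParseError as exc:
    # The XML parser, not open(), is now the component that complains.
    parse_failed = True
    print('parser rejected the input:', exc)
```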
One last point:
Since the error occurs in between for loop iterations, the only place I can wrap a try block is outside the for loop, which would mean I cannot continue to the next XML element.
Well, you don't actually have to use a for loop; you can transform it into a while loop with explicit next calls. Any time you need to do this, it's usually a sign that you're doing something wrong, but sometimes it's a sign that you're dealing with a broken library, and it's the only workaround available.
This:
for (event, elem) in xmlit: # line 26
    doStuffWith(event, elem)
Is effectively equivalent to:
while True:
    try:
        event, elem = next(xmlit)
    except StopIteration:
        break
    doStuffWith(event, elem)
And now, there is an obvious place to add a try, although you don't even really need to; you can just attach another except to the existing try.
The problem is, what are you going to do here? There is no guarantee that the iterator will be able to continue after it throws an exception. In fact, all of the simplest ways to create iterators will not be able to do so. You can test for yourself whether that's true in this case.
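To make the concern concrete, here is a minimal sketch using a generator, which is one of the simplest ways to create an iterator. The numbers() function is a made-up example; the point is only that once a generator has raised, it is finished for good.

```python
def numbers():
    """A hypothetical generator that fails partway through."""
    yield 1
    raise ValueError('bad item')
    yield 2  # never reached

it = numbers()
first = next(it)

try:
    next(it)
except ValueError:
    pass  # the generator raised, and is now permanently exhausted

# Any further next() call raises StopIteration, so the default is returned.
after_error = next(it, 'exhausted')
print(first, after_error)
```

Running the same kind of experiment against the iterator returned by iterparse() on a small broken file will tell you which behavior you are dealing with.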
In the rare cases when you need to do this, and it actually helps, you'd probably want to wrap it up. Something like this:
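import logging

def skip_exceptions(it):
    while True:
        try:
            yield next(it)
        except StopIteration:
            # Return rather than re-raise: since Python 3.7 (PEP 479), a
            # StopIteration escaping a generator becomes a RuntimeError.
            return
        except Exception as e:
            logging.info('Skipping iteration because of exception {}'.format(e))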
Then you just do:
for (event, elem) in skip_exceptions(xmlit):
    doStuffWith(event, elem)
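A sketch of a case where such a wrapper does help: an iterator that keeps its own state and can therefore carry on after raising. FlakyCounter is a hypothetical class written only for this illustration, and the wrapper is repeated here so the snippet runs on its own.

```python
import logging

def skip_exceptions(it):
    # Same idea as the wrapper above, ending cleanly on StopIteration.
    while True:
        try:
            yield next(it)
        except StopIteration:
            return
        except Exception as e:
            logging.info('Skipping iteration because of exception %s', e)

class FlakyCounter:
    """Hypothetical iterator: raises on multiples of 3 but keeps its state."""

    def __init__(self, stop):
        self.n = 0
        self.stop = stop

    def __iter__(self):
        return self

    def __next__(self):
        self.n += 1
        if self.n > self.stop:
            raise StopIteration
        if self.n % 3 == 0:
            raise ValueError('bad item {}'.format(self.n))
        return self.n

# The failing items (3 and 6) are logged and skipped; the rest come through.
survivors = list(skip_exceptions(FlakyCounter(7)))
print(survivors)
```

If the wrapped iterator cannot resume, the same wrapper simply ends the loop at the first error, so it is worth logging at a level you will actually see.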