Reputation: 43437
I am having trouble with unicode in a script I am writing. I have scoured the internet, including this site, and I have tried many things, but I still have no idea what is wrong.
My code is very long, but I will show an excerpt from it:
raw_results = get_raw(args)
write_raw(raw_results)
parsed_results = parse_raw(raw_results)
write_parsed(parsed_results)
Basically, I get raw results, which are XML encoded in UTF-8. Writing the raw data works without problems, but writing the parsed data does not. So I am pretty sure the problem is inside the function that parses the data.
I tried everything and I do not understand what the problem is. Even this simple line gives me an error:
def parse_raw(raw_results):
    content = raw_results.replace(u'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>', u'')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 570: ordinal not in range(128)
Ideally I would love to be able to work with unicode and have no problems, but I also have no issue with replacing/ignoring any unicode and using only regular text. I know I have not provided my full code, but understand that it's a problem since it's work-related. But I hope this is enough to get me some help.
Edit: the top part of my parse_raw function:
from xml.etree.ElementTree import XML, fromstring, tostring
def parse_raw(raw_results):
    raw_results = raw_results.decode("utf-8")
    content = raw_results.replace('<?xml version="1.0" encoding="UTF-8" standalone="yes"?>', '')
    content = "<root>\n%s\n</root>" % content
    mxml = fromstring(content)
Edit2: I think it is worth pointing out that the code works fine unless there are special characters. When the data is 100% English there is no problem; the issues arise only when foreign or accented letters are involved.
Upvotes: 0
Views: 3383
Reputation: 43437
Thank you everyone for the input and the nudges. I have subsequently solved my own problem by going over my code for the millionth time with a fine-toothed comb, and I have found the culprit. And I have solved all my problems now.
For anyone with a similar problem, I have the following information that could help you:
Use the codecs module for writing your files.
My problem was that at a certain point I was trying to turn unicode into unicode, and in another place I was trying to turn normal ASCII into ASCII again. So whenever I solved one issue, another arose, and I figured it was the same problem.
Break your issue into sections... and then you might find your problem!
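As an illustration of the two points above (decode only once, and let the codecs module do the encoding on write), here is a minimal sketch. The helper names to_text and write_text are hypothetical and are not taken from my actual script; the UTF-8 default is an assumption.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def to_text(value, encoding="utf-8"):
    """Decode byte strings once; return already-decoded text unchanged.

    Hypothetical helper: it guards against decoding unicode a second time.
    """
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


def write_text(path, value, encoding="utf-8"):
    """Write text to path, letting codecs.open do the encoding."""
    # The file object encodes on write, so the caller never calls .encode().
    with codecs.open(path, "w", encoding=encoding) as handle:
        handle.write(to_text(value, encoding))


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "parsed.txt")
    write_text(path, b"\xd7\x90")  # raw UTF-8 bytes are decoded once
    write_text(path, u"\u05d0")    # text that is already decoded is passed through
```

Passing every value through one helper like this makes it harder to decode or encode the same data twice in different parts of a long script.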
Upvotes: 0
Reputation: 879291
raw_results is probably a str object, not a unicode object.
raw_results.replace(u'...', ...) causes Python to first decode the str raw_results into a unicode. Python2 uses the ascii codec by default. raw_results contains the byte '\xd7' at position 570, which is not decodable by the ascii codec (i.e., it is not an ascii character).
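To see which byte the ascii codec trips over in your own data, a small diagnostic along these lines may help. The function name first_non_ascii is made up for this sketch; it only reads the start attribute that UnicodeDecodeError provides.

```python
def first_non_ascii(data):
    """Return (position, byte) of the first byte the ascii codec rejects.

    Returns None when the whole byte string is plain ASCII.
    """
    try:
        data.decode("ascii")
    except UnicodeDecodeError as err:
        # err.start is the same position the error message reports.
        return err.start, data[err.start:err.start + 1]
    return None


if __name__ == "__main__":
    print(first_non_ascii(b"abc\xd7\x90"))
    print(first_non_ascii(b"plain ascii"))
```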
Here is a demonstration of how this error might occur:
In [27]: '\xd7'.replace(u'a',u'b')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)
Whereas if raw_results were unicode, there would be no silent decoding with ascii, and therefore no error would occur:
In [28]: u'\xd7'.replace(u'a',u'b')
Out[28]: u'\xd7'
You can fix this problem by decoding raw_results explicitly, provided you know the appropriate codec:
raw_results = raw_results.decode('latin-1')
latin-1 is just a guess. It might be correct if the character at position 570 is a multiplication symbol:
In [26]: print('\xd7'.decode('latin-1'))
×
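If the data really is UTF-8, as the question states, then 0xd7 is the first byte of a two-byte sequence (a Hebrew letter, for instance) rather than a multiplication sign, and 'utf-8' would be the codec to use. Here is a rough sketch of how the explicit decode could fit into the question's function. It reuses the name parse_raw, but the body is only an illustration: the encoding default, the XML_DECLARATION constant, and the re-encoding before fromstring are assumptions, not the asker's actual code.

```python
# -*- coding: utf-8 -*-
from xml.etree.ElementTree import fromstring

XML_DECLARATION = u'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'


def parse_raw(raw_results, encoding="utf-8"):
    # Decode once, explicitly, so no implicit ascii decode happens later.
    if isinstance(raw_results, bytes):  # bytes is an alias of str on Python 2
        raw_results = raw_results.decode(encoding)
    # Both operands are now unicode, so replace() needs no conversion.
    content = raw_results.replace(XML_DECLARATION, u"")
    content = u"<root>\n%s\n</root>" % content
    # Encode explicitly before parsing; the parser reads undeclared bytes as UTF-8.
    return fromstring(content.encode("utf-8"))


if __name__ == "__main__":
    sample = (XML_DECLARATION + u"<item>\u05d0</item>").encode("utf-8")
    root = parse_raw(sample)
    print(root.tag, len(root))
```

The idea is to keep bytes at the edges (reading, parsing, writing) and unicode everywhere in between.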
Upvotes: 3