Reputation: 1195
I'm reading and parsing an Amazon XML file, and while the XML file shows a ’ (a right single quotation mark), when I try to print it I get the following error:
'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128)
From what I've read online thus far, the error is coming from the fact that the XML file is in UTF-8, but Python wants to handle it as an ASCII encoded character. Is there a simple way to make the error go away and have my program print the XML as it reads?
Upvotes: 116
Views: 299278
Reputation: 44828
Likely, your problem is that you parsed it okay, and now you're trying to print the contents of the XML and you can't because there are some non-ASCII Unicode characters. Try encoding your Unicode string as ASCII first:
unicodeData.encode('ascii', 'ignore')
The 'ignore' part tells it to just skip those characters. From the Python docs:
>>> # Python 3 shown; on Python 2 use: u = unichr(40960) + u'abcd' + unichr(1972)
>>> u = chr(40960) + u'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
You might want to read this article: http://www.joelonsoftware.com/articles/Unicode.html, which I found very useful as a basic tutorial on what's going on. After reading it, you'll stop feeling like you're just guessing which commands to use (or at least that's what happened to me).
Upvotes: 204
Reputation: 45
Python 3.5, 2018
If you don't know what the encoding is but the Unicode parser is having issues, you can open the file in Notepad++ and select Encoding->Convert to ANSI in the top bar. Then you can write your Python like this:
with open('filepath', 'r', encoding='ANSI') as file:
    for word in file.read().split():
        print(word)
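If converting the file by hand isn't practical, another option is to name the encoding you expect and let Python substitute a marker for anything it can't decode. This is only a sketch: it assumes the file is mostly UTF-8, and `read_words` is a made-up helper name, not something from the answer above.

```python
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Return the whitespace-separated words in a text file.

    errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    """
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return handle.read().split()


# Demo: write a small UTF-8 file containing a right single quote (U+2019).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("it\u2019s fine".encode("utf-8"))
    demo_path = tmp.name

words = read_words(demo_path)
os.remove(demo_path)

# ascii() escapes non-ASCII characters, so this prints on any terminal.
print(ascii(words))
```

Using `ascii()` for the final print keeps the demo from failing on a terminal that can't display U+2019.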
Upvotes: -2
Reputation: 4043
If you need to print an approximate representation of the string to the screen, rather than ignoring those non-ASCII characters, try the unidecode package:
https://pypi.python.org/pypi/Unidecode
The explanation is found here:
https://www.tablix.org/~avian/blog/archives/2009/01/unicode_transliteration_in_python/
This is better than using u.encode('ascii', 'ignore') for a given string u, and can save you unnecessary headaches if character precision is not what you are after but you still want human readability.
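If installing a package isn't an option, a much rougher stand-in is possible with the standard library alone. Note this is not what unidecode does: it only strips accents via NFKD decomposition, and characters with no decomposition (such as the U+2019 quote from the question) are simply dropped. `rough_ascii` is a hypothetical name for this sketch.

```python
import unicodedata


def rough_ascii(text):
    """Strip accents by decomposing characters, then drop whatever is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(rough_ascii("Caf\u00e9 Zo\u00eb"))
print(rough_ascii("it\u2019s"))
```

The second call shows the limitation: the curly quote disappears rather than becoming a plain apostrophe, which is the kind of case where a real transliteration table helps.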
Wirawan
Upvotes: 0
Reputation: 2444
Excellent post : http://www.carlosble.com/2010/12/understanding-python-and-unicode/
# -*- coding: utf-8 -*-
def __if_number_get_string(number):
    converted_str = number
    if isinstance(number, int) or \
            isinstance(number, float):
        converted_str = str(number)
    return converted_str

def get_unicode(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')

def get_string(strOrUnicode, encoding='utf-8'):
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode
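The helpers above are Python 2 code (they rely on the `unicode` type). On Python 3 the same idea might look roughly like the following; `to_text` and `to_bytes` are names invented for this sketch, and treating every non-string value with `str()` is an assumption about what you want.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes and skipping undecodable ones."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding, errors="ignore")
    # Numbers and other objects fall back to their str() form.
    return str(value)


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding text with the given encoding."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value)
    return to_text(value).encode(encoding)


print(ascii(to_text(b"it\xe2\x80\x99s")))
print(to_bytes("it\u2019s"))
```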
Upvotes: 2
Reputation: 223
Try adding the following line at the top of your python script.
# _*_ coding:utf-8 _*_
Upvotes: -1
Reputation: 9
I wrote the following to fix the nuisance non-ascii quotes and force conversion to something usable.
unicodeToAsciiMap = {u'\u2019':"'", u'\u2018':"`", }

def unicodeToAscii(inStr):
    try:
        return str(inStr)
    except:
        pass
    outStr = ""
    for i in inStr:
        try:
            outStr = outStr + str(i)
        except:
            if unicodeToAsciiMap.has_key(i):
                outStr = outStr + unicodeToAsciiMap[i]
            else:
                try:
                    print "unicodeToAscii: add to map:", i, repr(i), "(encoded as _)"
                except:
                    print "unicodeToAscii: unknown code (encoded as _)", repr(i)
                outStr = outStr + "_"
    return outStr
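That function is Python 2 only (`has_key`, print statements). For what it's worth, a Python 3 take on the same mapping idea can lean on `str.translate`. This is a sketch, not a drop-in port: it keeps the two quote mappings and the "_" placeholder from above but leaves out the "add to map" logging, and the function name is my own.

```python
# Same two mappings as the answer above: right quote -> ', left quote -> `
QUOTE_MAP = str.maketrans({"\u2019": "'", "\u2018": "`"})


def unicode_to_ascii(text, placeholder="_"):
    """Map known quotes to ASCII and replace any other non-ASCII character."""
    translated = text.translate(QUOTE_MAP)
    return "".join(ch if ord(ch) < 128 else placeholder for ch in translated)


print(unicode_to_ascii("it\u2019s a \u2018test\u2019"))
print(unicode_to_ascii("caf\u00e9"))
```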
Upvotes: 0
Reputation: 414875
Don't hardcode the character encoding of your environment inside your script; print Unicode text directly instead:
assert isinstance(text, unicode) # or str on Python 3
print(text)
If your output is redirected to a file (or a pipe), you could use the PYTHONIOENCODING envvar to specify the character encoding:
$ PYTHONIOENCODING=utf-8 python your_script.py >output.utf8
Otherwise, python your_script.py should work as is -- your locale settings are used to encode the text (on POSIX check: LC_ALL, LC_CTYPE, LANG envvars -- set LANG to a utf-8 locale if necessary).
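To see which encoding Python actually picked up from the environment, a small diagnostic along these lines can help. It changes nothing; it only reports what `sys.stdout` and the locale say. `describe_output_encoding` is a hypothetical helper name for this sketch.

```python
import locale
import sys


def describe_output_encoding(stream=None):
    """Report the encoding settings Python will use when printing text."""
    stream = sys.stdout if stream is None else stream
    return {
        "stream_encoding": getattr(stream, "encoding", None),
        "stream_errors": getattr(stream, "errors", None),
        "locale_encoding": locale.getpreferredencoding(False),
    }


for name, value in describe_output_encoding().items():
    print(name, "=", value)
```

Running it once normally and once with PYTHONIOENCODING=utf-8 set should show whether the envvar is taking effect in your setup.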
Upvotes: 9
Reputation: 758
A better solution:
if type(value) == str:
    # Ignore errors even if the string is not proper UTF-8 or has
    # broken marker bytes.
    # Python built-in function unicode() can do this.
    value = unicode(value, "utf-8", errors="ignore")
else:
    # Assume the value object has proper __unicode__() method
    value = unicode(value)
If you would like to read more about why:
http://docs.plone.org/manage/troubleshooting/unicode.html#id1
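One thing worth knowing before settling on errors="ignore": it drops bad bytes silently, whereas errors="replace" leaves a visible U+FFFD marker. A quick Python 3 comparison, using a made-up byte string with one invalid byte in it:

```python
# b"\xff" can never appear in valid UTF-8, so it triggers the error handler.
raw = b"caf\xc3\xa9\xff!"

ignored = raw.decode("utf-8", errors="ignore")    # bad byte vanishes
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

print(ascii(ignored))
print(ascii(replaced))
```

Which one is preferable depends on whether you would rather hide the damage or be able to spot it later.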
Upvotes: 17
Reputation: 131790
You can use something of the form
s.decode('utf-8')
which will convert a UTF-8 encoded bytestring into a Python Unicode string. But the exact procedure to use depends on exactly how you load and parse the XML file; e.g., if you don't ever access the XML string directly, you might have to use a decoder object from the codecs module.
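As an illustration of the decoder-object route, here is a sketch using an incremental decoder from `codecs`. It assumes the data arrives as UTF-8 byte chunks that may split a character in the middle; `decode_chunks` and the sample chunks are invented for the example.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, tolerating splits inside a character."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder and raises if a character was left incomplete.
    pieces.append(decoder.decode(b"", final=True))
    return "".join(pieces)


# The three bytes of U+2019 are split across two chunks on purpose.
chunks = [b"it\xe2", b"\x80\x99s"]
text = decode_chunks(chunks)
print(ascii(text))
```

Calling `.decode('utf-8')` on each chunk separately would fail here, because the first chunk ends partway through a multi-byte character.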
Upvotes: 0