Reputation: 26288
Consider this function:
def escape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        try:
            c = c.decode('ascii')
        except UnicodeDecodeError:
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)
It should replace every non-ASCII character with the corresponding entity from htmlentitydefs. Unfortunately, Python throws
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
when the variable text contains the string whose repr() is u'Tam\xe1s Horv\xe1th'.
But I don't use str.encode(); I only use str.decode(). Am I missing something?
Upvotes: 4
Views: 10602
Reputation: 1
I found a solution on this site:
import sys
reload(sys)  # restores setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding("latin-1")
a = u'\xe1'
print str(a)  # no exception
Upvotes: 0
Reputation: 12075
This answer always works for me when I have this problem:
def byteify(input):
    '''
    Removes unicode encodings from the given input string.
    '''
    if isinstance(input, dict):
        return {byteify(key): byteify(value) for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
From the question "How to get string objects instead of Unicode ones from JSON in Python?"
Upvotes: 2
Reputation: 11624
The error report is misleading, and it comes from the way Python handles the encoding/decoding process. You tried to decode an already decoded string a second time; that confuses Python, which retaliates by confusing you in turn! ;-) As far as I know, encoding and decoding are handled by the codecs module, and somewhere in there lies the origin of these misleading exception messages.
You can check for yourself. Either
u'\x80'.encode('ascii')
or
u'\x80'.decode('ascii')
will throw a UnicodeEncodeError, whereas
u'\x80'.encode('utf8')
will not, but
u'\x80'.decode('utf8')
will again!
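The hidden step behind these results can be made visible with a small sketch. It assumes that Python 2 first encodes a unicode object with the default ASCII codec before it decodes it; the helper names `decode_like_python2` and `error_name` are hypothetical, and the two steps are written out explicitly so the snippet also runs on Python 3, where strings have no decode method.

```python
def decode_like_python2(text, codec):
    """Hypothetical helper: mimic Python 2's unicode.decode(codec).

    Assumption: the unicode object is first implicitly encoded with the
    default codec ('ascii'), and only then decoded with `codec`.
    """
    raw = text.encode('ascii')   # hidden step; fails for non-ASCII text
    return raw.decode(codec)     # the step that was actually requested


def error_name(text, codec):
    """Return the name of the exception raised, or None on success."""
    try:
        decode_like_python2(text, codec)
    except UnicodeError as exc:
        return type(exc).__name__
    return None


print(error_name(u'\x80', 'ascii'))  # non-ASCII input
print(error_name(u'\x80', 'utf8'))   # non-ASCII input, different target codec
print(error_name(u'abc', 'ascii'))   # pure ASCII input
```

Both non-ASCII calls fail in the encode step, whichever codec is passed for decoding, which is why the exception is a UnicodeEncodeError rather than a UnicodeDecodeError.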
I guess you are confused by the meaning of encoding and decoding. To put it simply:
                     decode             encode
ByteString (ascii) ---------> UNICODE ---------> ByteString (utf8)
                     codec              codec
But why is there a codec argument for the decode method? Well, the underlying function cannot guess which codec the ByteString was encoded with, so it takes codec as a hint. If no codec is provided, it implicitly uses sys.getdefaultencoding().
So when you use c.decode('ascii'), you are saying that a) you have an encoded ByteString (that's why you use decode), b) you want to get a unicode object back (that's what decode is for), and c) the codec in which the ByteString is encoded is ASCII.
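A minimal round trip through both directions might look like this. The sample bytes and the choice of UTF-8 and Latin-1 are illustrative assumptions; `b''` and `u''` literals are used so the snippet runs on both Python 2 and Python 3.

```python
# Bytes as they might arrive from a file or socket, encoded as UTF-8.
raw_utf8 = b'Tam\xc3\xa1s'

# decode: byte string -> unicode; the codec names how the bytes were encoded.
text = raw_utf8.decode('utf-8')

# encode: unicode -> byte string, in whichever codec the consumer expects.
raw_latin1 = text.encode('latin-1')

print(repr(text))
print(repr(raw_latin1))
```

The same character takes two bytes in the UTF-8 input and one byte in the Latin-1 output, while the unicode object in the middle is independent of either codec.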
See also:
https://stackoverflow.com/a/370199/1107807
http://docs.python.org/howto/unicode.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror
Upvotes: 11
Reputation: 19377
Python has two types of strings: character-strings (the unicode type) and byte-strings (the str type). The code you have pasted operates on byte-strings. You need a similar function to handle character-strings.
Maybe this:
import htmlentitydefs

def uescape(text):
    print repr(text)
    escaped_chars = []
    for c in text:
        # Replace anything outside printable ASCII with its named entity.
        if (ord(c) < 32) or (ord(c) > 126):
            c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
        escaped_chars.append(c)
    return ''.join(escaped_chars)
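One caveat with looking names up in codepoint2name is that many code points (control characters, for instance) have no named entity, so the lookup can raise a KeyError. A possible variant, sketched here under the hypothetical name `uescape_safe`, falls back to a numeric character reference in that case. The import is written so the snippet runs on both Python 2 and Python 3.

```python
try:
    from htmlentitydefs import codepoint2name    # Python 2
except ImportError:
    from html.entities import codepoint2name     # Python 3


def uescape_safe(text):
    """Hypothetical variant: escape everything outside printable ASCII."""
    escaped_chars = []
    for c in text:
        code = ord(c)
        if code < 32 or code > 126:
            name = codepoint2name.get(code)
            if name is not None:
                c = u'&{};'.format(name)     # named entity, e.g. &aacute;
            else:
                c = u'&#{};'.format(code)    # numeric fallback
        escaped_chars.append(c)
    return u''.join(escaped_chars)


print(uescape_safe(u'Tam\xe1s Horv\xe1th'))
```

Like the function it is based on, this only handles characters outside printable ASCII; it does not escape markup characters such as & or <.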
I do wonder whether either function is truly necessary for you. If it were me, I would choose UTF-8 as the character encoding for the result document, process the document in character-string form (without worrying about entities), and perform a content.encode('UTF-8') as the final step before delivering it to the client. Depending on the web framework of choice, you may even be able to deliver character-strings directly to the API and have it figure out how to set the encoding.
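As a rough sketch of that approach, assuming a hypothetical `render_body` function and a hand-written Content-Type value (a real web framework would usually set the header itself):

```python
def render_body(name):
    """Hypothetical page builder: works purely on character strings."""
    return u'<p>Hello, {}!</p>'.format(name)


content = render_body(u'Tam\xe1s Horv\xe1th')

# Encode exactly once, as the last step before the bytes leave the program.
payload = content.encode('UTF-8')

# Assumed header value; the declared charset must match the encoding used.
content_type = 'text/html; charset=UTF-8'

print(repr(payload))
```

Everything before the final encode call deals only with character-strings, so no entity escaping is needed for non-ASCII text.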
Upvotes: 2
Reputation: 600041
You're passing a string that's already unicode. So, before Python can call decode on it, it has to actually encode it, and it does so by default using the ASCII encoding.
Edit to add: It depends on what you want to do. If you simply want to convert a unicode string with non-ASCII characters into an HTML-encoded representation, you can do it in one call: text.encode('ascii', 'xmlcharrefreplace').
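Applied to the string from the question, that call could be used like this (a minimal sketch; the `u''` literal keeps it runnable on both Python 2 and Python 3):

```python
text = u'Tam\xe1s Horv\xe1th'

# Characters the ASCII codec cannot represent become numeric references.
escaped = text.encode('ascii', 'xmlcharrefreplace')

print(escaped)
```

Note that the result is a byte string, and that it uses numeric character references (&#225;) rather than named entities (&aacute;); browsers render both the same way.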
Upvotes: 5