Bin Chen

Reputation: 63359

Python unicode: why does it work on one machine but sometimes fail on another?

I find Unicode in Python really troublesome. Why doesn't Python use UTF-8 for all strings? I am in China, so I have to use Chinese strings that can't be represented in ASCII. I use u'' to denote such a string. It works well on my Ubuntu machine, but on another Ubuntu machine (a VPS provided by linode.com) it sometimes fails. The error is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

The code I am using is:

self.talk(user.record["fullname"] + u"准备好了")

Upvotes: 4

Views: 2986

Answers (4)

Glen Bizeau

Reputation: 73

It took me a long time, but I found it.

Look at the output of printenv, especially LANG:

LANG=en_CA <- server 2 (not working)

LANG=en_US.UTF-8 <- server 1 (working; on Linode, coincidentally)

Set the new locale:

sudo update-locale LANG=en_US.UTF-8 LANGUAGE

Log out, back in, bob's your uncle :)
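To compare what Python itself sees on the two servers, a small sketch like the following may help. It only reports settings and changes nothing; the helper name encoding_report is made up for illustration, and which of these values actually matters depends on where the failing decode happens.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the encoding-related settings Python sees (illustrative helper)."""
    return {
        "LANG": os.environ.get("LANG"),  # None if the variable is unset
        "default encoding": sys.getdefaultencoding(),
        "preferred encoding": locale.getpreferredencoding(False),
        "stdout encoding": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for name in sorted(report):
    print("%s: %s" % (name, report[name]))
```

Running it on both machines and comparing the output line by line shows whether the environments really differ.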

Upvotes: 0

mouad

Reputation: 70059

The famous UnicodeDecodeError shows up when you do string manipulation like the one you just did:

user.record["fullname"] + u" 准备好了"

Here you are concatenating a str with a unicode, so Python implicitly coerces the str to unicode before concatenating. The coercion is done like this:

unicode(user.record["fullname"]) + u" 准备好了"
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
         Problem

And there is the problem: when you call unicode(something), Python decodes the string using the default encoding, which is ASCII in Python 2.*. If your string user.record["fullname"] happens to contain a non-ASCII character, it raises the famous UnicodeDecodeError.
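The failure can be reproduced without the implicit coercion by decoding UTF-8 bytes as ASCII explicitly. This is only a sketch: the bytes below are a hypothetical stand-in for user.record["fullname"] (the UTF-8 encoding of U+9648), and decode_as_ascii is an illustrative helper, not part of the original code.

```python
# Hypothetical UTF-8 bytes standing in for user.record["fullname"]
raw_name = b"\xe9\x99\x88"  # UTF-8 encoding of u"\u9648"


def decode_as_ascii(data):
    """Return the error message from an ASCII decode, or None on success."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


message = decode_as_ascii(raw_name)
print(message)
```

The first byte, 0xe9, is outside the ASCII range, which is why the error in the question points at position 0.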

So how can you solve it:

# Decode the str to unicode using the right encoding.
# I used utf-8 here because it is usually the right one, but it may not be (another problem!).
a = user.record["fullname"].decode('utf-8')

self.talk(a + u" 准备好了")

PS: In Python 3 the default encoding is UTF-8, and you can't concatenate a str (Unicode text) with bytes (the equivalent of the Python 2 str), so there is no more implicit coercion.
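A minimal sketch of that Python 3 behaviour, assuming the name arrives as UTF-8 bytes (the value is hypothetical). On Python 3 the mixed concatenation raises TypeError instead of guessing an encoding; on Python 2 the same line would succeed silently, because both operands are str there.

```python
name_bytes = b"\xe9\x99\x88"  # hypothetical UTF-8 bytes, as in the question

try:
    name_bytes + " ready"  # bytes + str: Python 3 refuses to guess an encoding
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding explicitly first gives text + text, which is always allowed
greeting = name_bytes.decode("utf-8") + u" ready"
print(mixed_ok)
```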

Upvotes: 12

Rosh Oxymoron

Reputation: 21065

You need to decode all non-Unicode strings as early as possible. Try to ensure that you have no UTF-8 bytestrings stored anywhere in memory, only unicode objects. For example, make sure that the elements of user.record are all converted to unicode on creation, so you don't get any errors like this one. Or just use Python 3, where it's hard to mix them.
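One way to do that is to decode every byte-string value when the record is built. The sketch below assumes the record is a plain dict and that the incoming bytes are UTF-8; to_text, decode_record and the sample data are hypothetical names for illustration, not part of the original code.

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; pass everything else through unchanged."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


def decode_record(record, encoding="utf-8"):
    """Return a copy of a dict with all byte-string values decoded."""
    return dict((key, to_text(val, encoding)) for key, val in record.items())


# Hypothetical record as it might arrive from a database or socket
raw_record = {"fullname": b"\xe9\x99\x88", "age": 30}
record = decode_record(raw_record)

message = record["fullname"] + u" 准备好了"  # text + text, no implicit coercion
```

With the decoding done once at the boundary, the rest of the code only ever concatenates unicode with unicode.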

Upvotes: 1

ismail

Reputation: 47662

Because in Python 2.x the default encoding is ASCII unless it's changed manually. Here is a crude hack to include in your script before any other code:

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

This will change Python's default encoding to UTF-8.

Upvotes: 0
