nlper

Reputation: 2407

'ascii' codec can't encode character u'\xe9'

I have already tried the solutions from previous answers.

I am trying to use the following value, which gives me an encoding-related error.

ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']

So I tried,

d = [x.decode('utf-8') for x in ar]

which gives:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)

I then tried

d = [x.encode('utf-8') for x in ar]

which removes the error but changes the original content.

The original value was u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', which was converted to 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno' by encode.

What is the correct way to deal with this scenario?

Edit

The error occurs when I pass these links to

req = urllib2.Request()

Upvotes: 1

Views: 14691

Answers (3)

Alastair McCormack

Reputation: 27744

In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII-safe way to represent a Unicode string. When encoded in a form that can represent these characters, such as UTF-8 (which covers all of Unicode, not only Western European characters), it reads: http://dbpedia.org/resource/José_Elías_Moreno

Your .encode("UTF-8") is correct and would have looked fine in a UTF-8 editor or browser. What you saw after the encode was an ASCII-safe representation of the UTF-8 bytes.

For example, your trouble chars were é and í.

é = U+00E9 (Unicode code point) = C3 A9 (UTF-8 bytes)
í = U+00ED (Unicode code point) = C3 AD (UTF-8 bytes)
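If you want to check this yourself, here is a minimal sketch. It assumes Python 3, where str plays the role of Python 2's unicode type and the result of .encode() is a bytes object; the variable names are only for illustration.

```python
# Python 3 sketch: str here corresponds to Python 2's unicode type.
url = u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno'

# Encoding yields bytes; their repr shows \xc3\xa9 where the text has é.
encoded = url.encode('utf-8')

# The two "trouble" characters and their UTF-8 byte sequences.
e_acute = u'\xe9'.encode('utf-8')
i_acute = u'\xed'.encode('utf-8')

# Decoding the bytes recovers the original text, so nothing was lost.
round_tripped = encoded.decode('utf-8')
```

The escapes you saw are just how the bytes are displayed; decoding them gives back exactly the string you started with.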

In short, your .encode() method is correct and should be used for writing to files or to a browser.

Upvotes: 0

Sid Shukla

Reputation: 1030

Unicode strings in Python are "raw" Unicode, so make sure to .encode() and .decode() them as appropriate. UTF-8 is widely considered the best-practice encoding. To percent-encode a URL, use the quote function (it lives in urllib; urllib2 re-exports it):

from urllib import quote  # Python 2; urllib2 also exposes this name
# Keep ':' and '/' unescaped so the scheme and path separators survive.
escaped_string = quote(unicode_string.encode('utf-8'), safe=':/')

To decode, use unquote:

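from urllib import unquote  # Python 2; urllib2 also exposes this name
# unquote turns any percent-escapes into raw bytes; decode those as UTF-8.
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')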
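For reference, here is a rough sketch of the same round trip, assuming Python 3, where these functions live in urllib.parse and urllib2.Request has become urllib.request.Request. The safe=':/' argument is my own choice, so that the scheme and slashes are left alone. No network request is made.

```python
from urllib.parse import quote, unquote
from urllib.request import Request

url = u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno'

# Percent-encode the UTF-8 bytes; ':' and '/' are kept as-is (an assumption
# that suits a full URL rather than a single path segment).
escaped = quote(url.encode('utf-8'), safe=':/')

# The escaped URL is plain ASCII, so it can be handed to Request.
# Constructing the object does not open a connection.
req = Request(escaped)

# unquote decodes the percent-escapes as UTF-8 by default.
restored = unquote(escaped)
```

Without the safe argument, quote would also escape the colon after http, which is usually not what you want for a complete URL.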

Also, if you're more interested in Unicode and UTF-8 work, check out Unicode HOWTO and

Upvotes: 2

bruno desthuilliers

Reputation: 77912

The second version of your string is the correct UTF-8 representation of your original Unicode string. If you want a meaningful comparison, you have to use the same representation for both the stored string and the user input string. The sane thing to do here is to always use Unicode strings internally (in your code), and to make sure both your user inputs and your stored strings are correctly decoded to Unicode from their respective encodings at your system's boundaries (the storage subsystem and the user input subsystem).
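As a minimal sketch of that idea, assuming Python 3 (str is the Unicode type, bytes is the encoded form): to_text is a hypothetical helper, not a library function, and the two sample values simply stand in for a stored string and a user input.

```python
def to_text(value, encoding='utf-8'):
    """Hypothetical boundary helper: decode bytes, pass text through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Stand-in for a value read from storage as UTF-8 bytes.
stored = b'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno'
# Stand-in for a value that already arrived as Unicode text.
user_input = u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno'

# Normalise both at the boundary, then compare text with text.
same = to_text(stored) == to_text(user_input)
```

Once everything inside your code is Unicode, the comparison behaves as expected, and you only encode again when writing the value back out.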

Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.

Upvotes: 2
