Python Encoding Error, not unicode string

Question

How do I get rid of the "u" without causing other encoding problems?

u"Example Characters : \xc3\xa9 \xc3\xa0"

Here is what it prints:

Example Characters : Ã© Ã

Instead of:

Example Characters : é à

I encounter this problem when using getText() on a BeautifulSoup element. (The webpage is in UTF-8)

Martijn Pieters · Accepted Answer

You have Mojibake: the input was decoded with the wrong codec.

You most likely passed a Unicode string to BeautifulSoup(). Don't do this, leave decoding to BeautifulSoup.

For example, if you used requests, pass response.content, not response.text, to BeautifulSoup(). Otherwise you run the risk of the result being decoded as Latin-1, the default encoding for text responses over HTTP when the headers don't mention an explicit character set. If you used urllib2, don't decode first.
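To see why a Latin-1 decode produces exactly the string from the question, here is a minimal sketch using only the standard library. The variable names (raw, wrong, right) are illustrative and not part of the original post; the bytes stand in for what a server would send for a UTF-8 page.

```python
# -*- coding: utf-8 -*-
# Bytes as they would arrive over the wire for a UTF-8 encoded page.
raw = u"Example Characters : \xe9 \xe0".encode("utf-8")

# Decoding those bytes as Latin-1 maps every single byte to one character,
# so each two-byte UTF-8 sequence turns into two unrelated characters.
wrong = raw.decode("latin-1")

# Decoding with the codec the page was actually written in.
right = raw.decode("utf-8")

print(wrong == u"Example Characters : \xc3\xa9 \xc3\xa0")  # True
print(right == u"Example Characters : \xe9 \xe0")          # True
```

Handing the undecoded bytes to BeautifulSoup() avoids the first path, since no Latin-1 guess is ever applied to them.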

BeautifulSoup detects the encoding and decodes for you; it'll use HTML tags if present. UTF-8 should be autodetected correctly. If you know the encoding up front and BeautifulSoup got it wrong anyway, use from_encoding to specify the correct encoding:

soup = BeautifulSoup(htmlsource, from_encoding='utf8')

See the Encodings section of the BeautifulSoup documentation.

If, after all that, you are still getting Mojibake results, then the web page itself has produced data with incorrectly encoded values. In that case you can undo the error with:

mojibake_string.encode('latin1').decode('utf8')

This re-interprets the characters in the correct encoding:

>>> u"Example Characters : \xc3\xa9 \xc3\xa0".encode('latin1').decode('utf8')
u'Example Characters : \xe9 \xe0'
>>> print _
Example Characters : é à
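If you need to apply this repair to text that may or may not be damaged, the one-liner can be wrapped in a guard. This is only a sketch: fix_mojibake is a hypothetical helper name, and it assumes the damage is specifically UTF-8 bytes misread as Latin-1. Text that doesn't round-trip is returned untouched.

```python
def fix_mojibake(text):
    """Undo 'UTF-8 decoded as Latin-1' damage; return text unchanged otherwise."""
    try:
        # Recover the original bytes, then decode them with the right codec.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the recovered
        # bytes are not valid UTF-8; in both cases this repair does not apply.
        return text


repaired = fix_mojibake(u"Example Characters : \xc3\xa9 \xc3\xa0")
untouched = fix_mojibake(u"Example Characters : \xe9 \xe0")
```

The guard matters because correctly decoded text such as u'\xe9' is not valid UTF-8 once re-encoded as Latin-1, so the unguarded one-liner would raise an exception on it.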

There is no need to be concerned about the u prefix; that is just a type indicator, to show you have a Unicode value.
