DavidK

Reputation: 2564

Python Encoding Error, not unicode string

How do I get rid of the "u" without causing other encoding problems?

u"Example Characters : \xc3\xa9 \xc3\xa0"

Here is what it prints:

Example Characters : Ã© Ã

Instead of:

Example Characters : é à

I encounter this problem when calling getText() on a BeautifulSoup element. (The webpage is in UTF-8.)

Upvotes: 0

Views: 508

Answers (2)

Martijn Pieters

Reputation: 1123480

You have a Mojibake (wrong decoding of the input).

You most likely passed a Unicode string to BeautifulSoup(). Don't do this, leave decoding to BeautifulSoup.

For example, if you used requests, pass response.content, not response.text, to BeautifulSoup(). Otherwise you run the risk of the result being decoded as Latin-1, the default encoding for text responses over HTTP without an explicit character set mentioned in the headers. If you used urllib2, don't decode first.
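
To illustrate why decoding too early causes this, here is a minimal Python 3 sketch using only the standard library. It makes no network request; raw_body is a made-up stand-in for the bytes a UTF-8 page would send, and the Latin-1 decode simulates the wrong guess described above.

```python
# Python 3 sketch. raw_body is a hypothetical stand-in for the bytes of a UTF-8 page.
raw_body = "Example Characters : é à".encode("utf-8")

# Wrong codec: Latin-1 maps every byte to one character, so each two-byte
# UTF-8 sequence becomes two separate characters (Mojibake).
wrong = raw_body.decode("latin-1")

# Right codec: the bytes are read back as the text that was written.
right = raw_body.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Handing the undecoded bytes to the parser avoids the first branch entirely, because no guess has been baked into the string yet.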

BeautifulSoup detects the encoding and decodes for you; it'll use HTML <meta> tags if present. UTF-8 should be autodetected correctly. If you know the encoding up front and BeautifulSoup got it wrong anyway, use from_encoding to specify the correct encoding:

soup = BeautifulSoup(htmlsource, from_encoding='utf8')

See the Encodings section of the BeautifulSoup documentation.

If, after all that, you are still getting Mojibake, then the web page itself has served incorrectly encoded data. In that case you can undo the error with:

mojibake_string.encode('latin1').decode('utf8')

This re-interprets the characters in the correct encoding:

>>> u"Example Characters : \xc3\xa9 \xc3\xa0".encode('latin1').decode('utf8')
u'Example Characters : \xe9 \xe0'
>>> print _
Example Characters : é à
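
The session above is Python 2. As a rough Python 3 sketch of the same round trip, the helper below wraps it so that text which is not this kind of Mojibake is returned unchanged. The name fix_mojibake is made up for illustration and is not part of BeautifulSoup or the standard library.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Hypothetical helper for illustration only.
    """
    try:
        # Map each character back to its original byte, then decode properly.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are not
        # valid UTF-8: it is not this kind of Mojibake, so leave it alone.
        return text


broken = "Example Characters : \xc3\xa9 \xc3\xa0"
print(fix_mojibake(broken))
```

The fallback is a design choice for this sketch: text that is already correct usually fails one of the two steps and passes through untouched, but a string that happens to round-trip cleanly would still be altered, so apply it only to data known to be damaged.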

There is no need to be concerned about the u prefix; that is just a type indicator, to show you have a Unicode value.

Upvotes: 4

tripleee

Reputation: 189679

The string you created unambiguously contains the Unicode characters U+00C3, U+00A9, and U+00A0. Their printed representation is the string you say you don't want.

Apparently you are trying to embed a UTF-8 string. That's a byte string (a plain '...' literal in Python 2, b'...' in Python 3.x), not a Unicode string (u'...'). To get the string you actually wanted, try:

"Example Characters : \xc3\xa9 \xc3\xa0".decode('utf-8')

which produces a Unicode string containing the actual characters you want.
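
The line above is Python 2 syntax. A minimal Python 3 equivalent, as a sketch with the variable names raw and text chosen only for illustration, marks the literal as bytes explicitly:

```python
# Python 3: the b prefix makes this a bytes object holding UTF-8 encoded data.
raw = b"Example Characters : \xc3\xa9 \xc3\xa0"

# Decoding turns the bytes into a str; each two-byte sequence becomes one character.
text = raw.decode("utf-8")

print(text)
```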

See also http://nedbatchelder.com/text/unipain.html

Upvotes: 0
