Reputation: 2564
How do I get rid of the "u" without introducing other encoding problems?
u"Example Characters : \xc3\xa9 \xc3\xa0"
Here is what it prints:
Example Characters : Ã© Ã
Instead of:
Example Characters : é à
I encounter this problem when using getText() on a BeautifulSoup element. (The webpage is in UTF-8.)
Upvotes: 0
Views: 508
Reputation: 1123480
You have a Mojibake (wrong decoding of the input).
You most likely passed a Unicode string to BeautifulSoup(). Don't do this; leave decoding to BeautifulSoup.
For example, if you used requests, use response.content, not response.text, to pass the HTML to BeautifulSoup(). Otherwise you run the risk of the result being decoded as Latin-1, the default encoding for text responses over HTTP without an explicit character set mentioned in the headers. If you used urllib2, don't decode first.
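To see why decoding too early causes the problem, here is a minimal Python 3 sketch that uses only the standard library. The raw_html bytes are an assumed stand-in for what response.content would hold, and the Latin-1 decode simulates the fallback described above; the variable names are illustrative and not part of requests or BeautifulSoup.

```python
# Stand-in for response.content: the raw UTF-8 bytes sent by the server.
# "\xe9 \xe0" is the text "é à".
raw_html = "<p>Example Characters : \xe9 \xe0</p>".encode("utf-8")

# Simulates text decoded with a Latin-1 fallback: every UTF-8 byte
# becomes its own character, which is the Mojibake from the question.
wrongly_decoded = raw_html.decode("latin-1")

# Decoding with the real encoding (or leaving it to the parser) is correct.
correctly_decoded = raw_html.decode("utf-8")

print(repr(wrongly_decoded))
print(correctly_decoded)
```

Passing the undecoded bytes to the parser avoids the first, lossy-looking path entirely.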
BeautifulSoup detects the encoding and decodes for you; it'll use HTML <meta> tags if present. UTF-8 should be autodetected correctly. If you know the encoding up front and BeautifulSoup got it wrong anyway, use from_encoding to specify the correct encoding:
soup = BeautifulSoup(htmlsource, from_encoding='utf8')
See the Encodings section of the BeautifulSoup documentation.
If after all that you are still getting Mojibake results then the web page itself has produced data with incorrectly encoded values. In that case you can undo the error with:
mojibake_string.encode('latin1').decode('utf8')
This re-interprets the characters in the correct encoding:
>>> u"Example Characters : \xc3\xa9 \xc3\xa0".encode('latin1').decode('utf8')
u'Example Characters : \xe9 \xe0'
>>> print _
Example Characters : é à
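The same repair can be wrapped in a small guard so that text which cannot be round-tripped is left alone. This is a sketch in Python 3 syntax (where plain string literals are already Unicode); fix_mojibake is a hypothetical helper name, not something BeautifulSoup provides.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1.

    Hypothetical helper: returns the input unchanged when the
    Latin-1 -> UTF-8 round trip fails.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = "Example Characters : \xc3\xa9 \xc3\xa0"
fixed = fix_mojibake(broken)
print(fixed)
```

The guard is only a heuristic: a string that legitimately contains a sequence such as U+00C3 U+00A9 would also be converted, so apply it only to data you know is affected.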
There is no need to be concerned about the u prefix; that is just a type indicator, to show you have a Unicode value.
Upvotes: 4
Reputation: 189679
The string you created unambiguously contains the Unicode characters U+00C3, U+00A9, and U+00A0. Their printed representation is the string you say you don't want.
Apparently you are trying to embed a UTF-8 string. That's a byte string (b'...' in Python 3.x), not a Unicode string (u'...'). To get the string you actually wanted, try:
"Example Characters : \xc3\xa9 \xc3\xa0".decode('utf-8')
which produces a Unicode string containing the actual characters you want.
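The line above relies on Python 2, where an unprefixed literal is a byte string. As a rough Python 3 equivalent, assuming the same escaped bytes, the literal needs an explicit b prefix before it can be decoded:

```python
# A bytes literal holds the raw UTF-8 encoding (note the b prefix).
utf8_bytes = b"Example Characters : \xc3\xa9 \xc3\xa0"

# Decoding turns the bytes into text (str in Python 3).
text = utf8_bytes.decode("utf-8")

# Each accented character takes two bytes but counts as one character.
print(type(utf8_bytes).__name__, len(utf8_bytes))
print(type(text).__name__, len(text))
print(text)
```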
See also http://nedbatchelder.com/text/unipain.html
Upvotes: 0