Reputation: 681
I'm using Beautiful Soup 3 and Python 2.7 to scrape UTF-8-encoded web pages that contain non-ASCII characters (umlauts). I'm getting the text that I want, but every Unicode character comes back as a two-byte sequence instead of the actual character. (The string is obtained by using soup.find() and converting the NavigableString result into a string with str().)
For example: I get Fahrvergnügen instead of Fahrvergnügen.
I've tried pretty much every permutation of encode('utf-8'), decode('utf-8'), and unicode(), but nothing returns the umlaut instead of the weird two-byte sequence.
I'm pretty sure there's a simple solution; I just can't figure out what to call to convert a BS NavigableString, or a plain old string containing Fahrvergnügen, into Fahrvergnügen, or to ensure that the weird two-byte sequences aren't returned in the first place.
BTW, ü is C3 BC in UTF-8; however, the code point for a lowercase u umlaut is U+00FC.
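For reference, the relationship between the code point and its UTF-8 bytes can be checked directly with the standard library (shown here in Python 3 syntax; the byte values are the same in Python 2):

```python
# U+00FC LATIN SMALL LETTER U WITH DIAERESIS
u_umlaut = "\u00fc"
print(hex(ord(u_umlaut)))              # the code point: 0xfc
print(u_umlaut.encode("utf-8").hex())  # the UTF-8 bytes: c3bc
```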
Upvotes: 0
Views: 637
Reputation: 189357
The characters you are looking at look like double-encoded UTF-8. If the input is hosed, there really isn't anything BeautifulSoup can do to rectify it.
BeautifulSoup basically returns Unicode always, which is just as it should be (unless you are actually into manipulating encodings, in which case it's a hopeless hassle).
It is possible, though unlikely, that BeautifulSoup is the source of the double encoding. If you are certain that the scraped page really is UTF-8, you can override its detected character set by passing fromEncoding='utf-8' when creating the BeautifulSoup object: BeautifulSoup(..., fromEncoding='utf-8').
"Fahrvergnügen" in UTF-8 is represented by the bytes 46 61 68 72 76 65 72 67 6e c3 bc 67 65 6e (hex) where c3 bc is the UTF-8 encoding of U+00FC.
When these bytes are incorrectly decoded as a legacy 8-bit encoding such as ISO-8859-1 (where 0xC3 is à and 0xBC is ¼) and then re-encoded as UTF-8, the result is 46 61 68 72 76 65 72 67 6e c3 83 c2 bc 67 65 6e, which is presumably what you are looking at.
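That double-encoding path can be reproduced in a few lines (Python 3 syntax shown; the byte values are the same in Python 2):

```python
original = "Fahrvergn\u00fcgen"             # Fahrvergnügen
utf8_bytes = original.encode("utf-8")       # ... 6e c3 bc 67 ...
# Wrongly decode the UTF-8 bytes as ISO-8859-1, then re-encode as UTF-8:
mojibake = utf8_bytes.decode("iso-8859-1")  # Fahrvergnügen
double_encoded = mojibake.encode("utf-8")   # ... 6e c3 83 c2 bc 67 ...
print(mojibake)
print(double_encoded.hex())
```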
You can revert this double encoding if you know precisely how the error happened, but it is not (straightforwardly) automatable: you need to examine each encoding error and figure out (or guess) which characters it was actually supposed to represent.
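If the intermediate misinterpretation really was ISO-8859-1 (an assumption; it could just as well have been, say, Windows-1252), the damage can be undone by reversing the two steps:

```python
broken = "Fahrvergn\u00c3\u00bcgen"   # the mojibake string: Fahrvergnügen
# Encode back to the raw bytes the server actually sent, then decode as UTF-8:
fixed = broken.encode("iso-8859-1").decode("utf-8")
print(fixed)  # Fahrvergnügen
```

Note that this raises UnicodeDecodeError if the string was not in fact double encoded, which is a useful sanity check in itself.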
Upvotes: 4