Reputation: 1011
I am doing some text processing on text scraped from the web. I was thinking of decoding the raw text first:
raw_html = raw_html.decode("iso-8859-1")
and later encoding it to UTF-8, so I would not have problems with the encoding:
raw_html = raw_html.encode("UTF-8")
The issue is that, despite knowing the web page's encoding, I keep getting errors in the decode step:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 302: ordinal not in range(128)
I will be dealing with many languages, but not that many web pages (hence my idea of setting the encoding manually), and I would like to be able to convert all the languages (English, French, Spanish, Portuguese) to a common base to work with. What would you suggest?
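The intended pipeline — decode once on input, work in Unicode, encode once on output — can be sketched as follows, in Python 3 terms where the bytes/str distinction is explicit (the byte string and variable names are illustrative):

```python
# Bytes as they might arrive from the network; 0xE9 is 'é' in ISO-8859-1.
raw_bytes = b"caf\xe9"

# Decode once at the boundary: bytes -> str (Unicode).
text = raw_bytes.decode("iso-8859-1")
assert text == "café"

# Encode only when writing out: str -> UTF-8 bytes.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes == b"caf\xc3\xa9"
```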
Upvotes: 0
Views: 181
Reputation: 1122082
If raw_html.decode() gives you an encoding exception, then it was already Unicode:
>>> u'é'.decode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
This is because Python 2 implicitly tries to encode first (with the default ASCII codec) when asked to decode a unicode value.
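One way to avoid decoding the same value twice is to decode only when you actually hold bytes. A minimal sketch in Python 3 terms (where str has no .decode method at all, so the double decode cannot even be expressed); ensure_text is an illustrative helper name, not part of any library API:

```python
def ensure_text(value, encoding="iso-8859-1"):
    """Decode only if the value is bytes; leave str (Unicode) untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Raw bytes get decoded; already-decoded text passes through unchanged.
assert ensure_text(b"caf\xe9") == "café"
assert ensure_text("café") == "café"
```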
Upvotes: 2