Reputation: 1011
I am doing some text processing on text scraped from the web. I was thinking of decoding the raw text first:
raw_html = raw_html.decode("iso-8859-1")
and later encoding it to UTF-8, so I would not have problems with the encoding:
raw_html = raw_html.encode("UTF-8")
The issue is that, despite knowing the web page's encoding, I keep getting errors in the decode step:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 302: ordinal not in range(128)
I will be dealing with many languages, but not that many web pages (hence my idea of setting the encoding manually), and I would like to be able to convert all the languages (English, French, Spanish, Portuguese) to a common base to work with. What would you suggest?
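The intended pipeline — decode once on input, work in Unicode, encode once on output — can be sketched as follows, in Python 3 terms where the bytes/str distinction is explicit (the byte string and variable names are illustrative):

```python
# Bytes as they might arrive from the network; 0xE9 is 'é' in ISO-8859-1.
raw_bytes = b"caf\xe9"

# Decode once at the boundary: bytes -> str (Unicode).
text = raw_bytes.decode("iso-8859-1")
assert text == "café"

# Encode only when writing out: str -> UTF-8 bytes.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes == b"caf\xc3\xa9"
```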
Upvotes: 0
Views: 181
Reputation: 1122082
If raw_html.decode() gives you an encoding exception, then it was already Unicode:
>>> u'é'.decode('latin1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
This is because Python 2 implicitly tries to encode first (with the default ASCII codec) when asked to decode a unicode value.
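One way to avoid decoding the same value twice is to decode only when you actually hold bytes. A minimal sketch in Python 3 terms (where str has no .decode method at all, so the double decode cannot even be expressed); ensure_text is an illustrative helper name, not part of any library API:

```python
def ensure_text(value, encoding="iso-8859-1"):
    """Decode only if the value is bytes; leave str (Unicode) untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Raw bytes get decoded; already-decoded text passes through unchanged.
assert ensure_text(b"caf\xe9") == "café"
assert ensure_text("café") == "café"
```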
Upvotes: 2