urllib encoding issues

Question

I'm developing a web crawler to automatically download some documents from a Brazilian website. The site uses an unknown encoding (no charset is defined in the head tag).

With very little effort, people can still read the documents. The real problem is that the page listing the documents uses links whose URLs contain accented characters. Without knowing the encoding of the page, when I retrieve it with urllib2.urlopen, those characters are all messed up.

For example, Í characters come through as the Cyrillic capital letter E.

I'm using BeautifulSoup, and prettify doesn't help, since urllib2 already returns the document with the bad characters.

And one more thing: soup.originalEncoding returns None.

How can I make urllib2.urlopen either recognize the charset or accept an "expected encoding", so that it returns the characters as they are displayed in the browser?

BigHandsome · Accepted Answer

The character set can be retrieved from the header. I would give you the code I use, but it is derived from How to download any(!) webpage with correct charset in python?, and the answer there does a far better job of explaining the process. So I will just point you there.
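As a rough illustration of reading the charset from the header, here is a minimal sketch. It is not the code from the linked question. It assumes Python 3 (urllib.request rather than urllib2) and assumes "the header" means the HTTP Content-Type response header. The helper names and the iso-8859-1 fallback are hypothetical choices for this example only.

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_body(raw, content_type, fallback="iso-8859-1"):
    """Decode raw bytes using the declared charset, else an assumed fallback."""
    charset = charset_from_content_type(content_type) or fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The server declared a charset name Python does not know.
        return raw.decode(fallback, errors="replace")


def fetch_text(url, fallback="iso-8859-1"):
    """Download a page and return it as text (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()
        content_type = response.headers.get("Content-Type", "")
    return decode_body(raw, content_type, fallback)
```

If the server sends no charset at all, the fallback is only a guess, so it may need adjusting for the site in question. In Python 2 with urllib2, the equivalent lookup is usually written as response.headers.getparam('charset'). The decoded text can then be passed to BeautifulSoup instead of the raw bytes.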
