Process a page with utf-8

Question

I am just playing with urllib2 and pages with utf-8.

http://www.columbia.edu/~fdc/utf8/

I am only fetching the first 700 bytes (the top segment):

>>> import urllib2
>>> from urllib2 import HTTPError, URLError
>>> import BaseHTTPServer
>>> opener = urllib2.OpenerDirector()
>>> opener.add_handler(urllib2.HTTPHandler())
>>> opener.add_handler(urllib2.HTTPDefaultErrorHandler())
>>> response = opener.open('http://www.columbia.edu/~fdc/utf8/')
>>> content = response.read(700)

From here, I would expect the string in content to be UTF-8 encoded and to display fine.

However:

>>> content
'




UTF-8 Sampler


UTF-8 SAMPLER

  \xc2\xa5 \xc2\xb7 \xc2\xa3 \xc2\xb7 \xe2\x82\xac \xc2\xb7 $ \xc2\xb7 \xc2\xa2 \xc2\xb7 \xe2\x82\xa1 \xc2\xb7 \xe2\x82\xa2 \xc2\xb7 \xe2\x82\xa3 \xc2\xb7 \xe2\x82\xa4 \xc2\xb7 \xe2\x82\xa5 \xc2\xb7 \xe2\x82\xa6 \xc2\xb7 \xe2\x82\xa7 \xc2\xb7 \xe2\x82\xa8 \xc2\xb7 \xe2\x82\xa9 \xc2\xb7 \xe2\x82\xaa \xc2\xb7 \xe2\x82\xab \xc2\xb7 \xe2\x82\xad \xc2\xb7 \xe2\x82\xae \xc2\xb7 \xe2\x82\xaf \xc2\xb7 &#x20B9





Frank da Cruz



It seems to be HTML-escaped, so I tried:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape(content)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 390, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)


So I don't understand.
I even tried calling .encode('utf-8') before unescaping, but got a similar error.

What is the best way to display utf-8 content from a website?

Martijn Pieters · Accepted Answer

You need to decode the page from UTF-8 to Unicode; there are UTF-8 sequences in there (next to non-breaking-space HTML entities):

>>> print h.unescape(content.decode('utf8'))





UTF-8 Sampler


UTF-8 SAMPLER

  ¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ · ₯ · &#x20B9





Frank da Cruz



You got encoding and decoding confused; the content is already UTF-8 encoded.
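
As a minimal sketch of the difference (the byte literal is just the first few currency signs copied from the output in the question, and the variable names are only illustrative): decoding turns UTF-8 bytes into a Unicode string, and encoding goes the other way.

```python
# Illustrative bytes: a few characters from the page, UTF-8 encoded.
raw = b'\xc2\xa5 \xc2\xb7 \xc2\xa3 \xc2\xb7 \xe2\x82\xac'

# Decoding turns UTF-8 bytes into a Unicode string.
text = raw.decode('utf-8')

# Encoding is the reverse step and reproduces the original bytes.
round_trip = text.encode('utf-8')

# Treating the same bytes as ASCII fails on the first non-ASCII byte.
try:
    raw.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)
```

In Python 2, calling .encode('utf-8') on a byte string first decodes it implicitly as ASCII, which is why that attempt raised the same UnicodeDecodeError as the traceback in the question.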

Note that the &#x20B9 at the end is an error in the page itself: the closing ; was omitted. An HTML5 parser or browser would probably assume the ; and decode it anyway, but HTMLParser.unescape() requires the ; and leaves the reference untouched:

>>> print h.unescape('&#x20B9')
&#x20B9


You'd have to fix those entities with a regular expression first:

>>> import re
>>> brokenrefs = re.compile(r'(&#x?[a-f0-9]+)(?![a-f0-9;])', re.I)
>>> print h.unescape(brokenrefs.sub(r'\1;', content.decode('utf8')))





UTF-8 Sampler


UTF-8 SAMPLER

  ¥ · £ · € · $ · ¢ · ₡ · ₢ · ₣ · ₤ · ₥ · ₦ · ₧ · ₨ · ₩ · ₪ · ₫ · ₭ · ₮ · ₯ · ₹





Frank da Cruz
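
Put together, the whole approach could be wrapped up roughly like this. It is only a sketch: fix_broken_refs and decode_page are hypothetical helper names, the sample bytes are made up to resemble the page, and the pattern is a slightly stricter variant that leaves references already ending in ; alone.

```python
import re

try:
    from html import unescape            # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser    # Python 2, as in the session above
    unescape = HTMLParser().unescape

# Numeric character reference (hex or decimal) with no closing ';'.
# The lookaheads keep the pattern from matching a prefix of a reference
# that is already well formed.
BROKEN_REF = re.compile(
    r'(&#(?:x[0-9a-f]+(?![0-9a-f;])|[0-9]+(?![0-9;])))', re.I)


def fix_broken_refs(text):
    """Append the missing ';' to unterminated numeric references."""
    return BROKEN_REF.sub(r'\1;', text)


def decode_page(raw_bytes, encoding='utf-8'):
    """Decode the bytes, repair numeric references, then unescape entities."""
    return unescape(fix_broken_refs(raw_bytes.decode(encoding)))


# Made-up sample: UTF-8 bytes, named entities, and one broken reference.
sample = b'&nbsp;\xc2\xa5 &middot; &#x20B9\n'
result = decode_page(sample)
```

As far as I know, html.unescape() on Python 3 follows the HTML5 rules and accepts numeric references without the trailing ;, so the repair step mainly matters for the Python 2 HTMLParser used here.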
