Reputation: 5013
I'm using HTMLParser to parse pages I pull down with urllib, and am coming across UnicodeDecodeError
exceptions when passing some to HTMLParser
.
I tried using chardet
to detect the encodings and to convert to ascii
, or utf-8
(the docs don't seem to say what it should be). lossiness is acceptable, but while the decode/encode lines work just fine, I always get the error after self.feed().
The information is there if I just print
it out.
from HTMLParser import HTMLParser
import urllib
import chardet
class search_youtube(HTMLParser):
def __init__(self, search_terms):
HTMLParser.__init__(self)
self.track_ids = []
for search in search_terms:
self.__in_result = False
search = urllib.quote_plus(search)
query = 'http://youtube.com/results?search_query='
page = urllib.urlopen(query + search).read()
try:
self.feed(page)
except UnicodeDecodeError:
encoding = chardet.detect(page)['encoding']
if encoding != 'unicode':
page = page.decode(encoding)
page = page.encode('ascii', 'ignore')
self.feed(page)
print 'success'
searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids
here's the output:
Traceback (most recent call last):
File "test.py", line 27, in <module>
results = search_youtube(searches)
File "test.py", line 23, in __init__
self.feed(page)
File "/usr/lib/python2.6/HTMLParser.py", line 108, in feed
self.goahead(0)
File "/usr/lib/python2.6/HTMLParser.py", line 148, in goahead
k = self.parse_starttag(i)
File "/usr/lib/python2.6/HTMLParser.py", line 252, in parse_starttag
attrvalue = self.unescape(attrvalue)
File "/usr/lib/python2.6/HTMLParser.py", line 390, in unescape
return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
File "/usr/lib/python2.6/re.py", line 151, in sub
return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
Upvotes: 12
Views: 14210
Reputation: 172249
It is UTF-8, indeed. This works:
from HTMLParser import HTMLParser
import urllib
class search_youtube(HTMLParser):
def __init__(self, search_terms):
HTMLParser.__init__(self)
self.track_ids = []
for search in search_terms:
self.__in_result = False
search = urllib.quote_plus(search)
query = 'http://youtube.com/results?search_query='
connection = urllib.urlopen(query + search)
encoding = connection.headers.getparam('charset')
page = connection.read().decode(encoding)
self.feed(page)
print 'success'
searches = ['telepopmusik breathe']
results = search_youtube(searches)
print results.track_ids
You don't need chardet, Youtube are not morons, they actually send the correct encoding in the header.
Upvotes: 18
Reputation: 82934
What encoding does chardet say it is?
Please explain "The information is there if I just print it out": what is "it"? If you can read it and it makes sense when you print it to your console, then it must be in the usual/default encoding for your system; what is that? What operating system? What locale?
Can you give us a typical URL to make a query so that we can inspect for ourselves what you are seeing?
At one place in your code, you decode your output, then immediately smash it by using .encode('ascii', 'ignore')
; why?
Upvotes: 1