Reputation: 774
I am iterating through every Wikipedia page that deals with a date (January 1, January 2, ..., December 31). On each page, I extract the names of people who have a birthday on that day. However, partway through the run (at April 27), I receive this warning:
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Then, I get an error right away:
Traceback (most recent call last):
  File "wikipedia.py", line 29, in <module>
    section = soup.find('span', id='Births').parent
AttributeError: 'NoneType' object has no attribute 'parent'
Basically, I can't figure out why, after getting all the way to April 27, it throws this warning and error. Here is the April 27 page:
From what I can tell, nothing is different there that would make this happen. There is still a span with id="Births".
Here's my code where I call all that stuff:
import urllib2
from bs4 import BeautifulSoup

# a is the month name and b is the day of the month
site = "http://en.wikipedia.org/wiki/"+a+"_"+str(b)
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
section = soup.find('span', id='Births').parent
births = section.find_next('ul').find_all('li')
for x in births:
    # All the regex and parsing, don't think it's necessary to show
    pass
The error is thrown on the line that reads:
section = soup.find('span', id='Births').parent
I do have a lot of information by the time I get to April 27 (8 lists of ~35,000 elements each), but I don't think that would be the issue. If anyone has any ideas, I'd appreciate it. Thanks
Upvotes: 1
Views: 3295
Reputation: 37344
It looks like the Wikipedia server is providing that page gzipped:
>>> page.info().get('Content-Encoding')
'gzip'
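It's not supposed to do that without an Accept-Encoding header in your request, but, well, that's life when working with other people's servers. BeautifulSoup is then handed compressed bytes rather than HTML, which would explain both the decoding warning and soup.find returning None.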
There are a lot of sources out there showing how to work with gzipped data - here's one: http://www.diveintopython.net/http_web_services/gzip_compression.html
And here's another: Does python urllib2 automatically uncompress gzip data fetched from webpage?
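For what it's worth, here is a minimal sketch of the check-and-decompress step, assuming the body has already been read into memory. The helper name decode_body is made up for illustration; gzip.GzipFile and io.BytesIO are standard library.

```python
import gzip
import io


def decode_body(raw_bytes, content_encoding):
    """Return the response body, gunzipped if the server marked it as gzip.

    raw_bytes: what page.read() returned
    content_encoding: what page.info().get('Content-Encoding') returned (may be None)
    decode_body is a hypothetical helper name, not part of urllib2 or BeautifulSoup.
    """
    if content_encoding and content_encoding.strip().lower() == 'gzip':
        # GzipFile needs a file-like object, so wrap the bytes in BytesIO.
        return gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)).read()
    # Anything else is passed through untouched.
    return raw_bytes


# Intended use with the code from the question (not run here, needs network access):
#   page = urllib2.urlopen(req)
#   html = decode_body(page.read(), page.info().get('Content-Encoding'))
#   soup = BeautifulSoup(html)
```

It would probably also be worth checking whether soup.find('span', id='Births') returned None before touching .parent, so that one odd page gets skipped or logged instead of killing the whole run.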
Upvotes: 4