Alex Chumbley

Reputation: 774

Beautiful Soup, gets warning and then error halfway through code

I am iterating through every Wikipedia page that deals with a date (January 1, January 2, ..., December 31). On each page, I am taking out the names of people who have a birthday on that day. However, halfway through my run (April 27), I receive this warning:

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

Then, I get an error right away:

Traceback (most recent call last):
    File "wikipedia.py", line 29, in <module>
        section = soup.find('span', id='Births').parent
AttributeError: 'NoneType' object has no attribute 'parent'

Basically, I can't figure out why, after getting all the way to April 27, it decides to throw this warning and error. Here is the April 27 page:

April 27...

From what I can tell, nothing is different there that would make this happen this way. There is still a span with id="Births".

Here's my code where I call all that stuff:

    import urllib2
    from bs4 import BeautifulSoup

    # a is the month name, b is the day number (set in the surrounding loop)
    site = "http://en.wikipedia.org/wiki/" + a + "_" + str(b)
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)

    section = soup.find('span', id='Births').parent
    births = section.find_next('ul').find_all('li')

    for x in births:
        #All the regex and parsing, don't think it's necessary to show

The error is thrown on the line that reads:

section = soup.find('span', id='Births').parent

I do have a lot of information by the time I get to April 27 (8 lists of ~35,000 elements each), but I don't think that would be the issue. If anyone has any ideas, I'd appreciate it. Thanks

Upvotes: 1

Views: 3295

Answers (1)

Peter DeGlopper

Reputation: 37344

It looks like the Wikipedia server is providing that page gzipped:

>>> page.info().get('Content-Encoding')
'gzip'

It's not supposed to do that without an Accept-Encoding header in your request, but, well, that's life when working with other people's servers.

There are a lot of sources out there showing how to work with gzipped data - here's one: http://www.diveintopython.net/http_web_services/gzip_compression.html

And here's another: Does python urllib2 automatically uncompress gzip data fetched from webpage?
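If you'd rather handle it inline, a minimal sketch of the decompression step might look like the following. The `decode_body` helper name is mine, not part of urllib2; the idea is just to check the `Content-Encoding` header (as shown above) and gunzip the body when needed before handing it to BeautifulSoup:

```python
import gzip
import io

def decode_body(raw, content_encoding):
    # If the server reports gzip encoding, gunzip the body;
    # otherwise pass the bytes through untouched.
    if content_encoding == 'gzip':
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw
```

In your loop you would then call something like `decode_body(page.read(), page.info().get('Content-Encoding'))` and pass the result to `BeautifulSoup`. (Modern libraries such as `requests` handle this transparently, which is one reason they're often recommended over raw urllib2.)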

Upvotes: 4
