Reputation: 7233
I'm scraping a website that uses cookies. The site provides multiple drop-down menus, and I'm iterating through each option and re-capturing the session cookies with every request. The code runs just fine for a while, but then I randomly get a 503 error.
My code inserts data into a PostgreSQL database, and to help emphasize the randomness of this error I want to share that I've received the 503 after inserting as few as 1200 entries (rows) and as many as 4200. There doesn't seem to be any pattern to when this exception is raised. I can't make sense of it.
If it helps, here is a portion of my code:
# -*- coding: utf-8 -*-
import scrape_tools
import psycopg2
import psycopg2.extras
import urllib
import urllib2
import json
import cookielib
import time

tools = scrape_tools.tool_box()
db = tools.db_connect()
psycopg2.extras.register_hstore(db)
cursor = db.cursor(cursor_factory=psycopg2.extras.RealDictCursor)

cookiejar = cookielib.CookieJar()
opener = urllib2.build_opener(
    urllib2.HTTPRedirectHandler(),
    urllib2.HTTPHandler(debuglevel=0),
    urllib2.HTTPSHandler(debuglevel=0),
    urllib2.HTTPCookieProcessor(cookiejar),
)

url = 'http://www.website.com/'
soup = tools.request(url)
type_select = soup('select', {'id': 'type'})
for option_tag in type_select:
    select_option = option_tag('option')
    for option_contents in select_option:
        if 'Select' in option_contents.contents[0]:
            continue
        type = option_contents.contents[0]
        type_val = option_contents['value']
        print 'Type', type
        get_more_url = 'http://www.website.com/' + type_val
        request2 = urllib2.Request(get_more_url)
        fp2 = opener.open(request2)
        html2_object = fp2.read()
        json_result = json.loads(html2_object)
        for json_dict in json_result:
            for json_key in json_dict:
                if len(json_key) == 0:
                    continue
                more_data = json_dict[json_key]
                print ' ', more_data
(Out of courtesy, I'll stop here.)
(Please note, scrape_tools is a custom module.)
Am I missing something with cookie storage? Am I missing something obvious? I can't seem to figure out why this is happening. I've 'googled', 'stackoverflowed', etc. for hours trying to find somebody with similar issues, but haven't found anything.
I've also used selenium to scrape data in the past and have that in my pocket as a last resort, but this project is huge and I'd rather not have Firefox eating up memory on the server for a week.
Upvotes: 3
Views: 4764
Reputation: 8501
HTTP status 503, "Service Unavailable", means that for some reason the server wasn't able to process your request, but it's usually a transient error. If you wait a bit and retry the same request, it will probably work.
You do need to be able to handle this kind of transient failure in large-scale scraping jobs, because the Internet is full of transient errors: connections fail or are dropped all the time. A simple retry policy is usually all you need, though.
Status 503 could also specifically mean that you're requesting pages too quickly. If you don't have a delay between page fetches, you should add one, for politeness' sake.
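For example, here is a minimal sketch of a retry wrapper around the urllib2 opener from your code, combined with a politeness delay. The fetch_with_retry name, the retry count, and the delay values are illustrative assumptions, not fixed requirements:

import time
import urllib2

def fetch_with_retry(opener, request, max_retries=5, base_delay=2.0):
    # Try the request up to max_retries times, backing off
    # exponentially on transient server-side errors.
    for attempt in range(max_retries):
        try:
            return opener.open(request)
        except urllib2.HTTPError as e:
            # 502/503/504 are typically transient; anything else
            # (or the final failed attempt) is re-raised to the caller.
            if e.code not in (502, 503, 504) or attempt == max_retries - 1:
                raise
            wait = base_delay * (2 ** attempt)
            print 'Got HTTP %d, retrying in %.1f seconds' % (e.code, wait)
            time.sleep(wait)

# In your scraping loop, replace opener.open(request2) with:
#     fp2 = fetch_with_retry(opener, request2)
# and sleep briefly between fetches for politeness:
#     time.sleep(1)

The exact numbers matter less than the shape: exponential backoff gives an overloaded server progressively more breathing room instead of hammering it at a fixed rate.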
Upvotes: 5